
Researchers seek to assess the ‘reasoning power’ of AI scientists

Error theory can mean vastly different things depending on the context – what does it mean for science and AI?

A scientist is performing microbial colony counting. Image by Tim Sandle

A scientist based at the University of Exeter is working to ensure that AI can better help researchers make correct decisions safely, with the aim of ushering in rapid advances in vital societal fields such as drug discovery and producing meaningful results at a fraction of current costs.

To aid this endeavour, Stephan Guttinger has secured a Research Leadership Award from the Leverhulme Trust to explore the reasoning abilities of “AI Scientists”. These ‘scientists’ are not people working in the field; rather, the term refers to AI-based systems that can autonomously perform research tasks and solve scientific problems. These emerging, AI-driven software systems are designed to conduct the entire scientific research process autonomously. Such software agents, often powered by Large Language Models, can come up with ideas, review literature, write code, run experiments, and create scientific papers.

Addressing experimental error

It is hoped that, in the longer term, the deployment of these systems will support research. To work autonomously, however, these new tools need to be able to handle the most common problem in everyday research practice: identifying and dealing with experimental error and suggesting possible solutions.

Dr Guttinger, a lecturer in Philosophy of Data at the Department of Social and Political Sciences, Philosophy and Anthropology, has been awarded the Leverhulme funding to assemble an interdisciplinary team with the brief of exploring the error-reasoning ability of AI Scientists.

The four-year project will bring together experts in philosophy, the natural sciences, and computer science. At the heart of the project is the development of a benchmark that can systematically probe how well current or future AI models can deal with error in scientific practice. The benchmark is needed because it is not yet clear how good existing AI systems are at scientific error-reasoning.

Dr Guttinger says: “Scientific error-reasoning has not been widely or deeply datafied: Scientists work through errors in weekly laboratory meetings, on whiteboards, or in the hallways of a conference venue. These discussions rarely find their way into published materials and are thus underrepresented in the data on which AI models are trained.”

As to what needs to be done, Guttinger explains: “To address this uncertainty, we need benchmarks that allow us to assess the extent to which AI models can reason about scientific error. However, even our most sophisticated benchmarks for AI agents don’t currently test for this type of reasoning.”

He adds that a key challenge for building an error-reasoning benchmark is the lack of a well-developed theory of error.

Guttinger clarifies: “Developing effective benchmarks requires a good understanding of error-reasoning in science: what are the types of errors scientists encounter and what are the strategies they usually deploy to address them? Unfortunately, we still lack a systematic and comprehensive theory of error in science.”

An error theory for science

The first goal of the project is to build a detailed error theory for science, which the team will then use to assemble a systematic database of error types and strategies in science.

What is scientific error? Here, there are contrasting approaches. One type of error, say University of Groningen researchers, results from bias: it influences scientific output through factors unrelated to scientific content, such as career prospects, funding opportunities and the peer-review process. The other type results from mistakes: inaccuracies in the research process itself.
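To make that distinction concrete, a minimal sketch of how such an error taxonomy might be encoded in a database of the kind the project plans is shown below. All names, fields and example entries here are illustrative assumptions, not the team’s actual design or data.

```python
from dataclasses import dataclass
from enum import Enum

class ErrorSource(Enum):
    """The two broad error types contrasted by the Groningen researchers."""
    BIAS = "bias"        # extraneous factors: careers, funding, peer review
    MISTAKE = "mistake"  # inaccuracies in the research process itself

@dataclass
class ErrorCase:
    """One entry in a hypothetical database of scientific error types."""
    description: str    # what went wrong
    source: ErrorSource # bias-driven or mistake-driven
    mitigation: str     # a strategy scientists typically deploy

# Invented examples for illustration only
cases = [
    ErrorCase("Positive results favoured to improve publication odds",
              ErrorSource.BIAS, "pre-registration of hypotheses"),
    ErrorCase("Contaminated reagent skews colony counts",
              ErrorSource.MISTAKE, "re-run assay with fresh controls"),
]
```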

This database will be used to develop two benchmarks: a traditional benchmark containing more than 500 question-answer pairs to test the error-reasoning ability of isolated AI agents, and a second benchmark designed to assess human-AI teams. Together, these will allow the team to assess different aspects of the error-reasoning process in science.
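As an illustration only, a benchmark of this kind could store question-answer pairs and score a model’s error diagnoses against reference answers. The sketch below assumes a generic `ask_model` callable and invented item fields; none of this comes from the project itself, and the exact-match scoring is a deliberate simplification.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchmarkItem:
    """One hypothetical question-answer pair probing error-reasoning."""
    scenario: str   # experimental setup and the anomalous result
    question: str   # e.g. "What is the most likely source of error?"
    reference: str  # the expected diagnosis

def evaluate(items: list[BenchmarkItem],
             ask_model: Callable[[str], str]) -> float:
    """Return the fraction of items the model diagnoses correctly.

    A real benchmark would need graded or expert scoring of free-text
    diagnoses rather than an exact string match.
    """
    correct = 0
    for item in items:
        prompt = f"{item.scenario}\n{item.question}"
        answer = ask_model(prompt).strip().lower()
        correct += answer == item.reference.strip().lower()
    return correct / len(items)
```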

Guttinger concludes: “Our goal is to develop the conceptual and mathematical tools, as well as the data that we need to assess how AI Scientists work through the problem of error, be that on their own or in collaboration with human researchers. The project will thus establish the foundations required for the reliable and trustworthy development of scientific AI agents.”

Pinpointing error serves the general scientific good: the quality of knowledge deepens, even if some earlier concepts are abandoned as “wrong.”

Written By

Dr. Tim Sandle is Digital Journal's Editor-at-Large for science news. Tim specializes in science, technology, environmental, business, and health journalism. He is additionally a practising microbiologist and an author. He is also interested in history, politics and current affairs.
