Scientists from the University of California, San Francisco (UCSF) have been testing whether generative AI can handle complex medical datasets at the same performance level as human experts. The work adds to a growing body of research exploring what AI can offer medicine.
The results showed that, in some cases, the AI matched or outperformed teams that had spent months building prediction models. By generating usable analytical code from precise prompts, the AI systems reduced the time needed to process health data. The findings hint at a future where AI helps scientists move faster from data to discovery.
AI vs human
To compare performance directly, researchers assigned identical tasks to different groups. Some teams relied entirely on human expertise, while in others scientists worked alongside AI tools; several such tools were put to the test.
The challenge was to predict preterm birth using data from more than 1,000 pregnant women.
The AI system generated functioning computer code in minutes — something that would normally take experienced programmers several hours or even days.
The advantage came from AI’s ability to write analytical code based on short but highly specific prompts. Not every system performed well. Only 4 of the 8 AI chatbots tested produced usable code. Still, those that succeeded did not require large teams of specialists to guide them.
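To make the idea concrete, a short but specific prompt such as "fit a model that predicts preterm birth from microbial abundance features" might yield code along these lines. This is an illustrative sketch on synthetic data, not the study's actual prompts or generated code:

```python
import numpy as np

# Synthetic stand-in for microbiome abundance features (not the study's data).
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = 2.0                      # assume only a few taxa carry signal
p = 1 / (1 + np.exp(-(X @ true_w)))
y = (rng.uniform(size=n) < p).astype(float)  # 1 = preterm, 0 = term (synthetic)

# Logistic regression fitted by plain gradient descent.
w = np.zeros(d)
lr = 0.1
for _ in range(500):
    pred = 1 / (1 + np.exp(-(X @ w)))
    w -= lr * (X.T @ (pred - y) / n)

acc = np.mean(((1 / (1 + np.exp(-(X @ w)))) > 0.5) == (y == 1))
print(f"training accuracy: {acc:.2f}")
```

The point is not the model itself, which is deliberately minimal, but that a well-specified natural-language request maps cleanly onto a short, runnable analysis script.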
Why this matters
Speeding up data analysis could improve diagnostic tools for preterm birth — the leading cause of newborn death and a major contributor to long-term motor and cognitive challenges in children. In the U.S., around 1,000 babies are born prematurely each day.
Researchers still do not fully understand what causes preterm birth. To investigate possible risk factors, the team led by UCSF's Marina Sirota compiled microbiome data from about 1,200 pregnant women whose outcomes were tracked across nine separate studies.
Advanced AI
Developing AI capable of analysing a vast and complex dataset – such as details of every pregnancy in the U.S. over a period of time – proved challenging. To tackle this, the researchers turned to a global crowdsourcing competition called DREAM (Dialogue on Reverse Engineering Assessment and Methods).
One of the DREAM pregnancy challenges focused specifically on vaginal microbiome data. More than 100 teams worldwide participated, developing machine learning models designed to detect patterns linked to preterm birth. Most groups completed their work within the three-month competition window. Yet it took nearly two years to consolidate the findings and publish them.
To determine if generative AI could shorten that timeline, the researchers instructed eight AI systems to independently generate algorithms using the same datasets from the three DREAM challenges, without direct human coding.
The AI chatbots received carefully written natural language instructions. Much like ChatGPT, the systems were guided through detailed prompts designed to steer them toward analysing the health data in ways comparable to the original DREAM participants.
The AI systems analysed vaginal microbiome data to identify signs of preterm birth and examined blood or placental samples to estimate gestational age. Pregnancy dating is almost always an estimate, yet it determines the type of care women receive as their pregnancies progress. When estimates are inaccurate, preparing for labour becomes more difficult.
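Estimating gestational age from blood or placental samples is, at heart, a regression problem. The sketch below uses synthetic "molecular" features and a simple ridge regression; it is a hedged illustration of the task, not the study's actual models or data:

```python
import numpy as np

# Synthetic features standing in for blood/placental measurements.
rng = np.random.default_rng(1)
n, d = 150, 8
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
ga_weeks = 30 + X @ true_w + rng.normal(scale=1.0, size=n)  # gestational age, weeks

# Ridge regression in closed form: w = (X'X + lam*I)^-1 X'y, with an intercept.
Xc = np.column_stack([np.ones(n), X])
lam = 1.0
w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d + 1), Xc.T @ ga_weeks)

mae = np.mean(np.abs(Xc @ w - ga_weeks))
print(f"mean absolute error: {mae:.2f} weeks")
```

An error of even a week or two matters clinically, which is why the challenge scored models on how closely their estimates tracked the recorded gestational ages.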
Researchers then ran the AI-generated code on the DREAM datasets. Only 4 of the 8 tools produced models that matched the performance of the human teams, and in some cases the AI models performed better. The entire generative AI effort — from inception to submission of a paper — took just six months.
Human oversight remains necessary
The scientists emphasise that AI still requires careful oversight. These systems can produce misleading results, and human expertise remains essential. However, by rapidly sorting through massive health datasets, generative AI may allow researchers to spend less time troubleshooting code and more time interpreting results and asking meaningful scientific questions.
The research appears in the journal Cell Reports Medicine, titled “Benchmarking large language models for predictive modeling in biomedical research with a focus on reproductive health.”
