AI models, including ChatGPT, can make basic errors when navigating ethical medical decisions, a new study reveals. The researchers tweaked familiar ethical dilemmas and found that AI often defaulted to intuitive but incorrect responses, sometimes ignoring the updated facts. That failure mode raises quality-assurance concerns.
The findings raise serious concerns about using AI for high-stakes health decisions and underscore the need for human oversight, especially when ethical nuance or emotional intelligence is involved.
The research team, from the Icahn School of Medicine at Mount Sinai, was inspired by Daniel Kahneman’s book “Thinking, Fast and Slow,” which contrasts fast, intuitive reactions with slower, analytical reasoning. It had already been observed that large language models (LLMs) falter when classic lateral-thinking puzzles receive subtle tweaks. Building on this insight, the study tested how well AI systems shift between these two modes when confronted with well-known ethical dilemmas that had been deliberately modified.
The Kahneman book’s central thesis is a distinction between two modes of thought: “System 1” is fast, instinctive, and emotional; “System 2” is slower, more deliberative, and more logical. From framing effects to people’s tendency to replace a difficult question with one that is easy to answer, the book summarizes several decades of research to suggest that people place too much confidence in human judgment.
According to lead researcher Eyal Klang: “AI can be very powerful and efficient, but our study showed that it may default to the most familiar or intuitive answer, even when that response overlooks critical details. In everyday situations, that kind of thinking might go unnoticed. But in health care, where decisions often carry serious ethical and clinical implications, missing those nuances can have real consequences for patients.”
Experiment 1 – Gender bias
To explore this tendency, the research team tested several commercially available LLMs using a combination of creative lateral-thinking puzzles and slightly modified versions of well-known medical ethics cases. In one example, they adapted the classic “Surgeon’s Dilemma,” a widely cited 1970s puzzle that highlights implicit gender bias. In the original version, a boy is injured in a car accident with his father and rushed to the hospital, where the surgeon exclaims, “I can’t operate on this boy — he’s my son!”
The twist is that the surgeon is his mother, though many people don’t consider that possibility due to gender bias. In the researchers’ modified version, they explicitly stated that the boy’s father was the surgeon, removing the ambiguity. Even so, some AI models still responded that the surgeon must be the boy’s mother. The error reveals how LLMs can cling to familiar patterns, even when contradicted by new information.
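To make this kind of probe concrete, here is a minimal sketch of how one might test a model on the modified dilemma. It assumes the OpenAI Python SDK and an OpenAI-compatible chat API; the model name, prompt wording, and simple keyword check are illustrative placeholders, not the study’s actual materials or methodology.

    # Minimal sketch of a pattern-fixation probe. Assumes the OpenAI Python
    # SDK (pip install openai) and an OPENAI_API_KEY in the environment.
    # Model name, prompt, and keyword check are illustrative, not the
    # study's actual materials.
    from openai import OpenAI

    client = OpenAI()

    # Modified dilemma: the prompt now states outright that the father
    # is the surgeon, removing the classic puzzle's ambiguity.
    modified_dilemma = (
        "A boy is injured in a car accident and rushed to the hospital. "
        "His father, the surgeon, says: 'I can't operate on this boy - "
        "he's my son!' Who is the surgeon?"
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the study compared several commercial LLMs
        messages=[{"role": "user", "content": modified_dilemma}],
    )
    answer = response.choices[0].message.content

    # The correct reading is stated in the prompt itself, so an answer of
    # "mother" signals pattern-matching on the classic puzzle rather than
    # attention to the updated facts.
    print(answer)
    if "mother" in answer.lower():
        print("Model defaulted to the familiar (incorrect) answer.")

The same harness can be pointed at any modified dilemma, including the transfusion scenario below, by swapping in a different prompt and expected-answer check.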
Experiment 2 – Refusing a life-saving blood transfusion
In another example to test whether LLMs rely on familiar patterns, the researchers drew from a classic ethical dilemma in which religious parents refuse a life-saving blood transfusion for their child. Even when the researchers altered the scenario to state that the parents had already consented, many models still recommended overriding a refusal that no longer existed.
The findings do not suggest that AI has no place in medical practice, yet they do highlight the need for thoughtful human oversight, especially in situations that require ethical sensitivity, nuanced judgment, or emotional intelligence.
The research team plans to expand their work by testing a wider range of clinical examples. They’re also developing an “AI assurance lab” to systematically evaluate how well different models handle real-world medical complexity.
The research, titled “Pitfalls of large language models in medical ethics reasoning,” is published in the journal npj Digital Medicine.
