They may be presented as ‘state-of-the-art’ artificial intelligence systems, but large language models (LLMs) are poor medical coders. This is according to researchers at the Icahn School of Medicine at Mount Sinai.
The study underscores the need to refine and validate these technologies before considering them for clinical implementation.
The researchers extracted more than 27,000 unique diagnosis and procedure codes from 12 months of routine care in the Mount Sinai Health System, excluding identifiable patient data. Using the description for each code, they prompted models from OpenAI, Google, and Meta to output the most accurate medical codes. The generated codes were compared with the originals, and errors were analyzed for patterns.
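The comparison amounts to a round-trip test: give a model a code's official description and check whether it returns the original code. Below is a minimal sketch of that loop, assuming the OpenAI chat completions client; the model name, prompt wording, and the query_code and exact_match_rate helpers are illustrative assumptions, not the study's actual protocol.

```python
# A minimal sketch of the round-trip benchmark described above.
# Model name, prompt wording, and helper names are illustrative
# assumptions, not the study's actual protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def query_code(description: str, code_system: str) -> str:
    """Ask the model to map an official code description back to its code."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Give the exact {code_system} code for this description. "
                f"Reply with the code only.\n\n{description}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()

def exact_match_rate(pairs: list[tuple[str, str]], code_system: str) -> float:
    """Fraction of (original_code, description) pairs the model reproduces exactly."""
    hits = sum(
        query_code(description, code_system) == original
        for original, description in pairs
    )
    return hits / len(pairs)
```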
The investigators reported that all of the studied large language models, including GPT-4, GPT-3.5, Gemini-pro, and Llama-2-70b, showed limited accuracy (below 50 percent) in reproducing the original medical codes, highlighting a significant gap in their usefulness for medical coding.
Of the different technologies assessed, GPT-4 demonstrated the best performance, with the highest exact match rates for ICD-9-CM (45.9 percent), ICD-10-CM (33.9 percent), and CPT codes (49.8 percent).
GPT-4 also produced the highest proportion of incorrectly generated codes that still conveyed the correct meaning. For example, when given the ICD-9-CM description “nodular prostate without urinary obstruction,” GPT-4 generated a code for “nodular prostate,” showcasing its comparatively nuanced understanding of medical terminology.
However, even when these technically correct codes were counted, an unacceptably large number of errors remained.
The next best-performing model, GPT-3.5, showed the greatest tendency toward vagueness: it had the highest proportion of incorrectly generated codes that were broadly accurate but more general than the precise originals. For example, when provided with the ICD-9-CM description “unspecified adverse effect of anesthesia,” GPT-3.5 generated a code for “other specified adverse effects, not elsewhere classified.”
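Because ICD codes are hierarchical, one hypothetical way to flag this “right category, wrong specificity” error pattern is a prefix check: a generated code that is a proper prefix of the original names a broader parent category. The sketch below is an illustration, not the error taxonomy the researchers used; the is_more_general helper and the numeric codes in the comments are assumptions.

```python
def is_more_general(generated: str, original: str) -> bool:
    """Heuristic: in hierarchical systems such as ICD-9-CM, a code that is
    a proper prefix of the original denotes a broader parent category.
    Illustrative assumption, not the study's error-classification method."""
    gen = generated.replace(".", "")
    orig = original.replace(".", "")
    return gen != orig and orig.startswith(gen)

# Mirrors the "nodular prostate" example above: ICD-9-CM 600.1
# ("nodular prostate") is a parent of 600.10 ("nodular prostate
# without urinary obstruction").
assert is_more_general("600.1", "600.10")      # broader parent code
assert not is_more_general("600.11", "600.10") # sibling codes, not a prefix
```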
These results indicate that further refinement is required before deploying AI technologies in sensitive operational areas like medical coding.
With the current state of the technology, the researchers recommend that LLMs be combined with expert knowledge to automate medical code extraction, potentially enhancing billing accuracy and reducing administrative costs in health care.
The research appears in NEJM AI, titled “Large Language Models Are Poor Medical Coders — Benchmarking of Medical Code Querying.”