Despite AI advancements, human oversight remains essential

GPT-4 produced the highest proportion of incorrectly generated codes that still conveyed the correct meaning.

A medical employee walks in a corridor of a Monkeypox vaccination site in Paris in August 2022 - Copyright AFP/File JULIEN DE ROSA

They may be presented as ‘state-of-the-art’ artificial intelligence systems, but large language models (LLMs) are poor medical coders. This is according to researchers at the Icahn School of Medicine at Mount Sinai.

A new study emphasizes the necessity for refinement and validation of these technologies before considering clinical implementation.

The study extracted a list of more than 27,000 unique diagnosis and procedure codes from 12 months of routine care in the Mount Sinai Health System, while excluding identifiable patient data. Using the description for each code, the researchers prompted models from OpenAI, Google, and Meta to output the most accurate medical codes. The generated codes were compared with the original codes and errors were analyzed for any patterns.
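The comparison step described above can be pictured as an exact-match check. The Python sketch below is illustrative only and is not the researchers' code: the `exact_match_rate` and `normalize` helpers, the `stub_model` stand-in for a real LLM call, and the three sample ICD-10-CM entries are all assumptions made for the example, as is the choice to ignore case and surrounding whitespace when comparing codes.

```python
from typing import Callable, Iterable, Tuple


def normalize(code: str) -> str:
    # Assumption: case and surrounding whitespace are not counted as errors.
    return code.strip().upper()


def exact_match_rate(
    pairs: Iterable[Tuple[str, str]],
    generate_code: Callable[[str], str],
) -> float:
    """Share of (original_code, description) pairs where the generated code
    equals the original code after normalization."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(
        normalize(generate_code(description)) == normalize(original)
        for original, description in pairs
    )
    return hits / len(pairs)


# Hypothetical sample, not data from the study.
SAMPLE = [
    ("I10", "Essential (primary) hypertension"),
    ("E11.9", "Type 2 diabetes mellitus without complications"),
    ("J45.909", "Unspecified asthma, uncomplicated"),
]


def stub_model(description: str) -> str:
    # Stand-in for an LLM call; returns canned answers, one of them too general.
    canned = {
        "Essential (primary) hypertension": "I10",
        "Type 2 diabetes mellitus without complications": " e11.9 ",
        "Unspecified asthma, uncomplicated": "J45.9",
    }
    return canned.get(description, "")


if __name__ == "__main__":
    print(f"Exact match rate: {exact_match_rate(SAMPLE, stub_model):.1%}")
```

In this toy run the stub gets two of three codes right; the third answer is close in meaning but fails the exact-match test, which is the kind of near miss the study analyzed separately.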

The investigators reported that all of the studied large language models, including GPT-4, GPT-3.5, Gemini-pro, and Llama-2-70b, showed limited accuracy (below 50 percent) in reproducing the original medical codes, highlighting a significant gap in their usefulness for medical coding.

Of the different technologies assessed, GPT-4 demonstrated the best performance, with the highest exact match rates for ICD-9-CM (45.9 percent), ICD-10-CM (33.9 percent), and CPT codes (49.8 percent).

GPT-4 also produced the highest proportion of incorrectly generated codes that still conveyed the correct meaning. For example, when given the ICD-9-CM description “nodular prostate without urinary obstruction,” GPT-4 generated a code for “nodular prostate,” showcasing its comparatively nuanced understanding of medical terminology.

However, even considering these technically correct codes, an unacceptably large number of errors remained.

The next best-performing model, GPT-3.5, had the greatest tendency toward vagueness. It had the highest proportion of incorrectly generated codes that were accurate but more general than the precise original codes. For example, when provided with the ICD-9-CM description “unspecified adverse effect of anesthesia,” GPT-3.5 generated a code for “other specified adverse effects, not elsewhere classified.”

These results indicate that further refinement is required before deploying AI technologies in sensitive operational areas like medical coding.

Given the current state of the technology, the researchers recommend that LLMs be combined with expert knowledge to automate medical code extraction, an approach that could enhance billing accuracy and reduce administrative costs in health care.

The research appears in the journal NEJM AI under the title “Large Language Models Are Poor Medical Coders — Benchmarking of Medical Code Querying.”

Written By

Dr. Tim Sandle is Digital Journal's Editor-at-Large for science news. Tim specializes in science, technology, environmental, business, and health journalism. He is additionally a practising microbiologist; and an author. He is also interested in history, politics and current affairs.
