AI remains lacking in clinical reasoning abilities, according to study of 21 large language models

Published 04/13/2026

Despite increasing use of artificial intelligence (AI) in health care, a new study led by Mass General Brigham researchers from the MESH Incubator shows that generative AI models continue to fall short in their clinical reasoning capabilities.

In the new study, the researchers developed PrIME-LLM, a more holistic measure of LLM performance that looks beyond accuracy. It evaluates a model's competency across the different stages of clinical reasoning: coming up with potential diagnoses, conducting appropriate tests, arriving at a final diagnosis, and managing treatment.

The study compared 21 general-purpose LLMs, including the latest versions of ChatGPT, DeepSeek, Claude, Gemini, and Grok available at the time of submission.

The researchers tested the models' ability to work through 29 published clinical cases.

In line with their previous study, the researchers found that the LLMs were good at producing accurate final diagnoses.

However, all of the models failed to produce an appropriate differential diagnosis more than 80% of the time.
