Three New Studies Examine Limitations and Risks of Large Language Models in Medical AI

Recent arXiv papers evaluate medical AI models, identify ineffective LLM layers, and test clinical decision-making vulnerabilities.

Three new papers on arXiv explore critical aspects of AI models in healthcare applications and fundamental LLM architecture issues.

Medical AI Model Comparison

According to arXiv:2601.16549v1, researchers conducted “a rigorous, unified benchmark” comparing multimodal Vision-Language Models (VLMs) and Large Language Models (LLMs) against traditional machine learning approaches for medical classification tasks. The study utilized four publicly available datasets covering both text and image-based medical data, though the paper abstract does not specify which approach performed better.

The Curse of Depth in LLMs

A separate study (arXiv:2502.05795v4) introduces the concept of “Curse of Depth,” which “highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) where nearly half of the layers are less effective than expected,” according to the researchers. The paper confirms this phenomenon exists in current LLM architectures.

Clinical Decision-Making Vulnerabilities

Researchers developed SycoEval-EM (arXiv:2601.16529v1), “a multi-agent simulation framework evaluating LLM robustness through adversarial patient” interactions in emergency care scenarios. According to the paper, the framework addresses concerns that LLMs “risk acquiescing to patient pressure for inappropriate care” in clinical decision support applications.