Three New Studies Examine Safety and Validation of AI Systems in Healthcare Applications

Researchers publish new work on evaluating medical LLMs in ophthalmology, personalizing AI feedback, and assessing mental health chatbot safety.
Three recent arXiv preprints address critical evaluation challenges for large language models (LLMs) deployed in medical and mental health contexts.

According to arXiv:2602.05381v1, researchers conducted a “clinical validation” study evaluating four small medical domain-specific LLMs on ophthalmic patient queries, using LLM-based evaluation methods. The study focused on models increasingly used “to support patient education, triage, and clinical decision making in ophthalmology,” with the authors emphasizing that “rigorous evaluation [is] essential to ensure safety and accuracy.”

A separate study (arXiv:2507.13579v3) examined personalization in AI assistants, developing methods for “learning to summarize user information for personalized reinforcement learning from human feedback.” The researchers noted that as “everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align to different users’ preferences and goals.”

In mental health applications, researchers introduced VERA-MH (arXiv:2602.05088v1), described as an “open-source AI safety evaluation” framework. According to the abstract, “millions now use leading generative AI chatbots for psychological support,” making safety assessment “the single most pressing question in AI for mental health.” The study presents reliability and validity data for this evaluation tool.

All three papers remain preprints and have not yet undergone peer review.