New Studies Examine Safety and Reliability of AI Chatbots in Healthcare and Mental Health

Three new research papers evaluate AI language models for medical applications, focusing on ophthalmology, personalization, and mental health safety.

Three new research papers have been published evaluating the use of large language models in clinical settings, with particular focus on safety and reliability.

According to a paper published on arXiv (2602.05381v1), researchers conducted a clinical validation of medical large language model chatbots for ophthalmic patient queries. The study evaluated four small, domain-specific medical language models, arguing that rigorous evaluation is “essential to ensure safety and accuracy” as these models are “increasingly used to support patient education, triage, and clinical decision making in ophthalmology.”

A separate study (arXiv:2507.13579v3) examined methods for personalizing AI assistants through reinforcement learning from human feedback, addressing the growing need to “personalize responses to align to different users’ preferences and goals” as LLM use cases expand.

Most notably, researchers introduced VERA-MH (arXiv:2602.05088v1), an open-source AI safety evaluation tool for mental health applications. The paper states that “millions now use leading generative AI chatbots for psychological support,” making safety evaluation “the single most pressing question in AI for mental health.” The study focuses on establishing the reliability and validity of this safety assessment framework.