Mixed-Vendor Multi-Agent LLMs Show Superior Clinical Diagnosis Performance

Research finds combining different AI models outperforms single-vendor teams in medical diagnosis by pooling complementary strengths.

A new study demonstrates that combining large language models from different vendors significantly improves clinical diagnostic accuracy compared to using models from a single provider.

According to research published on arxiv.org and accepted for presentation at the EACL 2026 Workshop on Healthcare and Language Learning, researchers compared Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks for clinical diagnosis. The study used three doctor agents powered by o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, evaluated on the RareBench and DiagnosisArena benchmarks.
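The article does not spell out the MAC protocol in detail, but the core idea can be sketched as three doctor agents each proposing a ranked differential that is then pooled into a consensus. The sketch below stubs out the three vendor models (the real framework would call each vendor's API), and the rank-weighted voting rule is an assumption for illustration, not the paper's exact aggregation scheme:

```python
# Minimal sketch of one Mixed-Vendor Multi-Agent Conversation (MAC) round.
# The three agents are STUBS standing in for o4-mini, Gemini-2.5-Pro, and
# Claude-4.5-Sonnet; real agents would call each vendor's API with the
# case vignette. The rank-weighted vote is an assumed aggregation rule.
from collections import Counter


def o4_mini_agent(case: str) -> list[str]:
    # Stub: would query the OpenAI model for a ranked differential.
    return ["Fabry disease", "lupus nephritis"]


def gemini_agent(case: str) -> list[str]:
    # Stub: would query the Google model.
    return ["Fabry disease", "amyloidosis"]


def claude_agent(case: str) -> list[str]:
    # Stub: would query the Anthropic model.
    return ["amyloidosis", "Fabry disease"]


def mac_round(case: str) -> str:
    """Pool each agent's differential; return the consensus diagnosis."""
    votes: Counter[str] = Counter()
    for agent in (o4_mini_agent, gemini_agent, claude_agent):
        # Earlier-ranked diagnoses get more weight (assumed scheme).
        for rank, diagnosis in enumerate(agent(case)):
            votes[diagnosis] += 1.0 / (rank + 1)
    return votes.most_common(1)[0][0]


print(mac_round("45-year-old with proteinuria and acroparesthesia"))
# "Fabry disease" wins: 1 + 1 + 0.5 = 2.5 votes vs 1.5 for amyloidosis
```

In the published framework the agents also converse across rounds before converging; the single pooling step above is only the final aggregation stage of that loop.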

The research found that “mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy,” according to the arxiv.org paper. The study’s overlap analysis revealed the underlying mechanism: “mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss.”

The researchers attribute the performance gain to vendor diversity reducing correlated failure modes. Single-vendor teams, which rely on multiple agents from the same model family, “risk correlated failure modes that reinforce shared biases rather than correcting them,” according to the paper.

The findings highlight vendor diversity as “a key design principle for robust clinical diagnostic systems,” suggesting that healthcare AI applications may benefit from incorporating models from multiple providers rather than relying on a single vendor’s technology.