New Research Examines Multi-Agent Collaboration Approaches for Large Language Models
Three recent arXiv papers examine how multiple AI agents and models can work together, revealing both limitations and opportunities.
According to arXiv:2601.19921v1, multi-agent debate (MAD) is widely used to improve large language model performance through test-time scaling. However, the research shows that “vanilla MAD often underperforms simple majority vote despite higher computational cost,” challenging assumptions about this popular approach.
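The majority-vote baseline that vanilla MAD is compared against can be sketched as follows; this is a minimal illustration of voting over independent model samples, not the paper's implementation, and the sample answers are invented for demonstration:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among independent samples.

    Ties are broken by first occurrence, per Counter.most_common.
    """
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

# Hypothetical answers from five independent samples of the same model
samples = ["42", "41", "42", "42", "40"]
print(majority_vote(samples))  # prints 42
```

The point of the comparison is that this baseline requires no inter-agent communication: each sample is generated independently and the votes are simply tallied, which is typically cheaper than multiple rounds of debate.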
A separate paper (arXiv:2601.20617v1) examines LLM agents in public sector applications, finding that “deploying Large Language Model-based agents (LLM agents) in the public sector requires assuring that they meet the stringent legal, procedural, and structural requirements of public-sector institutions.” The research indicates that current agent benchmarks fail to address these requirements adequately.
Meanwhile, arXiv:2512.23340v2 investigates “The Law of Multi-Model Collaboration,” exploring scaling limits when multiple LLMs work together through ensembling. The paper notes that while “recent advances in large language models (LLMs) have been largely driven by scaling laws for individual models,” the capabilities of single LLMs have inherent limitations, prompting investigation into collaborative approaches.
These studies collectively highlight ongoing efforts to understand how multiple AI models can work together effectively, while identifying current limitations in both methodology and evaluation frameworks.