Three New Benchmarks Released for Evaluating Multimodal AI Agents and Language Models

Researchers introduce benchmarks for continual learning, web browsing capabilities, and information-seeking in large language models.

Researchers have published three new benchmarks on arXiv aimed at evaluating different capabilities of large language models and multimodal systems.

MLLM-CTBench (arXiv:2508.08275v3) focuses on continual instruction tuning for multimodal large language models (MLLMs). According to the abstract, “continual instruction tuning during the post-training phase” is “crucial for adapting multimodal large language models to evolving real-world demands.” The researchers note that progress has been “hampered by the lack of benchmarks with rigorous” evaluation methods; the available abstract appears truncated at that point.

BrowseComp-V³ (arXiv:2602.12876v1) introduces what the authors describe as “a Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents.” The paper states that multimodal large language models, “equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments.”

GISA (arXiv:2602.08543v2) presents “a Benchmark for General Information-Seeking Assistant.” According to the abstract, “the advancement of large language models has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions,” prompting the need for comprehensive evaluation benchmarks.

All three papers are cross-listed in the cs.AI category on arXiv.