Three New arXiv Papers Present Benchmarks and Frameworks for Evaluating AI Systems

Three recent arXiv papers examine how AI systems built on large language models can be evaluated and applied in three domains: clinical medicine, end-to-end research, and formal theorem proving.

Medical AI Evaluation

According to arXiv paper 2509.02594v2, researchers have applied “OpenAI’s HealthBench” to evaluate an LLM-based medical assistant. The paper focuses on assessing how these systems handle “complex, high-s[takes]” clinical questions, moving “beyond conventional benchmarks” to evaluate situational awareness in clinical contexts.

AI Research Agent Benchmark

The arXiv paper 2602.15112v1 introduces ResearchGym, described as “a benchmark and execution environment for evaluating AI agents on end-to-end research.” According to the abstract, the researchers “repurpose five oral and spotlight papers from ICML, ICLR, and ACL,” preserving data from each paper’s repository to create the evaluation framework.

Automated Theorem Proving

The arXiv paper 2506.19923v5 presents Prover Agent, “a novel AI agent for automated theorem proving that integrates large language models (LLMs) with a formal proof assistant, Lean.” According to the paper, Prover Agent coordinates “an informal reasoning LLM, a formal prover model, and feedback from L[ean]” to generate mathematical proofs.
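
To make the coordination pattern concrete, the sketch below shows one generic way a loop of this kind could be wired together: an informal step proposes a plan, a formal step turns it into a candidate proof, and a checker's error message is fed back into the next attempt. It is an illustration only and not the paper's implementation. Every name in it (`prove_with_feedback`, `CheckResult`, and the three callables) is hypothetical, and the checker is an injected function standing in for a proof assistant rather than a real Lean API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    """Outcome of one proof check (hypothetical type for this sketch)."""
    ok: bool
    message: str  # checker error text; empty on success


def prove_with_feedback(
    statement: str,
    sketch_informally: Callable[[str, str], str],   # (statement, feedback) -> plan
    write_formal_proof: Callable[[str, str], str],  # (statement, plan) -> candidate
    check_proof: Callable[[str], CheckResult],      # stand-in for a proof checker
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first candidate the checker accepts, or None if all fail."""
    feedback = ""  # no checker feedback exists before the first attempt
    for _ in range(max_attempts):
        plan = sketch_informally(statement, feedback)
        candidate = write_formal_proof(statement, plan)
        result = check_proof(candidate)
        if result.ok:
            return candidate
        # Pass the checker's error text into the next informal step.
        feedback = result.message
    return None
```

Passing the three components in as plain functions keeps the sketch independent of any particular model or proof assistant; in a real system each would wrap a model call or a checker process.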

Taken together, the three papers reflect ongoing efforts to build and evaluate AI systems for specialized technical domains.