Three New Benchmarks Released for Evaluating LLM Capabilities in Commerce, Tool Use, and Building Management

Researchers published three arXiv papers introducing benchmarks for testing LLMs in e-commerce search, tool usage via MCP, and smart building energy systems.

Researchers have published three separate papers on arXiv introducing new benchmarks for evaluating large language model (LLM) performance across different domains.

E-commerce Search Relevance

According to the paper “RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment” (arXiv:2512.24943v1), existing benchmarks for e-commerce search relevance “lack sufficient complexity for comprehensive model assessment.” RAIR addresses this by pairing a challenging long-tail subset with a visual-salience subset for evaluating relevance models in e-commerce search.

MCP Tool Usage

A second paper (arXiv:2512.24565v1) presents MCPAgentBench, described as “A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use.” The authors argue that current Model Context Protocol (MCP) evaluation sets “suffer from issues such as reliance on ex[ternal factors].” The benchmark is designed to assess LLMs acting as autonomous agents that invoke external tools via MCP.
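
For readers unfamiliar with MCP, the following minimal sketch (not taken from the paper) shows how a single tool might be exposed to an agent using the official MCP Python SDK's FastMCP helper; the server name and the "add" tool are illustrative assumptions, not part of MCPAgentBench.

# Minimal illustrative MCP server (not from the MCPAgentBench paper).
# Assumes the official MCP Python SDK (the "mcp" package) is installed.
from mcp.server.fastmcp import FastMCP

server = FastMCP("demo-tools")  # hypothetical server name

@server.tool()
def add(a: int, b: int) -> int:
    """Add two integers; a stand-in for a real-world tool an agent might call."""
    return a + b

if __name__ == "__main__":
    # Serves the tool over stdio so an MCP-capable LLM agent can discover and call it.
    server.run()

Benchmarks such as MCPAgentBench then measure how reliably an agent can discover, select, and invoke tools like this one to complete real-world tasks.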

Smart Building Energy Management

The third paper (arXiv:2512.25055v1) presents a “conceptual framework and a prototype assessment for Large Language Model (LLM)-based Building Energy Management System (BEMS) AI agents.” According to the abstract, the system aims to “facilitate context-aware energy management in smart buildings through natural language inte[rface].”