Three New Benchmarks Test LLMs on Planning, Video Understanding, and Mathematical Reasoning

Researchers introduce benchmarks evaluating LLM capabilities in Wikipedia navigation, temporal video reasoning, and formal mathematical theorem proving.

Researchers have released three new benchmarks designed to evaluate different aspects of large language model capabilities, according to recent arXiv preprints.

LLM-WikiRace (arXiv:2602.16902v2) introduces a benchmark for assessing planning, reasoning, and world knowledge in LLMs. According to the paper, models must “efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given” starting point, testing their ability to plan over real-world knowledge graphs.
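The race objective, reaching the target page in as few clicks as possible, can be illustrated with a breadth-first search over a toy link graph. The page names and links below are invented for illustration (the benchmark itself navigates real Wikipedia), and BFS gives the optimal-path baseline an LLM navigator would be measured against:

```python
from collections import deque

# Hypothetical miniature link graph standing in for Wikipedia.
LINKS = {
    "Python": ["Guido van Rossum", "Programming language"],
    "Guido van Rossum": ["Netherlands"],
    "Programming language": ["Computer science"],
    "Computer science": ["Mathematics"],
    "Netherlands": ["Europe"],
}

def shortest_hyperlink_path(start, target):
    """Breadth-first search for the fewest-click path between pages."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in LINKS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # target unreachable from start

print(shortest_hyperlink_path("Python", "Mathematics"))
# → ['Python', 'Programming language', 'Computer science', 'Mathematics']
```

An LLM agent, by contrast, sees only the current page's links at each step and must plan from world knowledge rather than exhaustive search, which is what makes the task a planning benchmark.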

TimeBlind (arXiv:2602.00288v2) addresses spatio-temporal understanding in video-based language models. The benchmark focuses on temporal dynamics, an area where current Multimodal Large Language Models (MLLMs) show weaknesses. According to the abstract, while MLLMs “master static semantics, their grasp of temporal dynamics remains brittle,” making fine-grained spatio-temporal understanding essential for video reasoning and embodied AI applications.

FATE (arXiv:2511.02872v3) provides a formal benchmark series for frontier algebra across multiple difficulty levels. The researchers note that while LLMs have shown “impressive capabilities in formal theorem proving, particularly on contest-based mathematical benchmarks like the IMO,” these contests “do not reflect the depth, breadth” of mathematical reasoning, motivating the need for more comprehensive evaluation tools.
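For readers unfamiliar with formal theorem proving, the task is to produce machine-checkable proofs in a proof assistant such as Lean. The snippet below is a deliberately trivial abstract-algebra statement in Lean 4 with Mathlib, shown only to illustrate the format; it is not drawn from the FATE benchmark itself:

```lean
import Mathlib.Algebra.Group.Basic

-- In any group, the inverse of a product reverses the order of factors.
-- The proof term `mul_inv_rev` is checked by the Lean kernel, so a model's
-- output is verified mechanically rather than graded by a human.
example {G : Type*} [Group G] (a b : G) : (a * b)⁻¹ = b⁻¹ * a⁻¹ :=
  mul_inv_rev a b
```

Benchmark problems at the "frontier algebra" level replace such textbook lemmas with research-depth statements, which is the gap the FATE series is designed to probe.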

These benchmarks aim to provide more nuanced assessments of LLM capabilities beyond standard performance metrics.