Three New Benchmarks Examine LLM Capabilities in Forecasting, Physics, and Moral Reasoning
Three recent papers on arXiv introduce new evaluation frameworks for testing large language models across diverse domains.
Context-Aided Forecasting: A paper titled “Beyond Naïve Prompting: Strategies for Improved Context-aided Forecasting with LLMs” (arXiv:2508.09904v2) examines how LLMs can integrate textual context alongside historical data in real-world forecasting tasks, according to the abstract. The research addresses “critical challenges” in this area, though the abstract does not detail specific findings.
Physics Problem-Solving: The CMT-Benchmark paper (arXiv:2510.05228v2) presents a dataset of 50 problems in condensed matter theory “built by expert researchers.” According to the abstract, this benchmark addresses the scarcity of “evaluation on advanced research-level problems in hard sciences,” despite LLMs showing “remarkable progress in coding and math problem-solving.”
Moral Judgment Analysis: A third paper (arXiv:2511.08565v2) titled “Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models” investigates how LLMs “express and shift moral judgments” when prompted to assume different personas. The research is motivated by LLMs “increasingly operat[ing] in social contexts,” according to the abstract.
All three papers are cross-listed to the AI category on arXiv, indicating interdisciplinary relevance.