Three New Benchmarks Examine LLM Capabilities in Forecasting, Physics, and Moral Reasoning

Recent arXiv papers introduce benchmarks testing LLMs on context-based forecasting, condensed matter theory problems, and moral judgment consistency.

Three recent papers on arXiv introduce new evaluation frameworks for testing large language models across diverse domains.

Context-Aided Forecasting: A paper titled “Beyond Naïve Prompting: Strategies for Improved Context-aided Forecasting with LLMs” (arXiv:2508.09904v2) examines how LLMs can integrate textual contextual information alongside historical data in real-world forecasting tasks. According to the abstract, the work addresses “critical challenges” in this area, though the abstract does not detail specific findings.
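
The abstract does not describe the paper’s improved strategies, but the “naïve prompting” baseline it moves beyond is typically a single prompt that interleaves the numeric history with the textual context. A minimal sketch under that assumption follows; the function names, prompt wording, and parsing are illustrative, assuming a generic text-in/text-out model interface rather than the paper’s actual setup:

```python
# Hypothetical sketch of naive context-aided forecasting prompting.
# All names and the prompt format are assumptions, not the paper's method.
from typing import Callable, Sequence

def build_forecast_prompt(history: Sequence[float], context: str, horizon: int) -> str:
    """Fold a numeric history and free-text context into one prompt string."""
    series = ", ".join(f"{x:.2f}" for x in history)
    return (
        f"Context: {context}\n"
        f"Historical values: {series}\n"
        f"Forecast the next {horizon} values as a comma-separated list."
    )

def context_aided_forecast(llm: Callable[[str], str],
                           history: Sequence[float],
                           context: str,
                           horizon: int = 3) -> list[float]:
    """Query the model and parse its reply back into numbers (illustrative parsing)."""
    reply = llm(build_forecast_prompt(history, context, horizon))
    return [float(tok) for tok in reply.split(",")[:horizon]]

# Usage with a stand-in model that just repeats the last observed value:
if __name__ == "__main__":
    dummy_llm = lambda prompt: "102.00, 102.00, 102.00"
    print(context_aided_forecast(dummy_llm, [98.0, 100.0, 102.0],
                                 "A holiday weekend begins tomorrow.", horizon=3))
```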

Physics Problem-Solving: The CMT-Benchmark paper (arXiv:2510.05228v2) presents a dataset of 50 condensed matter theory problems “built by expert researchers.” According to the abstract, the benchmark addresses the scarcity of “evaluation on advanced research-level problems in hard sciences,” even as LLMs show “remarkable progress in coding and math problem-solving.”
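
The abstract does not specify the dataset format or grading protocol, but a generic harness for a benchmark of this shape is a loop that queries the model on each problem and scores the replies. A minimal sketch under that assumption; the Problem fields and the exact-match grader are hypothetical, not CMT-Benchmark’s actual scoring:

```python
# Hypothetical evaluation loop for a 50-problem expert benchmark.
# The dataset schema and exact-match grading are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Problem:
    prompt: str      # the condensed matter theory question
    reference: str   # expert-written reference answer

def evaluate(llm: Callable[[str], str], problems: list[Problem]) -> float:
    """Return the fraction of problems where the model's answer matches the reference."""
    correct = sum(
        llm(p.prompt).strip().lower() == p.reference.strip().lower()
        for p in problems
    )
    return correct / len(problems)

# Usage with a toy problem and a stand-in model:
if __name__ == "__main__":
    toy = [Problem("What symmetry is broken in a ferromagnet?", "spin rotation")]
    print(evaluate(lambda q: "spin rotation", toy))  # -> 1.0
```

Research-level answers rarely reduce to exact string matching, so a real harness would likely substitute symbolic comparison or expert grading; the loop structure stays the same.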

Moral Judgment Analysis: A third paper (arXiv:2511.08565v2) titled “Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models” investigates how LLMs “express and shift moral judgments” when prompted to assume different personas. The research is motivated by LLMs “increasingly operat[ing] in social contexts,” according to the abstract.
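
The abstract does not give the prompting protocol, but persona role-play probing is commonly implemented by prefixing the same moral item with different persona instructions and comparing the responses. A minimal sketch under that assumption; the persona texts, the moral item, and the 1-7 scale below are invented for illustration:

```python
# Hypothetical persona role-play probe for moral judgment shift.
# Personas, the item, and the rating scale are assumptions, not the paper's materials.
from typing import Callable

PERSONAS = [
    "You are a retired judge who values strict rules.",
    "You are a young activist who values compassion above all.",
]
MORAL_ITEM = "Rate from 1 to 7 how acceptable it is to lie to protect a friend."

def probe_personas(llm: Callable[[str, str], str]) -> dict[str, str]:
    """Ask the same moral item under each persona and collect the ratings."""
    return {persona: llm(persona, MORAL_ITEM) for persona in PERSONAS}

# Susceptibility can then be summarized as the spread of ratings across personas.
# Usage with a stand-in model that returns a fixed rating:
if __name__ == "__main__":
    dummy_llm = lambda system, user: "4"
    print(probe_personas(dummy_llm))
```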

All three papers are cross-listed to the AI category on arXiv, indicating interdisciplinary relevance.