Researchers Advance LLM Reasoning Through Inverse Reinforcement Learning and Optimization Frameworks

Several papers posted to arXiv address challenges in improving the reasoning capabilities of large language models (LLMs).

According to a paper on arxiv.org, researchers propose an adversarial inverse reinforcement learning (AIRL) framework that learns reasoning rewards directly from expert demonstrations, addressing limitations of supervised fine-tuning and outcome-based reinforcement learning. The framework evaluates rewards at three granularities: sparse, interval, and dense. The paper reports that the learned reasoning rewards improved performance over supervised fine-tuning on medical reasoning (MedReason), mathematics (GSM8K), and scientific question-answering (MMLU-Pro), with inference-time reranking gains of up to 17.4 percentage points.
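
For readers unfamiliar with inference-time reranking, the general idea can be sketched as best-of-N selection: generate several candidate answers, score each with a reward function, and keep the top one. The sketch below is illustrative only and is not taken from the paper. The function names are hypothetical, and `toy_step_reward` is a hand-written stand-in for a learned reward model, not the AIRL reward itself.

```python
from typing import Callable, Sequence


def rerank_best_of_n(candidates: Sequence[str],
                     reward_fn: Callable[[str], float]) -> str:
    """Return the candidate answer with the highest reward score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)


def toy_step_reward(answer: str) -> float:
    """Hypothetical stand-in for a learned reward model.

    Scores an answer by the fraction of its non-empty lines that
    contain an explicit equation. A real system would call a trained
    reward model here instead.
    """
    steps = [line for line in answer.splitlines() if line.strip()]
    if not steps:
        return 0.0
    return sum(1.0 for step in steps if "=" in step) / len(steps)


if __name__ == "__main__":
    samples = [
        "I guess 12",
        "3 * 4 = 12\nanswer = 12",
    ]
    print(rerank_best_of_n(samples, toy_step_reward))
```

In this toy setup the second sample wins because every line shows a worked step. With a learned reward, the same selection loop applies unchanged, since only `reward_fn` differs.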

Separately, another arxiv.org paper introduces DynaMO, a “theoretically-grounded dual-pronged optimization framework” for Reinforcement Learning with Verifiable Rewards (RLVR). According to the paper, the framework addresses resource allocation challenges by proving that “uniform allocation is suboptimal” and deriving a variance-minimizing allocation. The researchers report “consistent improvements over strong RLVR baselines” on mathematical reasoning benchmarks.
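
The intuition behind why uniform allocation can be suboptimal can be illustrated with a classic result from survey sampling, Neyman allocation. This is an analogy, not DynaMO's actual derivation, which the summary above does not spell out. The sketch assumes each prompt yields a binary pass/fail reward with an estimated pass rate, and that the goal is to minimize the summed variance of the per-prompt mean rewards under a fixed rollout budget. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def neyman_allocation(pass_rates: Sequence[float], budget: int) -> List[float]:
    """Split `budget` rollouts across prompts in proportion to the
    standard deviation of each prompt's binary reward.

    Returns fractional counts. A real system would round these to
    integers and enforce a minimum per prompt (assumption, not shown).
    """
    if not pass_rates:
        return []
    sigmas = [math.sqrt(p * (1.0 - p)) for p in pass_rates]
    total = sum(sigmas)
    if total == 0.0:
        # Every prompt is deterministic, so fall back to a uniform split.
        return [budget / len(pass_rates)] * len(pass_rates)
    return [budget * s / total for s in sigmas]


def total_variance(pass_rates: Sequence[float],
                   allocation: Sequence[float]) -> float:
    """Sum over prompts of sigma_i^2 / n_i, the variance of each
    prompt's mean reward. Prompts with zero rollouts are skipped."""
    return sum(p * (1.0 - p) / n
               for p, n in zip(pass_rates, allocation) if n > 0)


if __name__ == "__main__":
    rates = [0.5, 0.9, 0.99]      # illustrative pass rates, not real data
    budget = 30
    uniform = [budget / len(rates)] * len(rates)
    weighted = neyman_allocation(rates, budget)
    print("uniform :", total_variance(rates, uniform))
    print("weighted:", total_variance(rates, weighted))
```

Under these assumptions, prompts whose outcomes are most uncertain (pass rates near 0.5) receive more rollouts, and the weighted split yields a lower total variance than the uniform one.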

Additionally, a paper on arxiv.org presents the first survey on abductive reasoning in LLMs, establishing a unified two-stage definition that categorizes work into “Hypothesis Generation” and “Hypothesis Selection.” The survey identifies “critical gaps in current approaches,” including static benchmark design and limited mechanistic understanding.

Another arxiv.org paper introduces DAVinCI, a framework for attribution and verification that reportedly improved classification accuracy, precision, recall, and F1-score by 5-20% on datasets including FEVER and CLIMATE-FEVER.