New Research Addresses Mathematical Reasoning, Inference Optimization, and Training Stability in LLMs

Three new arXiv papers tackle adaptive reasoning strategies, long-context inference optimization, and reinforcement learning training instabilities.

Three recent papers on arXiv address key challenges in large language model development and deployment.

Adaptive Mathematical Reasoning

According to arXiv paper 2502.12022v4, existing approaches to mathematical reasoning with LLMs rely on either Chain-of-Thought (CoT) for generalizability or Tool-Integrated Reasoning (TIR) for precise computation. The paper explores combining these methods through adaptive reasoning strategies tailored to model capabilities.
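The paper's actual routing policy is not described here; purely as an illustration of the general idea (free-form CoT text versus delegating exact arithmetic to a tool), a minimal sketch might look like the following. The heuristic and helper names are assumptions for illustration, not the paper's method.

```python
# Hypothetical sketch of adaptive CoT-vs-TIR routing. The routing rule
# (a crude "big numbers need a tool" heuristic) is illustrative only.

def needs_precise_computation(question: str) -> bool:
    # Crude heuristic: numbers longer than four digits suggest the model
    # should hand the arithmetic to a tool rather than reason in text.
    return any(tok.isdigit() and len(tok) > 4 for tok in question.split())

def solve(question: str) -> str:
    if needs_precise_computation(question):
        # TIR-style path: strip the question down to an arithmetic
        # expression and evaluate it (a stand-in for a code interpreter).
        expr = "".join(ch for ch in question if ch in "0123456789+-*/(). ")
        return str(eval(expr))
    # CoT-style path: fall back to free-form reasoning text (stubbed).
    return "reasoning step by step..."

print(solve("What is 123456 * 789?"))  # routed to the tool path: 97406784
```

The point of the adaptive framing is that neither branch dominates: small, conceptual problems favor the CoT path, while computation-heavy ones favor the tool path.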

Long-Context Inference Optimization

ArXiv paper 2506.08373v3 addresses the challenge of optimizing inference for long-context LLMs, noting that Transformers incur "quadratic compute and linear memory cost" as context length grows. The research examines draft-based approximate inference methods as an alternative to existing approaches such as key-value (KV) cache dropping.
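KV cache dropping, mentioned above as an existing approach, can be illustrated loosely: bound memory by keeping only the most recent key/value pairs. Real eviction policies (attention-score-based selection, sink tokens, and so on) are more elaborate; this is a sketch under that simplifying assumption, and the class and names are invented for illustration.

```python
from collections import deque

# Illustrative KV-cache dropping: cap memory at a fixed window of the
# most recent key/value pairs, evicting the oldest entries. This is a
# toy sliding-window policy, not any specific paper's method.

class SlidingKVCache:
    def __init__(self, window: int):
        self.window = window
        self.keys = deque(maxlen=window)    # oldest entries drop automatically
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

cache = SlidingKVCache(window=4)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")
print(len(cache), list(cache.keys))  # 4 ['k6', 'k7', 'k8', 'k9']
```

Dropping entries trades accuracy for memory, which is why the paper positions draft-based approximate inference as an alternative rather than a refinement of cache eviction.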

RLVR Training Stability

A new paper (arXiv:2602.01103v1) investigates instabilities in reinforcement learning with verifiable rewards (RLVR). According to the abstract, while prolonged RLVR "has been shown to drive continuous improvements in the reasoning capabilities of large language models," training "is often prone to instabilities, especially in Mixture-of-Experts" models. The research analyzes these instabilities through the lens of "objective-level hacking."

All three papers represent ongoing efforts to improve LLM performance, efficiency, and training stability.