Four New arXiv Papers Address LLM Reasoning Reliability and Agent Memory Systems

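Four papers published on arXiv take on distinct problems in large language model reasoning and agent architectures: negation inconsistency in logical question answering, the faithfulness of chain-of-thought reasoning, exploration in reinforcement learning, and cost-aware agent memory.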

According to arxiv.org, a paper accepted at an ICML 2026 workshop introduces CGD-PD, a training-free method for three-way logical question answering. The method targets “negation inconsistency,” in which a model gives contradictory answers to a hypothesis and its negation. On one validation split of the FOLIO first-order logic benchmark, the paper reports accuracy gains of 4.4 points on GPT-5.2 and 6.8 points on Claude Sonnet 4.5.
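To make the idea of negation inconsistency concrete, the following is a minimal Python sketch of how such a check could be scored. It is not the CGD-PD method itself, whose details are not described here. The label names (True, False, Unknown) and the helper names `is_negation_consistent` and `inconsistency_rate` are assumptions made for illustration.

```python
from enum import Enum


class Label(Enum):
    # Assumed label set for three-way logical QA; actual names may differ.
    TRUE = "True"
    FALSE = "False"
    UNKNOWN = "Unknown"


# If a hypothesis H gets a given label, this is the label its negation
# should get: True and False swap, Unknown stays Unknown.
EXPECTED_FOR_NEGATION = {
    Label.TRUE: Label.FALSE,
    Label.FALSE: Label.TRUE,
    Label.UNKNOWN: Label.UNKNOWN,
}


def is_negation_consistent(answer_h, answer_not_h):
    """Return True if the two answers do not contradict each other."""
    return EXPECTED_FOR_NEGATION[answer_h] is answer_not_h


def inconsistency_rate(pairs):
    """Fraction of (answer_h, answer_not_h) pairs that are inconsistent."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    bad = sum(1 for h, n in pairs if not is_negation_consistent(h, n))
    return bad / len(pairs)
```

For example, a model that answers True for both a hypothesis and its negation would be counted as inconsistent, while a True/False pair or an Unknown/Unknown pair would not.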

A separate study examined whether models’ chain-of-thought reasoning faithfully explains their decisions when contradictory information appears. According to arxiv.org, testing across 200 questions, 8 models, and 4 prompt conditions revealed that reasoning is “highly stable across opposite decisions” with flip pairs retaining 96% similarity. The paper found that “self-rated confidence carries a faint genuine signal” and that “GPT-4o is the only model with statistically reliable reasoning-decision coupling.”

Another paper introduces OGER (Offline-Guided Exploration Reward), which, according to arxiv.org, combines offline teacher guidance with online reinforcement learning to help models explore novel reasoning trajectories beyond their initial policy distribution.

Finally, researchers presented BudgetMem, a runtime agent memory framework that, according to arxiv.org, enables “explicit, query-aware performance-cost control” through budget-tier routing across memory modules. The routing is implemented as a neural policy trained with reinforcement learning.
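As a rough illustration of what routing memory modules across budget tiers might look like, here is a small Python sketch. It is an assumption-laden stand-in, not BudgetMem's implementation: the paper's router is a neural policy trained with reinforcement learning, whereas this uses a hand-written greedy rule. The tier names, the costs in `TIER_COST`, the module names, and the function `route_tiers` are all hypothetical.

```python
# Hypothetical tiers and per-call costs (arbitrary units), cheapest first.
TIER_ORDER = ["low", "mid", "high"]
TIER_COST = {"low": 1, "mid": 3, "high": 9}


def route_tiers(modules, budget):
    """Assign a tier to each memory module without exceeding the budget.

    Greedy stand-in for a learned policy: start every module at the
    cheapest tier, then upgrade modules one step at a time, in order,
    for as long as the next upgrade still fits the budget.
    """
    tiers = {m: TIER_ORDER[0] for m in modules}
    spent = TIER_COST[TIER_ORDER[0]] * len(modules)
    if spent > budget:
        raise ValueError("budget is too small for the cheapest tier")

    upgraded = True
    while upgraded:
        upgraded = False
        for m in modules:
            idx = TIER_ORDER.index(tiers[m])
            if idx + 1 == len(TIER_ORDER):
                continue  # already at the top tier
            next_tier = TIER_ORDER[idx + 1]
            extra = TIER_COST[next_tier] - TIER_COST[tiers[m]]
            if spent + extra <= budget:
                tiers[m] = next_tier
                spent += extra
                upgraded = True
    return tiers, spent
```

Calling `route_tiers(["retrieve", "summarize"], budget=6)` places both hypothetical modules at the "mid" tier. A learned, query-aware policy would instead choose tiers based on the incoming query, trading answer quality against cost.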