Researchers have released four papers tackling key challenges in large language model development, spanning training objectives, quantization techniques, and agent memory systems.
According to a paper published on arxiv.org, researchers introduced Distribution-Aware Reward, an on-policy reinforcement learning method that trains language models to produce better predictive distributions for regression tasks. The method evaluates multiple decoded samples using the Continuous Ranked Probability Score and showed a “6-point Spearman improvement on KBSS” across tasks including code performance prediction and molecular property prediction.
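For readers unfamiliar with the Continuous Ranked Probability Score, the following is a minimal sketch of how it can be estimated from a handful of decoded numeric samples, using the standard identity CRPS = E|X - y| - 0.5 E|X - X'|. The function names and the choice to use the negated score as a reward are assumptions made for illustration; the paper's exact estimator and reward shaping may differ.

```python
from typing import Sequence


def crps_from_samples(samples: Sequence[float], target: float) -> float:
    """Empirical CRPS of a set of samples against one observed target.

    Lower is better. With a single sample this reduces to absolute error.
    """
    m = len(samples)
    if m == 0:
        raise ValueError("need at least one sample")
    # Accuracy term: mean distance between the samples and the target.
    accuracy = sum(abs(x - target) for x in samples) / m
    # Spread term: half the mean pairwise distance among the samples.
    spread = sum(abs(a - b) for a in samples for b in samples) / (2 * m * m)
    return accuracy - spread


def distribution_reward(samples: Sequence[float], target: float) -> float:
    """Hypothetical reward: negated CRPS, so better distributions score higher."""
    return -crps_from_samples(samples, target)


if __name__ == "__main__":
    # Samples centred on the target score better than confident wrong ones.
    print(crps_from_samples([0.0, 2.0], 1.0))
    print(crps_from_samples([3.0, 3.0], 1.0))
```

Because the score depends on the whole set of samples rather than on any single decoded value, it rewards a model both for placing mass near the true value and for not being overconfident when it is wrong.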
Another arxiv.org paper addressed MXFP4 quantization error in LLM reinforcement learning. The researchers proved “an exact three-way decomposition of quantization error” into scale bias, deadzone truncation, and grid noise components. Their targeted corrections “recover BF16 accuracy to within 0.7% and 3.0%” on Qwen2.5-3B and Qwen3-30B-A3B-Base models respectively, according to the paper.
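As background, the sketch below simulates MXFP4 rounding on one block of values, following the general layout of the OCP Microscaling format: blocks of 32 elements share a power-of-two scale, and each element is stored as a 4-bit E2M1 value. It does not reproduce the paper's decomposition or its corrections. The function names are hypothetical, tie-breaking is simplified, and the scale's exponent range is not clamped.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign stored separately).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # elements sharing one scale in MXFP4
ELEM_EMAX = 2    # exponent of the largest E2M1 magnitude: 6 = 1.5 * 2**2


def quantize_block(block):
    """Round one block to the MXFP4 grid; return (dequantized values, scale)."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return [0.0] * len(block), 1.0
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
    _, e = math.frexp(max_abs)
    scale = 2.0 ** ((e - 1) - ELEM_EMAX)
    out = []
    for v in block:
        scaled = abs(v) / scale
        # Nearest grid point; values above 6 saturate at 6.
        # Simplification: ties go to the smaller magnitude.
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - scaled))
        out.append(math.copysign(nearest * scale, v))
    return out, scale


def quantize(values):
    """Apply quantize_block to consecutive blocks of BLOCK_SIZE values."""
    result = []
    for start in range(0, len(values), BLOCK_SIZE):
        dequantized, _ = quantize_block(values[start:start + BLOCK_SIZE])
        result.extend(dequantized)
    return result


if __name__ == "__main__":
    block = [6.0, 0.1, -3.0, 0.0]
    dequantized, scale = quantize_block(block)
    errors = [q - v for q, v in zip(dequantized, block)]
    print(dequantized, scale, errors)
```

In this toy block the small entry 0.1 rounds to zero because the shared scale is set by the largest element, which gives a sense of why error on a coarse 4-bit grid has several distinct sources.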
A third paper on arxiv.org presented Mem-π, a framework for adaptive memory in LLM agents that generates “context-specific guidance” on demand rather than relying on retrieval. The system “consistently outperforms retrieval-based and prior RL-optimized memory baselines, achieving over 30% relative improvement on web navigation tasks,” according to the researchers.
Finally, a fourth paper on arxiv.org describes BudgetMem, a runtime agent memory framework that structures memory processing as modules, each offered in three budget tiers, and uses a neural policy trained with reinforcement learning to balance task performance against construction cost.