Four new research papers, either accepted for publication or still under review, address fundamental challenges in large language model (LLM) capabilities, according to preprints published on arXiv.
Selection Bias Mitigation
According to arxiv.org, a paper accepted to ACL 2026 proposes Permutation-Aware Group Relative Policy Optimization (PA-GRPO) to address selection bias in LLMs performing multiple-choice tasks. The method enforces “permutation-consistent semantic reasoning” through two mechanisms: cross-permutation advantage computation and consistency-aware rewards. The paper reports that PA-GRPO “outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance.”
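The two mechanisms named above can be illustrated with a toy sketch. The paper's actual reward and advantage formulas are not given in this summary, so everything below is an assumption: the function names, the `consistency_weight` parameter, the agreement-based bonus, and the mean/standard-deviation normalisation in the usual GRPO style are all hypothetical stand-ins, not the authors' implementation.

```python
from itertools import permutations
from statistics import mean, pstdev


def sample_group(options, orders, policy):
    """Ask the policy once per option ordering; return the option content it chose."""
    chosen = []
    for order in orders:
        shown = [options[i] for i in order]   # options as presented to the model
        chosen.append(shown[policy(shown)])   # policy returns a position index
    return chosen


def group_rewards(chosen, correct, consistency_weight=0.5):
    """Hypothetical reward: correctness plus a bonus for agreeing with the group."""
    rewards = []
    for answer in chosen:
        correctness = 1.0 if answer == correct else 0.0
        agreement = chosen.count(answer) / len(chosen)
        rewards.append(correctness + consistency_weight * agreement)
    return rewards


def cross_permutation_advantages(rewards):
    """Normalise rewards over one group pooled across all permutations (assumed form)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma < 1e-12:
        return [0.0] * len(rewards)           # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]


options = ["Paris", "Lyon", "Nice"]
orders = list(permutations(range(len(options))))


def position_biased(shown):
    return 0                                  # always picks the first slot


def content_based(shown):
    return shown.index("Paris")               # follows content, not position


biased_adv = cross_permutation_advantages(
    group_rewards(sample_group(options, orders, position_biased), "Paris"))
stable_adv = cross_permutation_advantages(
    group_rewards(sample_group(options, orders, content_based), "Paris"))
```

In this toy run, the position-biased policy earns a positive advantage only on the orderings where the first slot happens to hold the correct option, and a negative advantage elsewhere. The content-based policy gives the same answer under every ordering, so its group has zero variance and every advantage is zero.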
Hybrid Thinking Architecture
A separate arxiv.org paper introduces Path-Lock Expert (PLE), which addresses “reasoning leakage” in hybrid-thinking models by replacing single MLPs with two semantically locked experts, one for “think” mode and one for “no-think” mode. According to the paper, applying PLE to Qwen3-4B “reduces no-think reflective tokens on AIME24 from 2.54 to 0.39 and improves no-think accuracy from 20.67% to 40.00%.”
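The idea of swapping one MLP for two mode-locked experts can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's architecture: the class names `CountingMLP` and `PathLockedFFN`, the constant weights, and the assumption that the mode arrives as an explicit flag rather than through a learned router are all hypothetical.

```python
class CountingMLP:
    """Toy feed-forward block with constant weights and a call counter."""

    def __init__(self, dim, hidden, weight):
        self.w_in = [[weight] * dim for _ in range(hidden)]
        self.w_out = [[weight] * hidden for _ in range(dim)]
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        # Linear layer, ReLU, then a second linear layer back to `dim` outputs.
        h = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.w_in]
        return [sum(w * v for w, v in zip(row, h)) for row in self.w_out]


class PathLockedFFN:
    """Two experts in place of one MLP; the mode alone decides which one runs."""

    def __init__(self, dim, hidden):
        # Assumption: separate parameters per mode, with no sharing between them.
        self.experts = {
            "think": CountingMLP(dim, hidden, weight=0.1),
            "no_think": CountingMLP(dim, hidden, weight=0.2),
        }

    def __call__(self, x, mode):
        if mode not in self.experts:
            raise ValueError(f"unknown mode: {mode!r}")
        return self.experts[mode](x)


ffn = PathLockedFFN(dim=4, hidden=8)
x = [1.0, 2.0, 3.0, 4.0]
no_think_out = ffn(x, mode="no_think")
think_out = ffn(x, mode="think")
second_no_think_out = ffn(x, mode="no_think")
```

Because the path is chosen by the mode flag alone, a no-think forward pass never touches the think expert's parameters. The call counters in the sketch make that separation easy to check.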
Context Learning and Self-Play
Two additional papers explore learning frameworks. One presents Ctx2Skill, described by arxiv.org as “a self-evolving framework that autonomously discovers, refines, and selects context-specific skills.” The other presents ANCORA, which arxiv.org reports achieves 81.5% pass@1 on the Dafny2Verus evaluation, compared with a 26.6% baseline.