Four New Papers Tackle Reasoning, Bias, and Learning Challenges in Large Language Models

Four new preprints on arXiv, each either accepted for publication or under review, address fundamental challenges in large language model (LLM) capabilities.

Selection Bias Mitigation

According to arxiv.org, a paper accepted to ACL 2026 proposes Permutation-Aware Group Relative Policy Optimization (PA-GRPO) to address selection bias in LLMs on multiple-choice tasks, where a model's answer can depend on an option's position or label rather than its content. The method enforces “permutation-consistent semantic reasoning” through two mechanisms: cross-permutation advantage computation and consistency-aware rewards. The paper reports that PA-GRPO “outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance.”
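For readers who want a concrete picture of the two mechanisms, the following is a minimal Python sketch, not the paper's implementation. It assumes that "cross-permutation advantage" means normalizing rewards against statistics pooled over every option ordering of the same question, in the usual GRPO style, and that a "consistency-aware reward" adds a bonus when a sample picks the same option content as other samples in the group. The function names, the `weight` parameter, and the bonus formula are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def consistency_bonus(chosen_contents, weight=0.5):
    """Hypothetical bonus: weight times the fraction of the *other* samples
    in the group that chose the same option content (not the same letter)."""
    n = len(chosen_contents)
    if n < 2:
        return [0.0] * n
    counts = Counter(chosen_contents)
    return [weight * (counts[c] - 1) / (n - 1) for c in chosen_contents]


def cross_permutation_advantages(rewards, eps=1e-8):
    """GRPO-style advantage, with mean and standard deviation pooled over
    samples drawn from all permutations of one question (an assumption)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One question shown under three option orderings; the model's pick is
# recorded by content so that shuffled letters do not matter.
chosen = ["Paris", "Paris", "Lyon"]
correctness = [1.0, 1.0, 0.0]

bonuses = consistency_bonus(chosen)
rewards = [c + b for c, b in zip(correctness, bonuses)]
advantages = cross_permutation_advantages(rewards)
```

In this toy setup, the two samples that agree on the correct content receive a positive advantage and the outlier receives a negative one, which is the kind of pressure toward permutation-invariant answers the paper describes.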

Hybrid Thinking Architecture

A separate paper on arxiv.org introduces Path-Lock Expert (PLE), which addresses “reasoning leakage” in hybrid-thinking models by replacing each single MLP with two semantically locked experts, one for “think” mode and one for “no-think” mode. According to the paper, applying PLE to Qwen3-4B “reduces no-think reflective tokens on AIME24 from 2.54 to 0.39 and improves no-think accuracy from 20.67% to 40.00%.”
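The idea of swapping one MLP for two mode-locked experts can be illustrated with a toy sketch. This is an assumption-laden illustration rather than the paper's architecture: the class name `PathLockedMLP`, the `make_expert` helper, and the string mode flag are hypothetical, and each "expert" here is a one-parameter ReLU function standing in for a full feed-forward block.

```python
def make_expert(weight, bias):
    """Build a toy stand-in for an MLP expert: elementwise ReLU(w * x + b)."""
    def expert(x):
        return [max(0.0, weight * xi + bias) for xi in x]
    return expert


class PathLockedMLP:
    """Hypothetical layer that holds two experts and hard-routes by mode,
    so "think" and "no_think" inputs never share MLP parameters."""

    def __init__(self, think_expert, no_think_expert):
        self._experts = {"think": think_expert, "no_think": no_think_expert}

    def forward(self, x, mode):
        if mode not in self._experts:
            raise ValueError(f"unknown mode: {mode!r}")
        # The mode alone selects the path; there is no learned router.
        return self._experts[mode](x)


layer = PathLockedMLP(make_expert(2.0, 0.0), make_expert(-1.0, 0.0))
think_out = layer.forward([1.0, -2.0], "think")
no_think_out = layer.forward([1.0, -2.0], "no_think")
```

Because the routing is fixed by the mode rather than learned, nothing computed along the "think" path can influence the "no-think" path at this layer, which is one plausible reading of how locked experts would limit leakage.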

Context Learning and Self-Play

Two additional papers explore learning frameworks. One introduces Ctx2Skill, described by arxiv.org as “a self-evolving framework that autonomously discovers, refines, and selects context-specific skills.” The other presents ANCORA, which arxiv.org reports achieves 81.5% pass@1 on the Dafny2Verus evaluation, compared with a 26.6% baseline.
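For context on the metric, pass@1 is commonly read as the probability that a single sampled solution passes. The sketch below uses the widely cited unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for one problem with n samples of which c pass. Whether the ANCORA evaluation computes pass@1 this way is an assumption, and the sample counts are made up for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimator reduces to the plain success rate c / n.
example = pass_at_k(n=10, c=3, k=1)

# The reported gap, in percentage points, between ANCORA and the baseline.
gain_points = round(81.5 - 26.6, 1)
```

A benchmark-level score would then be the mean of this per-problem estimate over all problems.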