Researchers have published two papers proposing novel approaches to improve large language model training by focusing on reasoning processes rather than just final outputs.
According to an arXiv paper set to appear at ICML 2026, TUR-DPO (Topology- and Uncertainty-Aware Direct Preference Optimization) addresses limitations in standard Direct Preference Optimization (DPO) with an approach that “rewards how answers are derived, not only what they say.” The method elicits “lightweight reasoning topologies” and combines semantic faithfulness, utility, and topology quality into a calibrated uncertainty signal. The paper reports that across 7-8B-parameter models and benchmarks including mathematical reasoning, factual question answering, summarization, and dialogue, TUR-DPO “improves judge win-rates, faithfulness, and calibration relative to DPO while preserving training simplicity and avoiding online rollouts.”
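For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-pair DPO loss, followed by one way a per-pair confidence weight could scale it. The weighting is an illustrative assumption, not the formulation from the TUR-DPO paper: the function `confidence_weight`, its simple averaging of the three scores, and the parameter names are all hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    under the policy or the frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = -beta * margin
    # -log(sigmoid(beta * margin)) == softplus(x), computed stably.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def confidence_weight(faithfulness, utility, topology_quality):
    """HYPOTHETICAL: average three scores in [0, 1] into one weight.

    The paper's actual calibrated uncertainty signal is not specified
    here; a plain mean is used only for illustration.
    """
    return (faithfulness + utility + topology_quality) / 3.0


def weighted_dpo_loss(pair, scores, beta=0.1):
    """HYPOTHETICAL: scale the DPO loss by the pair's confidence weight."""
    return confidence_weight(**scores) * dpo_loss(**pair, beta=beta)


if __name__ == "__main__":
    pair = {
        "policy_chosen_logp": -8.0,
        "policy_rejected_logp": -12.0,
        "ref_chosen_logp": -10.0,
        "ref_rejected_logp": -12.0,
    }
    scores = {"faithfulness": 1.0, "utility": 0.5, "topology_quality": 0.0}
    print(weighted_dpo_loss(pair, scores))
```

In this sketch, a pair whose reasoning scores poorly contributes less to the gradient, which is one simple way to let "how answers are derived" influence an otherwise offline, rollout-free objective.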
A separate arXiv paper introduces PORTool, which addresses credit-assignment challenges in training tool-use agents. According to the paper, PORTool generates “a rewarded rollout tree in which trajectories share prefixes before branching, enabling direct comparisons among alternative tool-use decisions within the same context.” The system estimates each step’s importance using a “correctness-dominant signal” that considers whether descendants produce correct answers and whether tool calls meet formatting and execution requirements. The paper reports that in its experiments PORTool “improves final-answer accuracy while reducing tool-call steps compared with state-of-the-art policy-optimization baselines.”
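The idea of comparing sibling tool-use decisions that share a prefix can be illustrated with a toy tree. This is a rough sketch under stated assumptions, not PORTool's actual algorithm: the `Node` class, the `format_bonus` value, and the rule of scoring a step by the mean correctness of its descendant leaves are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One step in a rollout tree (HYPOTHETICAL structure)."""
    name: str
    format_ok: bool = True          # did the tool call parse and execute?
    correct: Optional[bool] = None  # set only on leaves (final answers)
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect 1.0/0.0 correctness rewards from all descendant leaves."""
    if not node.children:
        return [1.0 if node.correct else 0.0]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def step_value(node: Node, format_bonus: float = 0.1) -> float:
    """Correctness dominates; formatting adds or subtracts a small term."""
    rewards = leaf_rewards(node)
    value = sum(rewards) / len(rewards)
    return value + (format_bonus if node.format_ok else -format_bonus)


def sibling_advantages(parent: Node) -> Dict[str, float]:
    """Score each child relative to the mean of its siblings."""
    values = {c.name: step_value(c) for c in parent.children}
    baseline = sum(values.values()) / len(values)
    return {name: v - baseline for name, v in values.items()}


if __name__ == "__main__":
    root = Node("prompt", children=[
        Node("call_search", children=[
            Node("answer_a", correct=True),
            Node("answer_b", correct=True),
        ]),
        Node("call_calculator", format_ok=False, children=[
            Node("answer_c", correct=True),
            Node("answer_d", correct=False),
        ]),
    ])
    print(sibling_advantages(root))
```

Because both branches start from the same prompt, the difference in their scores can be attributed to the tool-use decision itself rather than to differences in context, which is the intuition behind comparing alternatives within a shared-prefix tree.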