STAPO: New Method Addresses Training Instability in Reinforcement Learning for Large Language Models

According to arxiv.org, researchers have introduced STAPO (Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens), a new approach to address training instability in reinforcement learning fine-tuning of large language models. The research, authored by Shiqi Liu and colleagues, identifies a critical problem: existing RL methods for improving LLM reasoning often suffer from “late-stage performance collapse,” leading to degraded reasoning quality and unstable training.

The team traces this instability to a small fraction of tokens (around 0.01%), termed “spurious tokens,” which contribute little to reasoning outcomes but receive disproportionately amplified gradient updates because they inherit the full sequence-level reward. According to arxiv.org, the researchers present “a unified framework for evaluating token-level optimization impacts across spurious risk, gradient norms, and entropy changes.”
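To make the idea of "silencing" tokens concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function name `masked_token_loss` and the `spurious_mask` input are hypothetical, and the rule STAPO actually uses to flag spurious tokens is not described in this article, so the mask is simply taken as given.

```python
def masked_token_loss(log_probs, sequence_advantage, spurious_mask):
    """Policy-gradient style loss in which flagged tokens contribute nothing.

    log_probs: per-token log-probabilities of the sampled response.
    sequence_advantage: one scalar advantage shared by the whole sequence.
    spurious_mask: hypothetical per-token flags (True = silence this token);
        how such flags would be computed is an assumption left open here.
    """
    # Keep only the tokens that are not flagged as spurious.
    kept = [lp for lp, flagged in zip(log_probs, spurious_mask) if not flagged]
    if not kept:
        return 0.0
    # Every kept token inherits the same sequence-level advantage.
    return -sequence_advantage * sum(kept) / len(kept)


# The low-probability middle token is flagged, so it adds no gradient signal.
loss = masked_token_loss(
    log_probs=[-0.1, -2.0, -0.3],
    sequence_advantage=1.0,
    spurious_mask=[False, True, False],
)
```

Without the mask, the token with log-probability -2.0 would dominate the sum even though it carries the same reward as its neighbors, which is the kind of amplification the article describes.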

This work addresses broader challenges in applying reinforcement learning to language models. According to separate research on arxiv.org, RL for multi-step reasoning with LLMs typically relies on sparse terminal rewards, creating a “poorly conditioned credit-assignment problem” where final feedback is propagated uniformly across all intermediate decisions, leading to high gradient variance and unstable training.
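A small Python sketch shows why a sparse terminal reward ends up spread uniformly. It assumes an undiscounted return-to-go as the credit signal, which is a common simplification rather than a detail taken from either paper; the helper name `returns_to_go` is illustrative.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return the discounted sum of future rewards at each step."""
    out = []
    running = 0.0
    # Walk backwards so each step accumulates everything that follows it.
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    out.reverse()
    return out


# Sparse terminal reward: only the final step of the reasoning chain is scored.
rewards = [0.0, 0.0, 0.0, 1.0]
credit = returns_to_go(rewards)
```

With `gamma=1.0`, every step receives the same credit as the final one, so the learning signal cannot distinguish a decisive reasoning step from an incidental token.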

The STAPO paper was published on May 26, 2026, on arxiv.org.