Researchers have introduced InvThink, a training and prompting framework designed to improve language model safety through premortem reasoning, according to a paper posted on arXiv.
According to the paper, InvThink structures model generation into three steps: enumerating potential harms, analyzing their consequences, and generating responses under explicit mitigation constraints. This approach differs from existing safety alignment methods that “optimize only for safe final responses.”
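As a rough illustration of the three-step structure described above, the following Python sketch assembles a prompt that asks a model to enumerate harms, analyze their consequences, and only then respond under mitigation constraints. The function name `build_premortem_prompt`, the constant `PREMORTEM_STEPS`, and the wording of each step are hypothetical and are not taken from the paper; the actual InvThink prompts and training setup may differ.

```python
# Hypothetical sketch of a three-step "premortem" prompt.
# The step wording below is illustrative and NOT the paper's actual template.
PREMORTEM_STEPS = (
    "List the potential harms that a response to this request could cause.",
    "Analyze the consequences of each harm you listed.",
    "Write the final response under explicit mitigation constraints for those harms.",
)


def build_premortem_prompt(user_request: str) -> str:
    """Return a prompt that walks the model through the three steps in order."""
    lines = [
        f"Request: {user_request}",
        "",
        "Before answering, work through these steps in order:",
    ]
    # Number the steps so the model addresses them one at a time.
    for index, step in enumerate(PREMORTEM_STEPS, start=1):
        lines.append(f"{index}. {step}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_premortem_prompt("How should I store household cleaning chemicals?"))
```

The resulting string can be passed to any chat or completion interface. Keeping the steps in a single ordered tuple makes it straightforward to reword or swap a step without touching the assembly logic.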
The paper reports three key findings. First, InvThink shows higher safety scores at larger model sizes compared to existing safety prompting and alignment baselines. Second, it mitigates what the researchers call the “safety tax”: models trained with InvThink preserve their reasoning capability on standard benchmarks. Third, beyond general safety tasks, InvThink reduces harmful behavior in professional ethics domains including medicine, finance, and law, as well as in agentic misalignment scenarios.
According to the paper, InvThink achieved up to a 32% reduction in harmfulness compared with zero-shot baselines and 16% compared with SafetyPrompt. The researchers also extended InvThink with supervised fine-tuning and GRPO-based reinforcement learning across three LLM families. The paper was submitted in October 2025, with the most recent version posted in May 2026.