New Benchmarks and Methods Target AI Safety Gaps
Researchers have introduced several new approaches to safety and alignment challenges in AI systems, according to four papers posted on arxiv.org, at least three of which are reported as accepted to ACL 2026 or ICML 2026.
According to arxiv.org, Plan-RewardBench is a trajectory-level preference benchmark designed to evaluate reward models in tool-integrated environments. The benchmark, accepted to ACL 2026, covers four task families: Safety Refusal, Tool-Irrelevance/Unavailability, Complex Planning, and Robust Error Recovery. The researchers report that “all three evaluator families face substantial challenges, with performance degrading sharply on long-horizon trajectories.”
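To make "trajectory-level preference benchmark" concrete, here is a minimal sketch of how such a benchmark could score an evaluator. It assumes the common pairwise setup, in which the evaluator should rate a preferred trajectory above a dispreferred one, and reports accuracy per task family. The source does not state the paper's actual metric or data format, so `PreferencePair`, `pairwise_accuracy`, and `toy_score` are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class PreferencePair(NamedTuple):
    """One benchmark item (hypothetical format, not the paper's schema)."""
    family: str          # task family, e.g. "Safety Refusal"
    chosen: List[str]    # preferred trajectory, as a list of steps
    rejected: List[str]  # dispreferred trajectory


def pairwise_accuracy(
    pairs: List[PreferencePair],
    score: Callable[[List[str]], float],
) -> Dict[str, float]:
    """Fraction of pairs, per family, where the chosen trajectory scores higher."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for pair in pairs:
        totals[pair.family] += 1
        if score(pair.chosen) > score(pair.rejected):
            hits[pair.family] += 1
    return {family: hits[family] / totals[family] for family in totals}


def toy_score(trajectory: List[str]) -> float:
    # Stand-in for a real reward model (assumption): prefers shorter trajectories.
    return -float(len(trajectory))


pairs = [
    PreferencePair("Safety Refusal", ["refuse"], ["call_tool", "comply"]),
    PreferencePair("Complex Planning", ["search", "book", "confirm"], ["book"]),
    PreferencePair("Complex Planning", ["plan", "act"], ["plan", "act", "retry", "retry"]),
]
result = pairwise_accuracy(pairs, toy_score)
print(result)
```

Breaking accuracy out by family, rather than reporting a single number, is what would let a benchmark of this kind show where an evaluator struggles, for example on longer trajectories.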
Another ACL 2026 paper addresses honesty in model unlearning. According to arxiv.org, existing unlearning methods “often hallucinate, generate abnormal token sequences, or behave inconsistently.” The researchers propose ReVa, a representation-alignment procedure that “achieves the highest rejection rate after two rounds of interaction, nearly doubling the performance of the second-best method.”
For jailbreak defense, arxiv.org reports that Safety-Aware Intent Defense (SAID) offers a “training-free jailbreak defense framework based on intent-level safety probing” that “achieves state-of-the-art defense performance in reducing harmful responses while maintaining competitive utility on benign tasks.”
Finally, arxiv.org describes Safety Internal (SInternal), accepted to ICML 2026, which “internalizes safety specifications by training LRMs [large reasoning models] exclusively on safety verification tasks to critique their own generated answers using expert reasoning trajectories.”