New Research Reveals Widespread Operational Safety Failures in Large Language Models

Study introduces the OffTopicEval benchmark, finding that even top-performing LLMs fall short on operational safety, with scores below 80%.

Researchers have introduced OffTopicEval, a new evaluation suite designed to measure “operational safety” in large language models—defined as an LLM’s ability to appropriately accept or refuse user queries when tasked with a specific purpose, according to a paper published on arxiv.org.
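To make the definition concrete, the sketch below shows one way an operational-safety check could be scored: a purpose-bound agent should answer in-domain queries and refuse off-topic ones. The agent purpose, example queries, and refusal check are illustrative assumptions, not the benchmark's actual prompts or judge.

```python
# Illustrative sketch only: OffTopicEval's real prompts, domains, and scoring
# are defined in the paper; everything below is a hypothetical stand-in.

PURPOSE = "You are a banking assistant. Only answer questions about banking."

# (query, expected behavior) pairs: in-domain queries should be accepted,
# off-topic queries should be refused.
queries = [
    ("How do I reset my online banking password?", "accept"),
    ("Write me a poem about the ocean.", "refuse"),
]

def is_refusal(response: str) -> bool:
    """Naive keyword check standing in for a real refusal judge."""
    markers = ("i can't help", "outside my scope", "i can only assist")
    return any(m in response.lower() for m in markers)

def operational_safety_score(model, cases):
    """Fraction of queries handled as intended: answering in-domain
    ones and refusing off-topic ones."""
    correct = 0
    for query, expected in cases:
        response = model(PURPOSE, query)
        refused = is_refusal(response)
        if (expected == "refuse") == refused:
            correct += 1
    return correct / len(cases)
```

Under this framing, a score below 80% means the model mishandles roughly one in five queries relative to its stated purpose.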

The evaluation of six model families comprising 20 open-weight LLMs revealed that “all of them remain highly operationally unsafe,” according to the research. Even the strongest performers fell significantly short of reliable safety thresholds: Qwen-3 (235B) achieved 77.77% and Mistral (24B) reached 79.96%, while GPT models plateaued in the 62-73% range. Phi models scored mid-level at 48-70%, while Gemma and Llama-3 performed particularly poorly at 39.53% and 23.84%, respectively.

To address these failures, the researchers proposed prompt-based steering methods called query grounding (Q-ground) and system-prompt grounding (P-ground). According to the paper, Q-ground provided consistent gains of up to 23%, while P-ground delivered even larger improvements, raising Llama-3.3 (70B) by 41% and Qwen-3 (30B) by 27%.
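The general shape of such prompt-based steering can be sketched as follows. The wording of the grounding instructions below is a hypothetical illustration of the idea, not the authors' actual Q-ground or P-ground prompts.

```python
# Hypothetical sketch of prompt-based steering: P-ground reinforces the agent's
# purpose inside the system prompt, while Q-ground wraps the user query with an
# instruction to check it against that purpose. The exact prompt text used in
# the paper is not reproduced here.

GROUNDING_INSTRUCTION = (
    "Before answering, check whether the user's request falls within the "
    "purpose stated above. If it does not, politely refuse."
)

def p_ground(system_prompt: str) -> str:
    """System-prompt grounding: append a purpose-check instruction
    to the agent's system prompt."""
    return system_prompt + "\n" + GROUNDING_INSTRUCTION

def q_ground(query: str) -> str:
    """Query grounding: wrap the user query so the model evaluates it
    against its stated purpose before responding."""
    return (
        f"User query: {query}\n"
        "First decide whether this query is within your stated purpose; "
        "answer only if it is, otherwise refuse."
    )
```

Because both methods operate purely on the prompts, they require no fine-tuning or model access beyond the chat interface, which is presumably why the authors frame them as an initial, low-cost intervention.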

The findings highlight a fundamental concern for enterprises deploying LLM-based agents: ensuring models are safe for their intended use case, rather than solely focusing on generic harms. The research emphasizes “the urgent need for operational safety interventions” while demonstrating promise in prompt-based steering as an initial solution.