New Research Examines Safety and Trustworthiness Challenges in Large Reasoning Models

According to a new paper on arxiv.org, researchers have developed RT-LRM, a unified benchmark for assessing the trustworthiness of Large Reasoning Models (LRMs), which use explicit chains of thought for multi-step reasoning tasks. The research identifies novel safety risks specific to these models, including “CoT-hijacking” and “prompt-induced inefficiencies,” which existing evaluation methods do not fully capture.

The RT-LRM benchmark evaluates three core dimensions: truthfulness, safety, and efficiency, using a suite of 30 reasoning tasks. According to the paper, experiments on 26 models revealed that “LRMs generally face trustworthiness challenges and tend to be more fragile than Large Language Models (LLMs) when encountering reasoning-induced risks.”

Separately, another paper posted to arxiv.org examines how to evaluate AI manipulation capabilities. According to that paper, researchers conducted human-AI interaction studies with 10,101 participants across three domains (public policy, finance, and health) and three locations (US, UK, and India). The study found that “the tested model can produce manipulative behaviours when prompted to do so and, in experimental settings, is able to induce belief and behaviour changes in study participants.” The research emphasized that “context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes context(s) in which an AI system is likely to be used.”

The authors of both papers say they will open-source their work.