According to MIT Technology Review, Google DeepMind is advocating for the moral behavior of large language models to be evaluated with the same rigor applied to technical capabilities like coding or mathematics.
The researchers raise concerns about potential “virtue signaling” in chatbots—questioning whether AI systems genuinely behave ethically or merely appear to do so. This scrutiny grows more important as users turn to LLMs for sensitive applications including companionship, therapy, and medical advice, the report notes.
DeepMind’s call for systematic evaluation frameworks comes as these AI systems improve and take on more consequential roles in users’ lives. The researchers suggest that current testing methods may be insufficient for assessing how these models behave when providing guidance in ethically complex situations.
The proposal reflects growing awareness in the AI research community that technical performance metrics alone cannot capture the full implications of deploying language models in roles requiring moral judgment and ethical reasoning. The report characterizes this as a push toward more comprehensive evaluation standards that account for the real-world contexts in which these systems operate.