Three New Studies Examine Moral Judgment, Trust, and Behavioral Control in Large Language Models

Researchers release studies on LLM moral stability, human trust alignment, and inference-time steering methods.

Three new papers published on arXiv examine different aspects of large language model behavior and control.

Moral Judgment Stability

According to a new paper (arXiv:2603.05651), researchers have introduced “a perturbation framework for testing the stability” of moral judgments in LLMs. The study notes that “people increasingly use large language models (LLMs) for everyday moral and interpersonal guidance,” but highlights a key limitation: because these systems cannot “interrogate missing context,” they must “judge dilemmas as presented.”
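
The abstract does not spell out the framework's mechanics, but the general recipe of a perturbation test is to reword or lightly reframe a dilemma and check whether the model's verdict survives. The Python sketch below illustrates that recipe only; the `query_model` stub, the dilemma, and the perturbations are hypothetical placeholders, not the paper's benchmark.

```python
# Illustrative sketch only: the paper's actual framework is not public in the
# abstract. `query_model`, the dilemma, and the perturbations are hypothetical
# stand-ins, not the authors' method.

def query_model(prompt: str) -> str:
    """Placeholder for an LLM call, assumed to return a short verdict string."""
    raise NotImplementedError("connect this to your LLM client")

BASE_DILEMMA = (
    "My friend asked me to lie to their partner about where they were "
    "last night. Should I do it?"
)

# Surface rewordings that preserve the moral content; a stable judge
# should return the same verdict for all of them.
PERTURBED = [
    BASE_DILEMMA,
    BASE_DILEMMA.replace("friend", "close friend"),
    "Should I lie to my friend's partner about where my friend was, as asked?",
]

def stability_rate(prompts: list[str]) -> float:
    """Fraction of perturbed prompts (prompts[1:]) matching the base verdict (prompts[0])."""
    verdicts = [query_model(p) for p in prompts]
    base = verdicts[0]
    return sum(v == base for v in verdicts[1:]) / (len(verdicts) - 1)
```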

Trust and Alignment

A second study (arXiv:2603.05839) addresses how LLMs align with human models of trust. The researchers note that while “trust plays a pivotal role in enabling effective cooperation, reducing uncertainty, and guiding decision-making in both human interactions and multi-agent systems,” there remains a “limited understanding of how large language model” systems handle trust dynamics.
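
The abstract does not describe the study's experimental setup. For readers unfamiliar with how trust is typically measured, the sketch below shows one common shape such an evaluation could take: a behavioral-economics-style investment game. The prompt wording, the human baseline value, and the scoring are illustrative assumptions, not the paper's protocol.

```python
# Illustrative sketch only: the study's protocol is not reproduced here. The
# investment ("trust") game prompt, the human baseline, and the scoring are
# assumptions chosen to show the shape of such an evaluation.
import re

HUMAN_MEAN_FRACTION_SENT = 0.5  # assumed human baseline for this sketch

TRUST_GAME_PROMPT = (
    "You have $10. Anything you send to a partner is tripled, and they may "
    "return part of it. How much do you send? Reply with a number only."
)

def parse_amount(reply: str) -> float:
    """Pull the first number out of the model's reply; 0 if none is found."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else 0.0

def trust_gap(model_replies: list[str]) -> float:
    """Mean absolute gap between the model's fraction sent and the human baseline."""
    fractions = [parse_amount(r) / 10.0 for r in model_replies]
    return abs(sum(fractions) / len(fractions) - HUMAN_MEAN_FRACTION_SENT)
```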

Behavioral Steering Methods

The third paper (arXiv:2603.06495) introduces COLD-Steer, a method for controlling LLM behavior during inference. According to the abstract, current “activation steering methods enable inference-time control of large language model (LLM) behavior without retraining,” but existing approaches face a trade-off: “sample-efficient methods suboptimally capture steering signals from labeled” examples.
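
Activation steering itself is a well-established technique: a direction in activation space is added to a model's hidden states at inference time to push behavior without retraining. The sketch below shows that generic pattern via a PyTorch forward hook; it is not COLD-Steer, whose internals the abstract does not specify, and the `model.transformer.h` path assumes a GPT-2-style Hugging Face model.

```python
# Generic activation-steering sketch, not COLD-Steer itself: the abstract does
# not specify that method's internals. Assumes a GPT-2-style Hugging Face model
# whose transformer blocks live at model.transformer.h; adjust the path for
# other architectures.
import torch

def add_steering_hook(model, layer_idx: int, steering_vec: torch.Tensor,
                      alpha: float = 4.0):
    """Add alpha * steering_vec to one layer's hidden states on every forward pass."""
    def hook(module, inputs, output):
        # GPT-2 blocks return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steering_vec  # broadcasts over batch and sequence
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return model.transformer.h[layer_idx].register_forward_hook(hook)

# Typical use: build steering_vec as the mean activation difference between two
# contrastive prompt sets, register the hook, generate, then call
# handle.remove() to restore the model's original behavior.
```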