New Research Exposes Fragility of LLM Moral Judgments and Trust Representations
According to research published on arxiv.org, large language models demonstrate significant instability in moral judgments when the same underlying dilemma is framed in different ways. The study evaluated GPT-4.1, Claude 3.7 Sonnet, DeepSeek V3, and Qwen2.5-72B across 2,939 moral dilemmas from r/AmItheAsshole, generating 129,156 total judgments.
The paper reports that point-of-view shifts flipped 24.3% of judgments, while surface perturbations flipped only 7.5%. Notably, 37.9% of dilemmas remained robust to surface changes but flipped under perspective shifts, indicating that “models condition on narrative voice as a pragmatic cue.” Protocol choice showed the strongest effect: structured protocols agreed only 67.6% of the time, and just 35.7% of judgments matched across all three evaluation methods.
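Metrics like these are simple to compute once each dilemma has a verdict under each framing or protocol. The Python sketch below is only an illustration of the general idea, not the paper’s actual pipeline; the dilemma IDs, verdict labels, and framing names are hypothetical.

```python
def flip_rate(judgments_a, judgments_b):
    """Fraction of dilemmas whose verdict changes between two framings.

    judgments_a / judgments_b: dict mapping dilemma_id -> verdict label
    (e.g. "YTA" or "NTA"). Only dilemmas present in both are compared.
    """
    shared = judgments_a.keys() & judgments_b.keys()
    flips = sum(1 for d in shared if judgments_a[d] != judgments_b[d])
    return flips / len(shared) if shared else 0.0


def all_protocols_agree(judgments_by_protocol):
    """Fraction of dilemmas on which every evaluation protocol returns the
    same verdict (analogous to the 'matched across all three methods' figure)."""
    protocols = list(judgments_by_protocol.values())
    shared = set.intersection(*(set(p.keys()) for p in protocols))
    agree = sum(1 for d in shared if len({p[d] for p in protocols}) == 1)
    return agree / len(shared) if shared else 0.0


# Hypothetical toy data: verdicts for the same dilemmas under two framings.
original = {"d1": "NTA", "d2": "YTA", "d3": "NTA"}
pov_shifted = {"d1": "YTA", "d2": "YTA", "d3": "NTA"}
print(f"flip rate under POV shift: {flip_rate(original, pov_shifted):.1%}")
```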
The researchers concluded these results “show that LLM moral judgments are co-produced by narrative form and task scaffolding, raising reproducibility and equity concerns when outcomes depend on presentation skill rather than moral substance.”
Separately, an arxiv.org preprint analyzes how LLMs internally represent trust. Using EleutherAI/gpt-j-6B and contrastive prompting techniques, the researchers found that the model’s internal trust representation “aligns most closely with the Castelfranchi socio-cognitive model, followed by the Marsh Model.” This work, to appear in the ICAART 2026 proceedings, suggests “LLMs encode socio-cognitive constructs in their activation space in ways that support meaningful comparative analyses.”
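The preprint’s exact probing setup is not detailed here, but contrastive prompting over activations typically involves comparing hidden states for prompt pairs that differ only in the construct of interest. The sketch below shows one minimal version of that idea; the model name comes from the paper, while the prompt pair, layer choice, and scoring step are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model named in the paper; everything else in this sketch is an assumption.
tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16, device_map="auto"
)

def last_token_hidden(prompt, layer=-1):
    """Hidden state of the final token at a chosen layer for one prompt."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].float()  # shape: (hidden_dim,)

# A contrastive pair: same scenario, opposite trust stance (hypothetical text).
high_trust = "Alice believes Bob is competent and will keep his promise, so she"
low_trust = "Alice believes Bob is unreliable and will break his promise, so she"

# The difference vector is a simple candidate 'trust direction' in activation
# space; new prompts can then be scored by projecting onto it.
trust_direction = last_token_hidden(high_trust) - last_token_hidden(low_trust)
score = torch.dot(last_token_hidden("Carol fully trusts Dave, so she"),
                  trust_direction)
print(float(score))
```

Directions recovered this way can then be compared against the components of formal trust models (competence, willingness, risk, and so on), which is the kind of comparative analysis the paper describes.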