New Research Reveals Social Alignment Gaps in Large Language Model Reward Systems

Study finds reward models used to align LLMs often prefer socially undesirable responses, revealing critical gaps in current evaluation methods.

A new study published on arXiv has identified significant failures in how reward models—key components used to align large language models with human preferences—handle socially consequential decisions.

According to the research paper titled “Misaligned by Reward: Socially Undesirable Preferences in LLMs,” researchers evaluated five publicly available reward models and two instruction-tuned models across four critical domains: bias, safety, morality, and ethical reasoning. The study found that “the models fall well short of strong social intelligence: they often prefer socially undesirable options, and their preferences produce systematically biased distributions.”

The researchers introduced a framework that converts social evaluation datasets into pairwise preference data to test whether reward models prefer socially undesirable responses. According to the paper, “no single model performing best overall” was identified, with substantial variation observed across domains.

The study also revealed a critical trade-off: “stronger bias avoidance can reduce sensitivity to context, revealing a key alignment trade-off between avoiding biased outcomes and preserving contextual faithfulness.”

The researchers concluded that “standard reward benchmarks are insufficient for assessing social alignment” and called for evaluations that directly measure the social preferences encoded in reward models. The findings highlight potential risks in current AI alignment approaches, as existing evaluations “focus primarily on broad instruction-following benchmarks, providing limited insight into whether these models capture socially desirable preferences.”