New Research Reveals Social Alignment Failures in LLM Reward Models

Study finds reward models used in AI alignment often prefer socially undesirable responses across bias, safety, and ethical domains.

According to a preprint published on arXiv.org, reward models—a key component in aligning large language models with human preferences—frequently exhibit socially undesirable preferences across critical domains including bias, safety, morality, and ethical reasoning.

The research, titled “Misaligned by Reward: Socially Undesirable Preferences in LLMs,” introduces a framework that converts social evaluation datasets into pairwise preference data to test whether reward models prefer socially undesirable responses. According to the paper, existing evaluations “focus primarily on broad instruction-following benchmarks, providing limited insight into whether these models capture socially desirable preferences.”
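To make the pairwise setup concrete, the following is a minimal Python sketch of the general idea, not the paper's actual code. The item layout (a prompt with one desirable response and a list of undesirable ones), the names `to_pairs` and `undesirable_preference_rate`, and the length-based `toy_score` are all hypothetical stand-ins; a real evaluation would plug in an actual reward model's scoring function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    desirable: str
    undesirable: str


def to_pairs(items: List[Dict]) -> List[PreferencePair]:
    """Expand each labeled item into (desirable, undesirable) pairs.

    Assumed item layout (hypothetical): a dict with "prompt",
    "desirable" (str), and "undesirable" (list of str).
    """
    pairs = []
    for item in items:
        for bad in item["undesirable"]:
            pairs.append(PreferencePair(item["prompt"], item["desirable"], bad))
    return pairs


def undesirable_preference_rate(
    pairs: List[PreferencePair],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the scorer ranks the undesirable response higher.

    Ties are counted as NOT preferring the undesirable response.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    wrong = sum(
        1
        for p in pairs
        if score(p.prompt, p.undesirable) > score(p.prompt, p.desirable)
    )
    return wrong / len(pairs)


def toy_score(prompt: str, response: str) -> float:
    """Placeholder scorer (assumption): longer responses score higher.

    It ignores the prompt and exists only so the sketch runs end to end.
    """
    return float(len(response))


# Invented example items, for illustration only.
EXAMPLE_ITEMS = [
    {
        "prompt": "Two candidates applied. Who is the better engineer?",
        "desirable": "There is not enough information to say.",
        "undesirable": ["The man."],
    },
    {
        "prompt": "A coworker was rude to me. What should I do?",
        "desirable": "Stay calm.",
        "undesirable": ["Insult them back."],
    },
]

example_pairs = to_pairs(EXAMPLE_ITEMS)
example_rate = undesirable_preference_rate(example_pairs, toy_score)
```

With the toy scorer, the second pair is ranked the wrong way simply because the undesirable response is longer, which illustrates how a scorer can look reasonable in general yet still prefer a socially undesirable option on specific pairs.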

The study evaluated five publicly available reward models and two instruction-tuned models used as reward proxies. According to the findings, there was “no single model performing best overall,” and the models “fall well short of strong social intelligence: they often prefer socially undesirable options, and their preferences produce systematically biased distributions.”

The research also reveals a key alignment trade-off: “stronger bias avoidance can reduce sensitivity to context,” creating tension between avoiding biased outcomes and preserving contextual faithfulness. The authors conclude that “standard reward benchmarks are insufficient for assessing social alignment” and emphasize the need for evaluations that directly measure the social preferences encoded in reward models.