New Research Examines Fairness and Explanation Quality in Large Language Models

According to research published on arxiv.org, large language models exhibit statistically significant disparities in how they justify decisions across demographic groups, a phenomenon researchers term “explanation fairness.”

The study introduces the Explanation Fairness Taxonomy (EFT), comprising five dimensions: Verbosity Disparity, Sentiment Disparity, Epistemic Hedging Disparity, Decision-Linked Explanation Disparity, and Lexical Complexity Disparity. The researchers tested the framework across 80 prompt templates in four decision domains (hiring, medical triage, credit assessment, and legal judgment) using five LLMs: GPT-4.1, Claude Sonnet, LLaMA 3.3 70B, GPT-OSS 120B, and Qwen3 32B.

The research found that “across up to 400 prompt pairs, all eight EFT metrics show statistically significant disparities,” with effect sizes (Cohen’s d) ranging from small to large. Model choice strongly influences the magnitude of the disparity, the paper reports, with Qwen3 32B exhibiting “verbosity disparities 5.9x larger than LLaMA 3.3 70B.”
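
For readers unfamiliar with the effect-size measure, the following minimal Python sketch shows how a verbosity gap between two groups of model explanations could be expressed as Cohen's d. It assumes verbosity is measured as a whitespace-delimited word count and uses the pooled-standard-deviation form of Cohen's d. The summary above does not give the paper's exact metric definitions, and the function names here are hypothetical, not taken from the study.

```python
from math import sqrt
from statistics import mean, variance


def word_count(text: str) -> int:
    """Assumed verbosity measure: number of whitespace-separated tokens."""
    return len(text.split())


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two observations")
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def verbosity_disparity(explanations_a, explanations_b):
    """Hypothetical helper: effect size of the word-count gap between two groups."""
    counts_a = [word_count(t) for t in explanations_a]
    counts_b = [word_count(t) for t in explanations_b]
    return cohens_d(counts_a, counts_b)
```

By Cohen's widely used convention, values of d near 0.2, 0.5, and 0.8 are read as small, medium, and large effects respectively, which is the scale the "small to large" description refers to.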

The study tested two prompting-based mitigation strategies, which showed “significant reductions in EFP disparity (78-95%)” but had “no significant effect on stylistic dimensions.” According to the researchers, this suggests that stylistic explanation inequalities may be “encoded in pre-training distributions and are not resolvable through deployment-level instruction alone.”

The paper offers a “reproducible measurement framework” for explanation-level fairness auditing, with implications for AI regulation and deployment practices.