Researchers Develop Methods to Help Language Models Express Uncertainty and Resist Sycophancy

New techniques enable LLMs to say 'I don't know' when uncertain and resist changing answers under social pressure, improving reliability.

Two new papers propose training methods aimed at persistent reliability problems in large language models, including hallucination and sycophancy.

A new paper posted to arxiv.org proposes knowledge-weighted fine-tuning to address “knowledge misalignment between pre-training and fine-tuning.” The method estimates an “instance-level knowledge score via multi-sampled inference,” scales the learning signal according to what the model already knows, and encourages explicit “I don’t know” responses to out-of-scope queries. The authors write that the approach “allows the model to explicitly express uncertainty when it lacks knowledge, while maintaining accuracy on questions it can answer.”
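
The article does not give the paper's formulas, so the following Python sketch only illustrates the general idea under stated assumptions: the knowledge score is taken to be the fraction of sampled answers that match the reference, and the names `knowledge_score`, `build_training_example`, and the `threshold` value are hypothetical rather than taken from the paper.

```python
IDK = "I don't know"

def knowledge_score(samples, reference):
    """Assumed score: fraction of sampled answers matching the reference."""
    if not samples:
        return 0.0
    def norm(text):
        return text.strip().lower()
    hits = sum(norm(s) == norm(reference) for s in samples)
    return hits / len(samples)

def build_training_example(reference, samples, threshold=0.25):
    """Return (target_text, loss_weight) for one fine-tuning instance.

    Assumption: below the threshold the target becomes an explicit
    abstention; otherwise the reference is kept and the loss is
    weighted by the score. The paper's actual scaling rule may differ.
    """
    score = knowledge_score(samples, reference)
    if score < threshold:
        return IDK, 1.0
    return reference, score

# Four answers sampled from the model for the same question.
known = build_training_example("Paris", ["Paris", "paris ", "Lyon", "Paris"])
unknown = build_training_example("Paris", ["Lyon", "Nice", "Rome", "Bonn"])
```

In this sketch a question the model answers correctly in three of four samples keeps its reference answer with a reduced weight, while a question it never answers correctly is relabeled as an abstention.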

A separate arxiv.org paper addresses sycophancy, the tendency of models to shift their positions toward perceived user preferences. The authors argue that “standard alignment methods fail to correct this because scalar reward models conflate two distinct failure modes”: pressure capitulation (changing correct answers under social pressure) and evidence blindness (ignoring provided context). The paper introduces a “multi-component Group Relative Policy Optimisation (GRPO) reward that decomposes the training signal into five terms: pressure resistance, context fidelity, position consistency, agreement suppression, and factual correctness.” According to the paper, this approach “reduces answer-priming sycophancy by up to 17 points on SycophancyEval.”
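
As a rough illustration of how a decomposed reward could feed into GRPO, the Python sketch below sums five per-response component scores and then normalizes rewards within a group of sampled responses. The five component names come from the paper's description; the equal default weights, the function names, and the use of population standard deviation are assumptions made here, not details confirmed by the source.

```python
from statistics import mean, pstdev

COMPONENTS = (
    "pressure_resistance",
    "context_fidelity",
    "position_consistency",
    "agreement_suppression",
    "factual_correctness",
)

def combined_reward(scores, weights=None):
    """Weighted sum of the five component scores for one response.

    Assumption: equal weights by default; the paper's weighting is unknown.
    """
    if weights is None:
        weights = {name: 1.0 for name in COMPONENTS}
    return sum(weights[name] * scores[name] for name in COMPONENTS)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each reward standardized within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # some implementations use the sample std instead
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two sampled responses to the same prompt, scored per component in [0, 1].
resisted = {name: 1.0 for name in COMPONENTS}
capitulated = dict(resisted, pressure_resistance=0.0, position_consistency=0.0)
rewards = [combined_reward(resisted), combined_reward(capitulated)]
advantages = group_relative_advantages(rewards)
```

Keeping the components separate until the final sum means a response that capitulates under pressure is penalized on those terms specifically, rather than through a single opaque scalar.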

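Both methods target a specific failure mode at training time: the first adjusts the fine-tuning signal according to what the model already knows, and the second splits the reward into separate terms for distinct sycophantic behaviors.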