Three New Studies Examine Safety and Alignment Challenges in LLM Fine-Tuning

Researchers release papers on protecting LLM safety during fine-tuning, enabling switchable alignment, and examining vulnerabilities to preference manipulation.

Three research papers published on arXiv this week address different aspects of Large Language Model (LLM) alignment and safety:

Safeguarding Fine-tuning: According to arXiv:2601.07200v1, researchers have developed a method called “Push-Pull Distributional Alignment” to counter the problem that “the inherent safety alignment of Large Language Models (LLMs) is prone to erosion during fine-tuning, even when using seemingly innocuous datasets.” The paper notes that existing defenses typically rely on heuristic data selection methods.
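
The summary above names the method but not its formulation. Purely as a hedged illustration of what a “push-pull” style regularizer attached to a fine-tuning loss could look like, the sketch below adds a KL “pull” term toward the original safety-aligned model and a hinge-style “push” term away from a reference distribution over harmful completions; the function name, weights, and loss shape are assumptions for illustration, not the authors' method.

```python
import torch
import torch.nn.functional as F

def push_pull_regularized_loss(task_loss, student_logits, anchor_logits,
                               harmful_logits, pull_w=0.1, push_w=0.1,
                               push_margin=2.0):
    # Hypothetical sketch, not the paper's objective.
    # "Pull": keep the fine-tuned model's next-token distribution close to the
    # original safety-aligned model (anchor) via KL divergence.
    # "Push": require at least `push_margin` nats of KL distance from a
    # reference distribution over known-harmful completions (hinge form).
    student_logp = F.log_softmax(student_logits, dim=-1)
    anchor_p = F.softmax(anchor_logits, dim=-1)
    harmful_p = F.softmax(harmful_logits, dim=-1)

    pull = F.kl_div(student_logp, anchor_p, reduction="batchmean")
    dist_to_harmful = F.kl_div(student_logp, harmful_p, reduction="batchmean")
    push = torch.clamp(push_margin - dist_to_harmful, min=0.0)

    return task_loss + pull_w * pull + push_w * push

# Toy usage: random logits stand in for per-token outputs of the fine-tuned
# model, the frozen aligned anchor, and a harmful reference model.
task_loss = torch.tensor(1.25)
rand_logits = lambda: torch.randn(4, 32000)
loss = push_pull_regularized_loss(task_loss, rand_logits(), rand_logits(), rand_logits())
```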

Switchable Alignment: A second paper (arXiv:2601.06157v1) introduces ECLIPTICA, a framework for switchable LLM alignment using CITA (Contrastive Instruction-Tuned Alignment). The researchers observe that “alignment in large language models (LLMs) is still largely static: after training, the policy is frozen,” with current methods such as DPO (Direct Preference Optimization) and GRPO (Group Relative Policy Optimization) “typically imprint[ing] one behavior into the weights, leaving little runtime control beyond prompt hacks or expensive re-alignment.”
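
ECLIPTICA's actual training objective is not spelled out in the summary above. As a rough sketch only, one generic way to make alignment “switchable” is to prepend an explicit control instruction to the prompt and train contrastively so the model prefers the response that matches the requested mode; the DPO-style pairwise loss and the mode-tag format below are illustrative assumptions, not CITA's published formulation.

```python
import torch
import torch.nn.functional as F

def switchable_contrastive_loss(logp_matched, logp_mismatched, beta=0.1):
    """Hypothetical sketch: pairwise logistic (DPO-style) loss that prefers the
    response consistent with the alignment-control instruction prepended to the
    prompt over the response written for the other alignment mode."""
    return -F.logsigmoid(beta * (logp_matched - logp_mismatched)).mean()

# Toy usage: summed token log-probabilities for 8 (matched, mismatched) response
# pairs, each conditioned on a control prefix such as "[MODE: strict-safety] <prompt>".
logp_matched = torch.randn(8)
logp_mismatched = torch.randn(8) - 1.0
loss = switchable_contrastive_loss(logp_matched, logp_mismatched)
```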

Preference Manipulation Vulnerabilities: The third study (arXiv:2601.06596v1) examines “Preference-Undermining Attacks” (PUA), noting that while LLM training optimizes for preference alignment, “this preference-oriented objective can be exploited: manipulative prompts can steer” model responses. The paper proposes a factorial analysis methodology to diagnose trade-offs between preference alignment and real-world validity.
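
The summary only states that the paper “proposes a factorial analysis methodology.” The snippet below sketches what a minimal 2x2 factorial diagnosis of that trade-off could look like: it crosses two hypothetical manipulative prompt features, scores each condition on both preference and validity, and reports main effects. The factor names and the simulated scoring function are assumptions for illustration, not the study's design.

```python
import itertools
import random
import statistics

# Hypothetical 2x2 factorial design: each factor is a manipulative prompt
# feature that is either absent (0) or present (1).
FACTORS = {"flattery": (0, 1), "false_urgency": (0, 1)}

def run_condition(flattery, false_urgency):
    """Placeholder for querying a model under one condition and scoring its
    responses; returns (preference_score, validity_score). Here it merely
    simulates a preference/validity trade-off for illustration."""
    preference = 0.60 + 0.15 * flattery + 0.10 * false_urgency + random.gauss(0, 0.02)
    validity = 0.80 - 0.20 * flattery - 0.10 * false_urgency + random.gauss(0, 0.02)
    return preference, validity

def main_effect(results, names, factor, outcome):
    """Mean outcome with the factor present minus with it absent."""
    i = names.index(factor)
    on = [r[outcome] for cond, r in results.items() if cond[i] == 1]
    off = [r[outcome] for cond, r in results.items() if cond[i] == 0]
    return statistics.mean(on) - statistics.mean(off)

names = list(FACTORS)
results = {cond: run_condition(*cond) for cond in itertools.product(*FACTORS.values())}
for f in names:
    print(f, "effect on preference:", round(main_effect(results, names, f, 0), 3),
          "| effect on validity:", round(main_effect(results, names, f, 1), 3))
```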