Three New Studies Examine Fine-Tuning Safety, Security, and Mechanisms in Large Language Models

Recent research explores LLM safety degradation during fine-tuning, secure code generation methods, and the mathematical basis of parameter-efficient training.

Three recent arXiv preprints address critical aspects of fine-tuning large language models (LLMs), focusing on safety preservation, secure code generation, and understanding training mechanisms.

Safety Degradation During Fine-Tuning

According to arXiv:2505.14185v3, “Safety Subspaces are Not Linearly Distinct,” safety alignment in LLMs is brittle: “further fine-tuning, even on benign or lightly contaminated data, can degrade safety” in models that have already been aligned to produce socially acceptable responses.

Secure Code Generation Approach

A separate study (arXiv:2602.07422v1) titled “Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model” addresses LLMs’ tendency to generate insecure code, which the authors identify as “a major barrier to real-world deployment.” The paper notes that “existing secure code alignment methods often suffer from a functionality-security” trade-off.
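To make the functionality–security trade-off concrete, here is a minimal illustrative sketch (not the paper’s implementation; all names are hypothetical) of how an RL reward might blend a functionality signal with a vulnerability penalty from a reward model:

```python
# Illustrative sketch only: a scalar reward that trades off functional
# correctness against detected vulnerabilities, in the spirit of
# RL-based secure code alignment. Not the paper's actual method.

def combined_reward(functional_score: float,
                    vulnerability_score: float,
                    security_weight: float = 0.5) -> float:
    """Blend a functionality score (e.g., unit-test pass rate in [0, 1])
    with a vulnerability penalty (e.g., a reward model's estimate in [0, 1]).
    Raising security_weight pushes the policy toward safer code, at some
    potential cost to functionality -- the trade-off the paper describes."""
    return (1.0 - security_weight) * functional_score \
        - security_weight * vulnerability_score

# Example: code that passes 90% of tests but looks moderately insecure.
r = combined_reward(functional_score=0.9, vulnerability_score=0.4)
print(round(r, 2))  # 0.25
```

With `security_weight=0.5`, passing tests and avoiding vulnerabilities are weighted equally; tuning that weight is one simple way to navigate the trade-off the authors describe.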

Understanding Fine-Tuning Mechanisms

The third paper (arXiv:2602.08239v1), “Linearization Explains Fine-Tuning in Large Language Models,” examines Parameter-Efficient Fine-Tuning (PEFT) techniques. According to the abstract, “the mechanisms underlying their training performance and generalization remain underexplored,” suggesting the study provides new insights into how these widely used adaptation methods function.
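For readers unfamiliar with PEFT, the sketch below illustrates one widely used technique, LoRA (low-rank adaptation), in plain NumPy. This is background context, not the paper’s method: the pretrained weight `W` stays frozen, and only a low-rank correction `B @ A` is trained.

```python
import numpy as np

# Illustrative LoRA sketch: the frozen weight W is adapted via a
# low-rank update W + (alpha / r) * B @ A, so only the small matrices
# A and B hold trainable parameters. Shapes here are arbitrary.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 4

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Forward pass: base output plus scaled low-rank correction."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# At initialization B == 0, so the adapter leaves the base model unchanged.
assert np.allclose(lora_forward(x), W @ x)
```

The trainable parameter count drops from `d_out * d_in` to `r * (d_out + d_in)`, which is why such methods are parameter-efficient; understanding why this works so well is the kind of question the paper addresses.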