New Fine-Tuning Method Improves Large Language Model Safety Through Embedding Space Separation

Researchers propose ES2, a technique that enhances LLM safety by increasing distance between harmful and safe representations in embedding space.

Researchers have developed a new approach to improving the safety of large language models (LLMs) against harmful prompts, according to a paper posted on arXiv.

The method, called Embedding Space Separation (ES2), works by “explicitly enlarging the distance between harmful and safe representations in the embedding space” through representation-level fine-tuning, according to the paper. The technique builds on recent findings that “the latent representations (embeddings) of harmful and safe queries in LLMs typically exhibit linear separability.”
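The paper does not spell out its exact objective in the quoted passages, but the idea of "enlarging the distance" between the two classes of representations can be sketched with a simple margin-based term over class centroids. Everything below (the hinge form, the `margin` parameter, the use of centroids) is an illustrative assumption, not the authors' loss:

```python
import numpy as np

def separation_loss(harmful_emb, safe_emb, margin=1.0):
    """Hinge-style term that pushes the centroids of harmful and safe
    query embeddings at least `margin` apart. Illustrative sketch only;
    the ES2 paper's actual objective may differ."""
    mu_h = harmful_emb.mean(axis=0)   # centroid of harmful embeddings
    mu_s = safe_emb.mean(axis=0)      # centroid of safe embeddings
    dist = np.linalg.norm(mu_h - mu_s)
    # Penalty is zero once the centroids are separated by the margin,
    # so minimizing it enlarges the inter-class distance.
    return max(0.0, margin - dist)

# Toy 2-D embeddings: clusters near (0, 0) and (3, 0) are already
# farther apart than the margin, so the loss vanishes.
harmful = np.array([[0.0, 0.1], [0.1, -0.1]])
safe = np.array([[3.0, 0.0], [2.9, 0.2]])
print(separation_loss(harmful, safe, margin=1.0))  # -> 0.0
```

Minimizing such a term during fine-tuning would increase the separability the paper reports, which is why linearly separable representations make this representation-level intervention plausible.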

To preserve the model’s general capabilities during safety improvements, the researchers introduced a Kullback-Leibler (KL) divergence regularization term into the loss function. This term “constrains the logits of the fine-tuned model to align with those of the original base model on harmless inputs,” according to the paper.
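A KL penalty of this kind can be written down directly from the quoted description. The direction of the divergence (base relative to fine-tuned) and the averaging over inputs are assumptions here, since the paper's formula is not quoted:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_regularizer(finetuned_logits, base_logits):
    """KL(base || fine-tuned), averaged over a batch of harmless inputs.
    Minimizing this pulls the fine-tuned model's output distribution back
    toward the frozen base model's, preserving general capabilities.
    The direction of the KL is an assumption."""
    p = softmax(base_logits)        # frozen base model (reference)
    q = softmax(finetuned_logits)   # model being fine-tuned
    return np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1))

# Identical logits incur zero penalty; the penalty grows as the
# fine-tuned model drifts from the base model on harmless inputs.
base = np.array([[2.0, 0.5, -1.0]])
print(kl_regularizer(base, base))  # -> 0.0 (up to floating point)
```

In a full training loop, this term would be added to the safety objective with a weighting coefficient, so the combined loss trades off separation of harmful representations against fidelity to the base model on benign inputs.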

The researchers evaluated their method on several open-source LLMs using standard safety benchmarks. According to the paper, “extensive experimental results demonstrate that our approach substantially improves model safety while maintaining comparable general capabilities.”

The work addresses a pressing problem in AI safety: despite LLMs’ impressive capabilities, the paper notes, “ensuring their safety against harmful prompts remains a critical challenge.” The ES2 method represents a targeted approach to improving safety at the representation level rather than through traditional alignment techniques alone.