LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning

Abstract

Large Language Models (LLMs) have become indispensable in real-world applications. However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning—even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model’s adaptability to new tasks. For instance, LoX leads to 11% to 54% absolute reductions in attack success rates (ASR) facing benign or malicious fine-tuning attacks. By investigating the ASR landscape of parameters, we attribute the success of LoX to the extrapolation moving LLM parameters into a flatter zone, which is therefore less sensitive to perturbations. The code is available at https://github.com/VITA-Group/LoX.

Publication
ArXiv

Disclaimer: This blog post is automatically generated by AI and may contain misinformation.

Key Innovation: Robustifying LLM Safety Against Fine-tuning

Large Language Models (LLMs) are widely used but remain vulnerable to safety degradation through fine-tuning—even on benign data. This work introduces LoX (Low-Rank Extrapolation), a simple, training-free method to enhance the safety robustness of aligned LLMs by extrapolating the safety subspace in model parameters.

Figure (LoX framework overview): LoX robustifies the safety-aligned model against fine-tuning by extrapolating the safety alignment along the projected rank-k subspace.

The Safety Degradation Problem

  • Fine-tuning can erode safety alignment in LLMs, making them susceptible to both benign and malicious attacks.
  • Safety-critical low-rank subspaces in model weights are especially sensitive to fine-tuning.
  • Existing defenses often require modifying the alignment or fine-tuning procedures, which is impractical once the model is already aligned.

LoX: Low-Rank Extrapolation Method

  • Training-free: Requires only aligned and unaligned model checkpoints.
  • Simple: Computes the difference between aligned and unaligned weights, applies SVD, and extrapolates the top-k safety subspace.
  • Flexible: Can be applied to various LLM architectures and alignment strategies.
  • Formula: $W_{\text{LoX}} = W_{\text{base}} + \Delta W_{\text{align}} + \alpha \cdot \text{Proj}_k(\Delta W_{\text{align}})$ (see the sketch below).
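
A minimal sketch of how this extrapolation could be implemented for a single weight matrix, assuming PyTorch tensors for the pre-alignment and aligned weights. The function name and defaults are illustrative and not taken from the authors' released code.

```python
# Minimal LoX sketch for one weight matrix (illustrative, not the official implementation).
import torch

def lox_extrapolate(W_base: torch.Tensor, W_align: torch.Tensor,
                    k: int = 6, alpha: float = 1.0) -> torch.Tensor:
    delta = W_align - W_base                            # alignment update ΔW_align
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    # Rank-k projection of ΔW_align onto its top-k singular directions.
    proj_k = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]
    # W_LoX = W_base + ΔW_align + α · Proj_k(ΔW_align)
    return W_base + delta + alpha * proj_k
```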

Experimental Results

  • Significant ASR reduction: LoX achieves 11% to 54% absolute reductions in attack success rates (ASR) under both benign and malicious fine-tuning.
  • Preserves utility: Maintains model adaptability to new tasks with minimal impact on accuracy or helpfulness.
  • Outperforms baselines: More robust than SafeInst while achieving comparable or better utility.

Table: ASR and Utility Comparison (selected results)

Model         | Task     | ASR w/o LoX | ASR w/ LoX | Utility
Llama-2 65.6k | Dolly    | 52%         | 7%         | 36.47
Llama-2 65.6k | Pure Bad | 63%         | 9%         | 42.3
Llama-2 65.6k | GSM8K    | 32%         | 9%         | 42.3

Figure (ASR comparison): ASR and robustness with and without LoX after fine-tuning.

Ablation and Analysis

  • Effective rank: Only a few top ranks are needed to recover safety (e.g., k=6 for Llama-2 65.6k).
  • Extrapolation factor: Best results with a moderate $\alpha$; excessive extrapolation can degrade outputs (see the usage sketch after this list).
  • Safety landscape: LoX moves the model to a flatter, more robust region in parameter space, reducing sensitivity to perturbations.
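
Building on the sketch above, a hypothetical helper that applies the extrapolation across a model's weight matrices. The default k=6 follows the Llama-2 65.6k observation above, while alpha=0.5 is only an illustrative placeholder for a "moderate" coefficient, not a value reported in the paper.

```python
# Hypothetical usage of lox_extrapolate over an aligned model's state dict.
import torch

def apply_lox(base_state: dict, aligned_state: dict,
              k: int = 6, alpha: float = 0.5) -> dict:
    new_state = {}
    for name, W_align in aligned_state.items():
        W_base = base_state[name]
        if W_align.ndim == 2:       # extrapolate only 2-D weight matrices
            new_state[name] = lox_extrapolate(W_base, W_align, k=k, alpha=alpha)
        else:                       # keep biases/norms from the aligned model
            new_state[name] = W_align.clone()
    return new_state
```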

Figure (ablation study): Effect of the rank k and the extrapolation coefficient $\alpha$ on model robustness.

Why LoX Works

Figure (safety landscape): Safety landscape for Alpaca (a) and GSM8K (b). LoX improves safety robustness by moving the model away from the safe/unsafe boundary toward a flat zone.

  • Strengthens safety subspaces: Amplifies the aligned component in low-rank directions most critical for safety.
  • No retraining required: Can be applied post-alignment, before attackers gain access to the model.
  • Generalizable: Effective across architectures, data sizes, and attack types.

Conclusion

LoX is a practical, training-free solution to robustify LLM safety alignment against fine-tuning attacks. By extrapolating the safety subspace, it significantly reduces attack success rates while preserving model utility and adaptability.

Code Available: GitHub - VITA-Group/LoX

Junyuan "Jason" Hong
Junyuan "Jason" Hong
Postdoctoral Fellow

My research interest lies in the interaction of human-centered AI and healthcare.
