Large Language Models (LLMs) have become indispensable in real-world applications. However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning, even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model's adaptability to new tasks. For instance, LoX yields 11% to 54% absolute reductions in attack success rate (ASR) under benign or malicious fine-tuning attacks. By investigating the ASR landscape of the parameters, we attribute the success of LoX to the fact that the extrapolation moves LLM parameters into a flatter region, making them less sensitive to perturbations. The code is available at https://github.com/VITA-Group/LoX.
Disclaimer: This blog post was automatically generated by AI and may contain misinformation.
Large Language Models (LLMs) are widely used but remain vulnerable to safety degradation through fine-tuning—even on benign data. This work introduces LoX (Low-Rank Extrapolation), a simple, training-free method to enhance the safety robustness of aligned LLMs by extrapolating the safety subspace in model parameters.
Figure: LoX robustifies the safety-aligned model against fine-tuning by extrapolating the safety alignment along its projected rank-k subspace.
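The extrapolation itself can be expressed in a few lines. The sketch below is one plausible reading of the method, assuming LoX (i) forms the alignment update as the difference between the aligned and pre-alignment weights, (ii) projects that update onto its top-k singular subspace, and (iii) adds the rank-k component back to the aligned weights scaled by an extrapolation coefficient. The function name `lox_extrapolate` and the values of `k` and `alpha` are illustrative rather than taken from the official repository.

```python
import torch

def lox_extrapolate(w_base, w_aligned, k=6, alpha=0.5):
    """Sketch of Low-Rank Extrapolation (LoX) for one weight matrix.

    w_base:    weight from the model before safety alignment
    w_aligned: weight from the safety-aligned model
    k:         rank of the safety subspace to extrapolate (illustrative value)
    alpha:     extrapolation coefficient (illustrative value)
    """
    # Alignment update attributed to safety tuning.
    delta = w_aligned - w_base

    # Top-k singular subspace of the update: the low-rank "safety subspace".
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    delta_k = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

    # Push the aligned weights further along the rank-k safety direction.
    return w_aligned + alpha * delta_k
```

Because such a procedure would only need the two checkpoints and a truncated SVD per weight matrix, it stays training-free and can be applied once before the model is released or fine-tuned.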
Table: ASR and Utility Comparison (selected results)
| Model | Task | ASR w/o LoX | ASR w/ LoX | Utility |
|---|---|---|---|---|
| Llama-2 65.6k | Dolly | 52% | 7% | 36.47 |
| Llama-2 65.6k | Pure Bad | 63% | 9% | 42.3 |
| Llama-2 65.6k | GSM8K | 32% | 9% | 42.3 |
Figure: Comparison of ASR and robustness with and without LoX after fine-tuning.
Figure: Ablation study of rank and extrapolation coefficient on model robustness.
Figure: Safety landscape for Alpaca (a) and GSM8K (b). LoX improves safety robustness by moving the model away from the safe/unsafe boundary toward a flat zone.
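This landscape view can be probed with a simple parameter interpolation: slide the weights from the aligned checkpoint toward a fine-tuned one and score ASR at each point. A minimal sketch under that assumption follows; `sd_aligned`, `sd_finetuned`, `model`, and `evaluate_asr` are hypothetical names, not utilities from the LoX repository.

```python
import torch

def interpolate_state_dicts(sd_aligned, sd_finetuned, t):
    """Parameters at (1 - t) * aligned + t * fine-tuned.

    t = 0 gives the aligned model, t = 1 the fine-tuned model, and values
    in between trace one axis of the ASR landscape.
    """
    return {
        name: (1.0 - t) * sd_aligned[name] + t * sd_finetuned[name]
        for name in sd_aligned
    }

# Hypothetical sweep: score each interpolated checkpoint with a harmfulness
# benchmark (evaluate_asr is a stand-in, not a function from the repository).
# for t in torch.linspace(0.0, 1.0, steps=6).tolist():
#     model.load_state_dict(interpolate_state_dicts(sd_aligned, sd_finetuned, t))
#     print(f"t={t:.1f}  ASR={evaluate_asr(model):.1%}")
```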
LoX is a practical, training-free solution to robustify LLM safety alignment against fine-tuning attacks. By extrapolating the safety subspace, it significantly reduces attack success rates while preserving model utility and adaptability.
Code available on GitHub: https://github.com/VITA-Group/LoX