Refusal in Language Models Is Mediated by a Single Direction
arxiv(2024)
摘要
Conversational large language models are fine-tuned for both
instruction-following and safety, resulting in models that obey benign requests
but refuse harmful ones. While this refusal behavior is widespread across chat
models, its underlying mechanisms remain poorly understood. In this work, we
show that refusal is mediated by a one-dimensional subspace, across 13 popular
open-source chat models up to 72B parameters in size. Specifically, for each
model, we find a single direction such that erasing this direction from the
model's residual stream activations prevents it from refusing harmful
instructions, while adding this direction elicits refusal on even harmless
instructions. Leveraging this insight, we propose a novel white-box jailbreak
method that surgically disables refusal with minimal effect on other
capabilities. Finally, we mechanistically analyze how adversarial suffixes
suppress propagation of the refusal-mediating direction. Our findings
underscore the brittleness of current safety fine-tuning methods. More broadly,
our work showcases how an understanding of model internals can be leveraged to
develop practical methods for controlling model behavior.
更多查看译文
AI 理解论文
溯源树
样例
![](https://originalfileserver.aminer.cn/sys/aminer/pubs/mrt_preview.jpeg)
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要