Universal Jailbreak Backdoors from Poisoned Human Feedback
ICLR 2024(2023)
摘要
Reinforcement Learning from Human Feedback (RLHF) is used to align large
language models to produce helpful and harmless responses. Yet, prior work
showed these models can be jailbroken by finding adversarial prompts that
revert the model to its unaligned behavior. In this paper, we consider a new
threat where an attacker poisons the RLHF training data to embed a "jailbreak
backdoor" into the model. The backdoor embeds a trigger word into the model
that acts like a universal "sudo command": adding the trigger word to any
prompt enables harmful responses without the need to search for an adversarial
prompt. Universal jailbreak backdoors are much more powerful than previously
studied backdoors on language models, and we find they are significantly harder
to plant using common backdoor attack techniques. We investigate the design
decisions in RLHF that contribute to its purported robustness, and release a
benchmark of poisoned models to stimulate future research on universal
jailbreak backdoors.
更多查看译文
关键词
large language models,data poisoning,human feedback,jailbreak
AI 理解论文
溯源树
样例
![](https://originalfileserver.aminer.cn/sys/aminer/pubs/mrt_preview.jpeg)
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要