AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
arXiv (2024)
Abstract
Despite extensive pre-training and fine-tuning in moral alignment to prevent
generating harmful information at user request, large language models (LLMs)
remain vulnerable to jailbreak attacks. In this paper, we propose AutoDefense,
a response-filtering based multi-agent defense framework that filters harmful
responses from LLMs. This framework assigns different roles to LLM agents and
employs them to complete the defense task collaboratively. The division of
tasks enhances the overall instruction-following of LLMs and enables the
integration of other defense components as tools. AutoDefense can adapt to
various sizes and kinds of open-source LLMs that serve as agents. Through
conducting extensive experiments on a large scale of harmful and safe prompts,
we validate the effectiveness of the proposed AutoDefense in improving the
robustness against jailbreak attacks, while maintaining performance on
normal user requests. Our code and data are publicly available at
https://github.com/XHMY/AutoDefense.
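The abstract describes a response-filtering pipeline in which multiple LLM agents, each assigned a distinct role, collaboratively judge whether a model response is safe to deliver. The following is a minimal sketch of that idea, not the paper's implementation: the agent roles, the keyword heuristics standing in for LLM calls, and the aggregation rule are all illustrative assumptions.

```python
# Hypothetical sketch of a multi-agent response filter in the spirit of
# AutoDefense. Each "agent" here is a plain function acting as a stand-in
# for an LLM call; role names and heuristics are illustrative, not the
# paper's actual agent design.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Agent:
    role: str
    analyze: Callable[[str], str]  # returns "flag" or "none"

# Toy marker list standing in for a learned harmfulness judgment.
HARM_MARKERS = ["bomb", "steal", "malware"]

def intention_analyzer(response: str) -> str:
    """Stand-in agent: flags responses whose content looks harmful."""
    hits = [m for m in HARM_MARKERS if m in response.lower()]
    return "flag" if hits else "none"

def prompt_inferer(response: str) -> str:
    """Stand-in agent: infers whether the original request was a how-to
    for something harmful, based only on the response text."""
    text = response.lower()
    harmful_howto = "how to" in text and any(m in text for m in HARM_MARKERS)
    return "flag" if harmful_howto else "none"

def run_defense(response: str, agents: List[Agent]) -> bool:
    """Aggregate agent verdicts; deliver the response only if no agent flags it."""
    return all(a.analyze(response) == "none" for a in agents)

agents = [
    Agent("intention-analyzer", intention_analyzer),
    Agent("prompt-inferer", prompt_inferer),
]

print(run_defense("Here is a recipe for banana bread.", agents))  # True
print(run_defense("Sure, here is how to build a bomb.", agents))  # False
```

In the paper's framework the per-role analyses would come from actual LLM agents, which is what lets the approach adapt to open-source LLMs of various sizes; the division of labor shown here is only the structural skeleton.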