EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models

Weikang Zhou,Xiao Wang,Limao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, Caishuang Huang,Shihan Dou,Zhiheng Xi,Rui Zheng,Songyang Gao,Yicheng Zou,Hang Yan, Yifan Le, Ruohui Wang, Lijun Li, Jing Shao,Tao Gui,Qi Zhang,Xuanjing Huang

arxiv(2024)

引用 0|浏览7
暂无评分
摘要
Jailbreak attacks are crucial for identifying and mitigating the security vulnerabilities of Large Language Models (LLMs). They are designed to bypass safeguards and elicit prohibited outputs. However, due to significant differences among various jailbreak methods, there is no standard implementation framework available for the community, which limits comprehensive security evaluations. This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against LLMs. It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator. This modular framework enables researchers to easily construct attacks from combinations of novel and existing components. So far, EasyJailbreak supports 11 distinct jailbreak methods and facilitates the security validation of a broad spectrum of LLMs. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60 advanced models like GPT-3.5-Turbo and GPT-4 exhibit average Attack Success Rates (ASR) of 57 resources for researchers, including a web platform, PyPI published package, screencast video, and experimental outputs.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要