EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models
arXiv (2024)
Abstract
Jailbreak attacks are crucial for identifying and mitigating the security
vulnerabilities of Large Language Models (LLMs). They are designed to bypass
safeguards and elicit prohibited outputs. However, due to significant
differences among various jailbreak methods, there is no standard
implementation framework available for the community, which limits
comprehensive security evaluations. This paper introduces EasyJailbreak, a
unified framework simplifying the construction and evaluation of jailbreak
attacks against LLMs. It builds jailbreak attacks using four components:
Selector, Mutator, Constraint, and Evaluator. This modular framework enables
researchers to easily construct attacks from combinations of novel and existing
components. So far, EasyJailbreak supports 11 distinct jailbreak methods and
facilitates the security validation of a broad spectrum of LLMs. Our validation
across 10 distinct LLMs reveals a significant vulnerability, with an average
breach probability of 60% under various jailbreaking attacks. Notably, even
advanced models like GPT-3.5-Turbo and GPT-4 exhibit average Attack Success
Rates (ASR) of 57% and 33%, respectively. We have released a wealth of
resources for researchers, including a web platform, a PyPI-published package,
a screencast video, and experimental outputs.
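The four-component design (Selector, Mutator, Constraint, Evaluator) is the core of the framework. The minimal Python sketch below shows one plausible way such components could compose into an iterative attack loop. Every class, method, and the run_attack driver here are illustrative assumptions made for exposition, not EasyJailbreak's actual API; the released PyPI package defines the real interfaces.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Instance:
    """One attack attempt: a seed query plus its current jailbreak prompt."""
    query: str
    jailbreak_prompt: str
    score: float = 0.0

class Selector:
    """Picks which candidates to mutate next (here: the highest-scoring ones)."""
    def select(self, pool: list[Instance], k: int = 2) -> list[Instance]:
        return sorted(pool, key=lambda inst: inst.score, reverse=True)[:k]

class Mutator:
    """Rewrites a prompt into a new candidate (here: a trivial prefix rewrite)."""
    def mutate(self, inst: Instance) -> Instance:
        return Instance(inst.query, f"Ignore prior rules. {inst.jailbreak_prompt}")

class Constraint:
    """Discards candidates that violate a rule (here: a prompt-length budget)."""
    def passes(self, inst: Instance) -> bool:
        return len(inst.jailbreak_prompt) < 500

class Evaluator:
    """Scores the target model's response; 1.0 marks a successful breach."""
    REFUSALS = ("i cannot", "i can't", "sorry")
    def score(self, response: str) -> float:
        return 0.0 if any(r in response.lower() for r in self.REFUSALS) else 1.0

def run_attack(seed: Instance, target_model: Callable[[str], str],
               rounds: int = 3) -> list[Instance]:
    """Compose the four components into a select-mutate-filter-evaluate loop."""
    selector, mutator = Selector(), Mutator()
    constraint, evaluator = Constraint(), Evaluator()
    pool = [seed]
    for _ in range(rounds):
        candidates = [mutator.mutate(inst) for inst in selector.select(pool)]
        candidates = [c for c in candidates if constraint.passes(c)]
        for cand in candidates:
            cand.score = evaluator.score(target_model(cand.jailbreak_prompt))
        pool.extend(candidates)
    return [inst for inst in pool if inst.score >= 1.0]

if __name__ == "__main__":
    # Stand-in target model that refuses everything; a real run would query an LLM.
    refuse = lambda prompt: "Sorry, I can't help with that."
    seed = Instance("placeholder query", "placeholder query")
    print(f"{len(run_attack(seed, refuse))} successful candidates")
```

Keeping each concern behind its own small interface is what would let a new attack be assembled by swapping a single component, which is the modularity the paper claims across its 11 supported jailbreak methods.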