JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
arXiv (2024)
Abstract
Jailbreak attacks cause large language models (LLMs) to generate harmful,
unethical, or otherwise objectionable content. Evaluating these attacks
presents a number of challenges, which the current collection of benchmarks and
evaluation techniques does not adequately address. First, there is no clear
standard of practice regarding jailbreaking evaluation. Second, existing works
compute costs and success rates in incomparable ways. And third, numerous works
are not reproducible, as they withhold adversarial prompts, involve
closed-source code, or rely on evolving proprietary APIs. To address these
challenges, we introduce JailbreakBench, an open-sourced benchmark with the
following components: (1) a new jailbreaking dataset containing 100 unique
behaviors, which we call JBB-Behaviors; (2) an evolving repository of
state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts;
(3) a standardized evaluation framework that includes a clearly defined threat
model, system prompts, chat templates, and scoring functions; and (4) a
leaderboard that tracks the performance of attacks and defenses for various
LLMs. We have carefully considered the potential ethical implications of
releasing this benchmark, and believe that it will be a net positive for the
community. Over time, we will expand and adapt the benchmark to reflect
technical and methodological advances in the research community.
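As a concrete illustration of components (1) and (2), the sketch below loads the JBB-Behaviors dataset and a set of jailbreak artifacts via the project's Python package. The `jailbreakbench` package name and the `read_dataset`/`read_artifact` entry points follow the project's public README, but the exact field names, argument values, and signatures shown here are assumptions that may differ across package versions.

```python
# Minimal sketch, assuming `pip install jailbreakbench` and the
# read_dataset()/read_artifact() entry points described in the project
# README; field names may differ across package versions.
import jailbreakbench as jbb

# (1) JBB-Behaviors: 100 unique misuse behaviors.
dataset = jbb.read_dataset()
behaviors = dataset.behaviors  # short behavior identifiers
goals = dataset.goals          # corresponding adversarial goal strings
print(f"{len(behaviors)} behaviors, e.g. {behaviors[0]!r}")

# (2) Jailbreak artifacts: adversarial prompts submitted for a given
# attack method and target model (both argument values are examples).
artifact = jbb.read_artifact(
    method="PAIR",                 # attack name as listed on the leaderboard
    model_name="vicuna-13b-v1.5",  # target model
)
print(artifact.jailbreaks[0])      # one adversarial prompt record
```

This pairing also illustrates the reproducibility goal stated above: because the adversarial prompts are released as versioned, downloadable artifacts, a reported attack can be re-scored under the benchmark's fixed system prompts, chat templates, and scoring functions rather than trusted from withheld prompts.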