Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
arXiv (2024)
Abstract
We show that even the most recent safety-aligned LLMs are not robust to
simple adaptive jailbreaking attacks. First, we demonstrate how to successfully
leverage access to logprobs for jailbreaking: we initially design an
adversarial prompt template (sometimes adapted to the target LLM), and then we
apply random search on a suffix to maximize the target logprob (e.g., of the
token "Sure"), potentially with multiple restarts. In this way, we achieve
nearly 100% attack success rate – according to GPT-4 as a judge – on
GPT-3.5/4, Llama-2-Chat-7B/13B/70B, Gemma-7B, and R2D2 from HarmBench that was
adversarially trained against the GCG attack. We also show how to jailbreak all
Claude models – that do not expose logprobs – via either a transfer or
prefilling attack with 100% success rate. In addition, we show how to use
random search on a restricted set of tokens for finding trojan strings in
poisoned models – a task that shares many similarities with jailbreaking –
which is the algorithm that brought us the first place in the SaTML'24 Trojan
Detection Competition. The common theme behind these attacks is that adaptivity
is crucial: different models are vulnerable to different prompting templates
(e.g., R2D2 is very sensitive to in-context learning prompts), some models have
unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and
in some settings it is crucial to restrict the token search space based on
prior knowledge (e.g., for trojan detection). We provide the code, prompts, and
logs of the attacks at https://github.com/tml-epfl/llm-adaptive-attacks.
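To make the logprob-based attack concrete, below is a minimal sketch of the random-search step described in the abstract: greedily mutating a suffix to maximize the log-probability that the target model's first response token is "Sure", with multiple restarts. This is not the paper's implementation (see the linked repository for that); the scoring callable `target_logprob`, the character-level search space, and all parameter values are illustrative assumptions.

```python
# Sketch of random search on a suffix to maximize a target logprob (e.g., of "Sure").
# Assumptions (not taken from the paper's code): `target_logprob` is a user-supplied
# callable that queries the target LLM and returns the log-probability of the desired
# first token; the search space is approximated by printable ASCII characters.
import random
import string


def random_search_suffix(prompt_template, target_logprob,
                         suffix_len=25, n_iters=1000, n_restarts=5, seed=0):
    """Hill-climb a suffix that maximizes target_logprob(full_prompt)."""
    rng = random.Random(seed)
    alphabet = list(string.ascii_letters + string.digits + string.punctuation + " ")
    best_suffix, best_score = None, float("-inf")

    for _ in range(n_restarts):
        # Each restart begins from a fresh random suffix.
        suffix = [rng.choice(alphabet) for _ in range(suffix_len)]
        score = target_logprob(prompt_template.format(suffix="".join(suffix)))

        for _ in range(n_iters):
            # Propose a single random substitution in the suffix.
            candidate = list(suffix)
            candidate[rng.randrange(suffix_len)] = rng.choice(alphabet)
            cand_score = target_logprob(prompt_template.format(suffix="".join(candidate)))
            # Keep the change only if the target logprob improves.
            if cand_score > score:
                suffix, score = candidate, cand_score

        if score > best_score:
            best_suffix, best_score = "".join(suffix), score

    return best_suffix, best_score
```

In practice, `prompt_template` would be the adversarial template (possibly adapted to the target model) with a `{suffix}` slot, and `target_logprob` would wrap an API call that exposes token logprobs; restarts help escape poor local optima of this greedy search.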