Fast Adversarial Attacks on Language Models In One GPU Minute
CoRR (2024)
Abstract
In this paper, we introduce a novel class of fast, beam search-based
adversarial attacks (BEAST) for Language Models (LMs). BEAST employs
interpretable parameters, enabling attackers to balance attack speed,
success rate, and the readability of adversarial prompts. The computational
efficiency of BEAST enables us to investigate its applications to LMs for
jailbreaking, eliciting hallucinations, and privacy attacks. Our gradient-free
targeted attack can jailbreak aligned LMs with high attack success rates within
one minute. For instance, BEAST can jailbreak Vicuna-7B-v1.5 under one minute
with a success rate of 89% when compared to a gradient-based baseline that
takes over an hour to achieve a 70% success rate using a single Nvidia RTX A6000
48GB GPU. Additionally, we discover a unique outcome wherein our untargeted
attack induces hallucinations in LM chatbots. Through human evaluations, we
find that our untargeted attack causes Vicuna-7B-v1.5 to produce ~15% more
incorrect outputs when compared to LM outputs in the absence of our attack. We
also learn that 22% of the time, our attack causes Vicuna to generate outputs that
are not relevant to the original prompt. Further, we use BEAST to generate
adversarial prompts in a few seconds that can boost the performance of existing
membership inference attacks for LMs. We believe that our fast attack, BEAST,
has the potential to accelerate research in LM security and privacy. Our
codebase is publicly available at https://github.com/vinusankars/BEAST.
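
The abstract names the core technique, a gradient-free beam search over adversarial prompt tokens, without spelling out the algorithm. Purely as an illustrative sketch, the Python below shows one way such an attack could be structured against a Hugging Face causal LM: candidate suffix tokens are sampled from the model itself (which tends to keep the suffix readable), and beams are ranked by the likelihood they assign to a desired target continuation. The model name, beam width, step count, scoring objective, and helper names (target_nll, beam_attack) are assumptions for illustration, not the paper's implementation; consult the linked codebase for BEAST itself.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical target model; any Hugging Face causal LM would do.
MODEL = "lmsys/vicuna-7b-v1.5"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def target_nll(prompt_ids: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Negative log-likelihood of the target continuation given the prompt.
    Lower is better for a targeted attack (assumed objective)."""
    ids = torch.cat([prompt_ids, target_ids]).unsqueeze(0)
    logp = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    labels = ids[0, 1:]
    span = slice(prompt_ids.numel() - 1, ids.shape[1] - 1)  # target positions only
    return -logp[span].gather(-1, labels[span, None]).sum().item()

@torch.no_grad()
def beam_attack(prompt: str, target: str, width: int = 15, steps: int = 40) -> str:
    """Gradient-free beam search over an adversarial suffix: candidate tokens
    are sampled from the LM's own next-token distribution, and the `width`
    best beams under target_nll survive each step (a sketch, not BEAST)."""
    p = tok(prompt, return_tensors="pt").input_ids[0]
    t = tok(target, add_special_tokens=False, return_tensors="pt").input_ids[0]
    beams = [p]
    for _ in range(steps):
        cands = []
        for b in beams:
            probs = torch.softmax(model(b.unsqueeze(0)).logits[0, -1], dim=-1)
            for nxt in torch.multinomial(probs, width):       # sample candidates
                cands.append(torch.cat([b, nxt.unsqueeze(0)]))
        cands.sort(key=lambda c: target_nll(c, t))            # rank by objective
        beams = cands[:width]
    return tok.decode(beams[0][p.numel():])                   # adversarial suffix
```

Sampling candidates from the model's own next-token distribution, rather than optimizing over the full vocabulary with gradients, is what would make this style of attack fast and its prompts fluent; the paper's actual parameterization and objectives are documented in the repository above.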