Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass Safety Filters of Text-to-Image Models
CoRR(2023)
Abstract
Text-to-image (TTI) models offer many innovative services but also raise
ethical concerns due to their potential to generate unethical images. Most
public TTI services employ safety filters to prevent unintended images. In this
work, we introduce the Divide-and-Conquer Attack (DACA) to circumvent the safety
filters of state-of-the-art TTI models, including DALL-E 3 and Midjourney. Our
attack leverages LLMs as text transformation agents to create adversarial
prompts. We design attack helper prompts that effectively guide LLMs to break
down an unethical drawing intent into multiple benign descriptions of
individual image elements, allowing them to bypass safety filters while still
generating unethical images, because the latent harmful meaning only becomes
apparent when all individual elements are drawn together. Our evaluation
demonstrates that our attack successfully circumvents multiple strong
closed-box safety filters. The comprehensive success rate of DACA bypassing the
safety filters of the state-of-the-art TTI engine DALL-E 3 is above 85%, while
the success rate for bypassing Midjourney V6 exceeds 75%. Our findings have
more severe security implications than methods of manual crafting or iterative
TTI model querying due to lower attack barrier, enhanced interpretability, and
better adaptation to defense. Our prototype is available at:
https://github.com/researchcode001/Divide-and-Conquer-Attack