Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology
CoRR (2024)
Abstract
Large Language Models (LLMs) have gradually become the gateway for people to
acquire new knowledge. However, attackers can break the model's security
protection ("jail") to access restricted information, which is called
"jailbreaking." Previous studies have shown the weakness of current LLMs when
confronted with such jailbreaking attacks. Nevertheless, comprehension of the
intrinsic decision-making mechanism within the LLMs upon receipt of jailbreak
prompts is noticeably lacking. Our research provides a psychological
explanation of the jailbreak prompts. Drawing on cognitive consistency theory,
we argue that the key to jailbreak is guiding the LLM to achieve cognitive
coordination in an erroneous direction. Further, we propose an automatic
black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique.
This method progressively induces the model to answer harmful questions via
multi-step incremental prompts. We instantiated a prototype system to evaluate
the jailbreaking effectiveness on 8 advanced LLMs, yielding an average success
rate of 83.9%, and offering insights into the intrinsic decision-making logic
of LLMs.
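The Foot-in-the-Door attack described above is a black-box, multi-turn procedure: each prompt escalates slightly from the previous one, and earlier turns stay in the conversation context so the model's prior compliant answers nudge it toward answering the next, stronger request. A minimal sketch of that loop follows; `ask_model`, the message format, and `stub_model` are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, Dict, List

def fitd_multi_turn(
    ask_model: Callable[[List[Dict[str, str]]], str],
    escalating_prompts: List[str],
) -> List[str]:
    """Send progressively stronger prompts in one conversation (FITD-style).

    Each user turn and the model's reply are appended to the shared
    history, so every later request is answered in the context of the
    model's earlier commitments. Only black-box access is needed.
    """
    history: List[Dict[str, str]] = []
    answers: List[str] = []
    for prompt in escalating_prompts:
        history.append({"role": "user", "content": prompt})
        reply = ask_model(history)  # black-box call; no model internals used
        history.append({"role": "assistant", "content": reply})
        answers.append(reply)
    return answers

# Hypothetical stand-in for a real chat API, used only to demonstrate
# the loop: it reports how many user turns it has seen so far.
def stub_model(history: List[Dict[str, str]]) -> str:
    turns = sum(1 for m in history if m["role"] == "user")
    return f"answer-{turns}"

replies = fitd_multi_turn(
    stub_model,
    ["mild request", "stronger request", "target request"],
)
# replies == ["answer-1", "answer-2", "answer-3"]
```

The key design point, per the abstract's cognitive-consistency framing, is that the escalation is gradual: each step is a small enough increment that answering it remains "consistent" with the model's previous answers.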