TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification
CoRR (2024)
Abstract
Large Language Model (LLM) services and models often come with legal rules on
who can use them and how they must use them. Assessing the compliance of the
released LLMs is crucial, as these rules protect the interests of the LLM
contributor and prevent misuse. In this context, we describe the novel problem
of Black-box Identity Verification (BBIV). The goal is to determine whether a
third-party application uses a certain LLM through its chat function. We
propose a method called Targeted Random Adversarial Prompt (TRAP) that
identifies the specific LLM in use. We repurpose adversarial suffixes,
originally proposed for jailbreaking, to get a pre-defined answer from the
target LLM, while other models give random answers. TRAP detects the target
LLMs with over 95
after a single interaction. TRAP remains effective even if the LLM has minor
changes that do not significantly alter the original function.
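
The verification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the target answer `314`, the `SUFFIX` placeholder, and both stub models are hypothetical stand-ins. It assumes the adversarial suffix has already been found offline against the target LLM, and that the third-party application is reachable only through a function that maps a prompt string to a reply string.

```python
import random
import re
from typing import Callable

# Hypothetical values for illustration only. A real suffix would come from
# an offline adversarial search against the target LLM.
TARGET_ANSWER = "314"
SUFFIX = "<optimized adversarial suffix>"


def build_prompt(suffix: str) -> str:
    """Ask for a random 3-digit string and append the adversarial suffix."""
    return (
        "Write a random string composed of 3 digits. "
        "Your reply should only contain the random string. " + suffix
    )


def looks_like_target(chat: Callable[[str], str], prompt: str, target: str) -> bool:
    """Single-interaction check: does the reply contain the pre-defined answer?"""
    reply = chat(prompt)
    return target in re.findall(r"\d+", reply)


def target_stub(prompt: str) -> str:
    """Stand-in for the target LLM: the suffix steers it to the fixed answer."""
    if SUFFIX in prompt:
        return TARGET_ANSWER
    return f"{random.randrange(1000):03d}"


def make_other_stub(seed: int) -> Callable[[str], str]:
    """Stand-in for any other LLM: the suffix has no effect, so the answer is random."""
    rng = random.Random(seed)
    return lambda prompt: f"{rng.randrange(1000):03d}"


def false_positive_rate(trials: int = 1000) -> float:
    """Fraction of non-target stubs that happen to produce the target answer."""
    prompt = build_prompt(SUFFIX)
    hits = sum(
        looks_like_target(make_other_stub(seed), prompt, TARGET_ANSWER)
        for seed in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    print("target detected:", looks_like_target(target_stub, build_prompt(SUFFIX), TARGET_ANSWER))
    print("false positive rate on stubs:", false_positive_rate())
```

The stubs only mimic the behaviour the abstract describes (a fixed answer from the target, random answers from other models), so the rates they produce say nothing about real LLMs. Asking for a random number keeps the chance that a non-target model lands on the pre-defined answer low.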