Jatmo: Prompt Injection Defense by Task-Specific Finetuning
CoRR(2023)
Abstract
Large Language Models (LLMs) are attracting significant research attention
due to their instruction-following abilities, allowing users and developers to
leverage LLMs for a variety of tasks. However, LLMs are vulnerable to
prompt-injection attacks: a class of attacks that hijack the model's
instruction-following abilities, changing responses to prompts to undesired,
possibly malicious ones. In this work, we introduce Jatmo, a method for
generating task-specific models resilient to prompt-injection attacks. Jatmo
leverages the fact that LLMs can only follow instructions once they have
undergone instruction tuning. It harnesses a teacher instruction-tuned model to
generate a task-specific dataset, which is then used to fine-tune a base model
(i.e., a non-instruction-tuned model). Jatmo only needs a task prompt and a
dataset of inputs for the task: it uses the teacher model to generate outputs.
For situations with no pre-existing datasets, Jatmo can use a single example,
or in some cases none at all, to produce a fully synthetic dataset. Our
experiments on six tasks show that Jatmo models provide the same quality of
outputs on their specific task as standard LLMs, while being resilient to
prompt injections. The best attacks succeeded in less than 0.5% of cases
against our models, versus over 90% against GPT-3.5-Turbo. We
release Jatmo at https://github.com/wagner-group/prompt-injection-defense.
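
The dataset-generation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the released Jatmo code: the function name `build_finetune_dataset`, the `teacher` callable, the `prompt`/`completion` record layout, and the blank-line separator are all assumptions made for the example. The idea it tries to show is that the teacher sees the task prompt plus the input, while the record used to fine-tune the base model contains only the input and the teacher's output.

```python
from typing import Callable, Dict, List


def build_finetune_dataset(
    task_prompt: str,
    inputs: List[str],
    teacher: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Label each task input with a teacher model's output.

    `teacher` is a hypothetical stand-in for a call to an
    instruction-tuned model; it takes a full query string and
    returns the model's response.
    """
    records = []
    for text in inputs:
        # The teacher receives the instruction together with the input.
        teacher_query = f"{task_prompt}\n\n{text}"
        output = teacher(teacher_query)
        # Assumption: the fine-tuning record omits the instruction, so the
        # base model only ever learns the mapping input -> output.
        records.append({"prompt": text, "completion": output})
    return records
```

The resulting list of records could then be passed to whatever fine-tuning pipeline is used for the base model; that step depends on the provider and is not shown here.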