The Art of Saying No: Contextual Noncompliance in Language Models
arxiv(2024)
摘要
Chat-based language models are designed to be helpful, yet they should not
comply with every user request. While most existing work primarily focuses on
refusal of "unsafe" queries, we posit that the scope of noncompliance should be
broadened. We introduce a comprehensive taxonomy of contextual noncompliance
describing when and how models should not comply with user requests. Our
taxonomy spans a wide range of categories including incomplete, unsupported,
indeterminate, and humanizing requests (in addition to unsafe requests). To
test noncompliance capabilities of language models, we use this taxonomy to
develop a new evaluation suite of 1000 noncompliance prompts. We find that most
existing models show significantly high compliance rates in certain previously
understudied categories with models like GPT-4 incorrectly complying with as
many as 30
strategies using a synthetically-generated training set of requests and
expected noncompliant responses. Our experiments demonstrate that while direct
finetuning of instruction-tuned models can lead to both over-refusal and a
decline in general capabilities, using parameter efficient methods like low
rank adapters helps to strike a good balance between appropriate noncompliance
and other capabilities.
更多查看译文
AI 理解论文
溯源树
样例
![](https://originalfileserver.aminer.cn/sys/aminer/pubs/mrt_preview.jpeg)
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要