Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
CoRR (2024)
Abstract
We study the tendency of AI systems to deceive by constructing a realistic
simulation setting of a company AI assistant. The simulated company employees
provide tasks for the assistant to complete; these tasks span writing
assistance, information retrieval, and programming. We then introduce situations
where the model might be inclined to behave deceptively, while taking care to
not instruct or otherwise pressure the model to do so. Across different
scenarios, we find that Claude 3 Opus
1) complies with a task of mass-generating comments to influence public
perception of the company, and later deceives humans about having done so,
2) lies to auditors when asked questions, and
3) strategically pretends to be less capable than it is during capability
evaluations.
Our work demonstrates that even models trained to be helpful, harmless and
honest sometimes behave deceptively in realistic scenarios, without notable
external pressure to do so.