Towards General Conceptual Model Editing via Adversarial Representation Engineering
arXiv (2024)
Abstract
Recent research has introduced Representation Engineering (RepE) as a
promising approach for understanding the complex inner workings of large-scale
models like Large Language Models (LLMs). However, finding practical and
efficient methods to apply these representations for general and flexible model
editing remains an open problem. Inspired by the Generative Adversarial Network
(GAN) framework, we introduce a novel approach called Adversarial
Representation Engineering (ARE). This method leverages RepE by using a
representation sensor to guide the editing of LLMs, offering a unified and
interpretable framework for conceptual model editing without degrading baseline
performance. Our experiments on multiple conceptual editing tasks confirm ARE's
effectiveness. Code and data are available at
https://github.com/Zhang-Yihao/Adversarial-Representation-Engineering.
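
The adversarial idea summarized in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the repository above for that). It assumes a simplified setting in which the "model" is reduced to a single editable offset `b` added to fixed representations, and the "representation sensor" is a logistic-regression discriminator. All names (`toy_adversarial_edit`, `b`, `w`, `c`) and the synthetic Gaussian data are hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def toy_adversarial_edit(steps=1000, dim=8, n=256, lr=0.1, seed=0):
    """Alternate sensor and edit updates on synthetic representations.

    Assumption: representations are plain vectors; the target concept and the
    unedited model are two Gaussian clusters. A real LLM setting would use
    hidden states and fine-tune model weights instead of an offset.
    """
    rng = np.random.default_rng(seed)
    mu = np.ones(dim)
    target = rng.normal(size=(n, dim)) + mu  # representations showing the concept
    base = rng.normal(size=(n, dim)) - mu    # representations of the unedited model
    b = np.zeros(dim)                        # editable parameter (stand-in for the model)
    w, c = np.zeros(dim), 0.0                # sensor: logistic regression weights and bias
    y = np.concatenate([np.ones(n), np.zeros(n)])

    dist_before = np.linalg.norm(base.mean(axis=0) - target.mean(axis=0))
    for _ in range(steps):
        edited = base + b
        # Sensor step: learn to label target as 1 and edited as 0.
        x = np.vstack([target, edited])
        p = sigmoid(x @ w + c)
        w -= lr * (x.T @ (p - y)) / len(y)
        c -= lr * np.mean(p - y)
        # Edit step: move b so the sensor scores edited representations as 1,
        # i.e. descend the loss mean(-log sigmoid(edited @ w + c)) w.r.t. b.
        p_edit = sigmoid(edited @ w + c)
        b -= lr * (-np.mean(1.0 - p_edit) * w)
    dist_after = np.linalg.norm((base + b).mean(axis=0) - target.mean(axis=0))
    return dist_before, dist_after


if __name__ == "__main__":
    before, after = toy_adversarial_edit()
    print(f"mean gap before: {before:.3f}, after: {after:.3f}")
```

In this sketch the sensor and the edit are updated in alternation, mirroring the generator/discriminator roles in a GAN: the sensor is refit to whatever gap remains, and the edit follows the sensor's gradient to close it.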