Tradeoffs Between Alignment and Helpfulness in Language Models
CoRR (2024)
Abstract
Language model alignment has become an important component of AI safety,
enabling safe interactions between humans and language models by enhancing
desired behaviors and inhibiting undesired ones. It is often achieved by
tuning the model or by inserting preset alignment prompts. Recently,
representation engineering, a method that alters the model's behavior
post-training by changing its internal representations, was shown to be
effective in aligning LLMs (Zou et al., 2023a). Representation engineering
yields gains in alignment-oriented tasks such as resistance to adversarial
attacks and reduction of social biases, but has also been shown to reduce
the model's ability to perform basic tasks. In this paper we study the
tradeoff between the increase in alignment and the decrease in helpfulness
of the model. We propose a theoretical framework that provides bounds for
these two quantities, and we demonstrate their relevance empirically.
Interestingly, we find that while helpfulness generally decreases, it does
so quadratically with the norm of the representation engineering vector,
whereas alignment increases only linearly with it, indicating a regime in
which representation engineering is efficient to use. We validate our
findings empirically and chart the boundaries of the usefulness of
representation engineering for alignment.
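To make the stated scaling concrete, below is a minimal Python sketch of additive representation engineering under the abstract's claim: an intervention vector shifts a hidden representation, the alignment gain grows linearly with the intervention norm, and the helpfulness loss grows quadratically. The hidden size, the helper steer, and the constants c_align and c_help are hypothetical placeholders; only the linear-versus-quadratic scaling comes from the abstract.

```python
import numpy as np

# Minimal sketch, not the paper's method. The constants c_align and c_help,
# the hidden size, and the helper names are hypothetical; only the scalings
# (alignment gain ~ linear, helpfulness loss ~ quadratic in the norm of the
# intervention vector) are taken from the abstract.

rng = np.random.default_rng(0)
hidden_dim = 16

h = rng.standard_normal(hidden_dim)   # a hidden representation of the model
v = rng.standard_normal(hidden_dim)
v /= np.linalg.norm(v)                # unit-norm "alignment direction"

def steer(h: np.ndarray, alpha: float) -> np.ndarray:
    """Additive representation engineering: h' = h + alpha * v, so the norm
    of the intervention vector alpha * v is exactly alpha."""
    return h + alpha * v

c_align, c_help = 1.0, 0.4            # hypothetical proportionality constants

for alpha in (0.25, 0.5, 1.0, 2.0, 4.0):
    h_prime = steer(h, alpha)         # the steered representation
    gain = c_align * alpha            # alignment gain: linear in alpha
    loss = c_help * alpha ** 2        # helpfulness loss: quadratic in alpha
    print(f"norm={alpha:4.2f}  gain~{gain:4.2f}  "
          f"loss~{loss:5.2f}  net~{gain - loss:+5.2f}")
```

Under these placeholder constants the net benefit is positive up to a norm of c_align / c_help = 2.5 and negative beyond it, which is one way to picture the regime, described in the abstract, in which representation engineering is efficient to use.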