Enhancing LLM Safety via Constrained Direct Preference Optimization
CoRR (2024)
Abstract
The rapidly increasing capabilities of large language models (LLMs) raise an
urgent need to align AI systems with diverse human preferences to
simultaneously enhance their usefulness and safety, despite the often
conflicting nature of these goals. To address this important problem, a
promising approach is to enforce a safety constraint at the fine-tuning stage
through a constrained Reinforcement Learning from Human Feedback (RLHF)
framework. This approach, however, is computationally expensive and often
unstable. In this work, we introduce Constrained DPO (C-DPO), a novel extension
of the recently proposed Direct Preference Optimization (DPO) approach for
fine-tuning LLMs that is both efficient and lightweight. By integrating dual
gradient descent and DPO, our method identifies a nearly optimal trade-off
between helpfulness and harmlessness without using reinforcement learning.
Empirically, our approach provides LLMs with a safety guarantee that DPO lacks, while achieving significantly higher rewards under the same safety constraint than a recently proposed safe RLHF approach.
Warning: This paper contains example data that may be offensive or harmful.
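As a rough illustration of how dual gradient descent might be wrapped around DPO, the Python sketch below combines a standard DPO loss with a Lagrangian-style combined objective r(x, y) - λ·c(x, y) and a projected dual-ascent update of λ. The function names (`dpo_loss`, `combined_score`, `dual_update`), the idea of relabeling preference pairs by the combined score, and all hyperparameters are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss on per-sequence log-probabilities under the policy
    being fine-tuned and a frozen reference model."""
    logits = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(logits).mean()


def combined_score(reward, cost, lam):
    """Lagrangian-style combined objective r(x, y) - lam * c(x, y); for a fixed
    dual variable lam, it decides which response in a pair counts as preferred."""
    return reward - lam * cost


def dual_update(lam, avg_cost, cost_limit, dual_lr=0.05):
    """Projected dual ascent on the safety constraint E[c(x, y)] <= cost_limit."""
    return max(0.0, lam + dual_lr * (avg_cost - cost_limit))


# Toy usage with dummy per-sequence log-probabilities:
loss = dpo_loss(torch.tensor([-4.0]), torch.tensor([-6.0]),
                torch.tensor([-5.0]), torch.tensor([-5.5]))
```

In an outer loop, one would fix λ, relabel each preference pair according to `combined_score`, fine-tune the policy with `dpo_loss`, estimate the expected cost of the updated policy, and then call `dual_update`; alternating these primal and dual steps is one way to approximate the constrained optimum without reinforcement learning.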