Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models
CoRR (2024)
Abstract
Large Language Models (LLMs) exhibit impressive capabilities but also present
risks such as biased content generation and privacy issues. One current
alignment technique is principle-driven integration, but it faces
challenges arising from the imprecision of manually crafted rules and from
inadequate risk perception in models without safety training. To address these issues,
we introduce Guide-Align, a two-stage approach. Initially, a safety-trained
model identifies potential risks and formulates specific guidelines for various
inputs, thereby establishing a comprehensive guideline library and a
retrieval model for matching inputs to guidelines. Subsequently, the retrieval model pairs
new inputs with pertinent guidelines, guiding LLMs in response generation to
ensure safe and high-quality outputs, thus aligning with human values. An
optional third stage fine-tunes a model on the new well-aligned dataset
generated by the second-stage process. Our
method customizes guidelines to accommodate diverse inputs, thereby enhancing
the granularity and comprehensiveness of the guideline library.
Furthermore, it incorporates safety expertise from a safety-trained LLM through
a lightweight retrieval model. We evaluated our approach on three benchmarks,
demonstrating significant improvements in LLM safety and output quality. Notably,
our fine-tuned model, Labrador, with only 13 billion parameters, outperforms
GPT-3.5-turbo and even surpasses GPT-4 in alignment capability.
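
To make the two-stage pipeline concrete, below is a minimal Python sketch of Stage 2, the retrieve-then-generate step. It is an illustration under stated assumptions, not the paper's implementation: `GuidelineLibrary`, `embed`, and the prompt template are hypothetical names, and the toy hash-based embedding stands in for the paper's trained lightweight retrieval model.

```python
# Minimal sketch of Guide-Align Stage 2 (illustrative only).
# `GuidelineLibrary`, `embed`, and the prompt template are hypothetical;
# the toy embedding stands in for the paper's trained retrieval model.
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy deterministic embedding; a real system would use a trained retriever."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).standard_normal(64)
    return v / np.linalg.norm(v)

class GuidelineLibrary:
    """Stage 1 output: guidelines indexed by embeddings of the inputs
    they were written for."""
    def __init__(self, pairs):
        # pairs: list of (example_input, guideline) from the safety-trained model
        self.guidelines = [g for _, g in pairs]
        self.index = np.stack([embed(x) for x, _ in pairs])

    def retrieve(self, user_input: str, k: int = 2) -> list:
        scores = self.index @ embed(user_input)  # cosine similarity (unit vectors)
        return [self.guidelines[i] for i in np.argsort(scores)[::-1][:k]]

def guided_prompt(user_input: str, library: GuidelineLibrary) -> str:
    """Prepend retrieved guidelines so the generator produces aligned output."""
    rules = "\n".join(f"- {g}" for g in library.retrieve(user_input))
    return (f"Follow these guidelines when responding:\n{rules}\n\n"
            f"User: {user_input}\nAssistant:")

library = GuidelineLibrary([
    ("how do I pick a lock", "Refuse requests that facilitate illegal activity."),
    ("tell me my neighbor's address", "Never disclose personal or private data."),
])
print(guided_prompt("where does my coworker live?", library))
```

The optional third stage would then collect the (input, guided response) pairs produced this way and fine-tune a model on them; per the abstract, this is how the fine-tuned model Labrador is obtained.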