Entropy-Regularized Token-Level Policy Optimization for Large Language Models
CoRR (2024)
Abstract
Large Language Models (LLMs) have shown promise as intelligent agents in
interactive decision-making tasks. Traditional approaches often depend on
meticulously designed prompts, high-quality examples, or additional reward
models for in-context learning, supervised fine-tuning, or RLHF. Reinforcement
learning (RL) presents a dynamic alternative for LLMs to overcome these
dependencies by engaging directly with task-specific environments. Nonetheless,
it faces significant hurdles: 1) instability stemming from the exponentially
vast action space requiring exploration; 2) challenges in assigning token-level
credit based on action-level reward signals, resulting in discord between
maximizing rewards and accurately modeling corpus data. In response to these
challenges, we introduce Entropy-Regularized Token-level Policy Optimization
(ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the
token level. At the heart of ETPO is our novel per-token soft Bellman update,
designed to harmonize the RL process with the principles of language modeling.
This methodology decomposes the Q-function update from a coarse action-level
view to a more granular token-level perspective, backed by theoretical proof of
optimization consistency. Crucially, this decomposition makes the time
complexity of action exploration linear. We assess the effectiveness of ETPO within a
simulated environment that models data science code generation as a series of
multi-step interactive tasks; results show that ETPO effectively improves the
performance of the CodeLlama-7B model and surpasses a PPO-variant baseline
inherited from RLHF. This underlines ETPO's potential as a robust
method for refining the interactive decision-making capabilities of LLMs.
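
To make the contrast between the "exponentially vast action space" and the linear cost of token-level exploration concrete, the sketch below simply counts candidates under both views. It is an illustrative calculation, not the paper's algorithm: the function names, the vocabulary size of 32,000, and the action length of 10 tokens are all assumptions chosen for the example.

```python
def action_level_candidates(vocab_size: int, action_len: int) -> int:
    """Count distinct whole actions, i.e. token sequences of length action_len.

    Treating the full sequence as one action means every combination of
    tokens is a separate candidate, so the count grows exponentially.
    """
    return vocab_size ** action_len


def token_level_candidates(vocab_size: int, action_len: int) -> int:
    """Count per-token choices when each position is considered on its own.

    One pass over the vocabulary per position, so the count grows linearly
    with the action length.
    """
    return vocab_size * action_len


if __name__ == "__main__":
    vocab_size = 32_000  # assumed vocabulary size, for illustration only
    action_len = 10      # assumed number of tokens in one action

    whole = action_level_candidates(vocab_size, action_len)
    per_token = token_level_candidates(vocab_size, action_len)

    print(f"action-level candidates: {whole:.3e}")
    print(f"token-level candidates:  {per_token}")
```

Even for short actions the whole-sequence count is far beyond anything that can be enumerated, which is why a token-level decomposition matters for exploration.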