Time-Varying Weights in Multi-Reward Architecture for Deep Reinforcement Learning

IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE (2024)

Abstract
Deep Reinforcement Learning (DRL) research has recently focused on extracting more knowledge from the reward signal to improve sample efficiency. The Multi-Reward Architecture (MRA) achieves this by decomposing the original reward function into multiple sub-reward branches and training a source-specific policy branch for each one. However, existing MRAs treat all source-specific policy branches as equally important, or assign each a constant level of importance based on the current task conditions, which prevents DRL agents from prioritizing the most important branch at different stages of a task. It also necessitates a manual, time-consuming reset of the branch weights whenever task conditions change. Automating the time-varying importance assignment for branches is therefore crucial. We propose a generic MRA approach to achieve this goal, which can be applied to improve state-of-the-art (SOTA) MRA methods. First, we add a policy branch corresponding to the original reward function, allowing the MRA to learn from each sub-reward branch without losing the experience provided by the original reward. Then, we apply the Asynchronous Advantage Actor-Critic (A3C) algorithm to learn time-varying weights for all policy branches. These weights are shaped into a one-hot vector that selects the suitable policy branch for producing an action. Extensive experiments demonstrate that the proposed method effectively improves three SOTA MRA methods across four tasks in terms of episode reward, success rate, score difference, and episode duration.
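To make the branch-selection mechanism in the abstract concrete, below is a minimal sketch of the forward path only: several policy branches (one per sub-reward, plus one for the original reward) and a small weight network whose output is hardened into a one-hot vector that picks the branch producing the action. All class names, network sizes, and the argmax hardening step are illustrative assumptions; the paper's actual method trains the weights with A3C, which is not reproduced here.

```python
# Hypothetical sketch of one-hot branch selection in an MRA-style policy.
# Not the authors' implementation; names and architectures are assumptions.
import torch
import torch.nn as nn

class PolicyBranch(nn.Module):
    """One policy head; in the paper each head is trained on its own
    sub-reward (or on the original reward for the extra branch)."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # action logits for this branch

class MultiRewardPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, num_sub_rewards: int):
        super().__init__()
        # num_sub_rewards branches plus one branch for the original reward.
        self.branches = nn.ModuleList(
            PolicyBranch(obs_dim, act_dim) for _ in range(num_sub_rewards + 1)
        )
        # Weight network: emits a time-varying weight per branch.
        # (In the paper these weights are learned with A3C.)
        self.weight_net = nn.Linear(obs_dim, num_sub_rewards + 1)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        weights = self.weight_net(obs)                   # [B, K+1]
        one_hot = nn.functional.one_hot(
            weights.argmax(dim=-1), num_classes=len(self.branches)
        ).float()                                        # harden to one-hot
        logits = torch.stack(
            [b(obs) for b in self.branches], dim=1       # [B, K+1, A]
        )
        # Keep only the chosen branch's logits for action production.
        return (one_hot.unsqueeze(-1) * logits).sum(dim=1)

# Usage: sample an action from the selected branch's distribution.
policy = MultiRewardPolicy(obs_dim=8, act_dim=4, num_sub_rewards=3)
obs = torch.randn(1, 8)
action = torch.distributions.Categorical(logits=policy(obs)).sample()
```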
Key words
Task analysis, Computer architecture, Robots, Pain, Compounds, Training, Reinforcement learning, Deep reinforcement learning, multi-reward architecture, time-varying importance, A3C algorithm