Intermittent Communications in Decentralized Shadow Reward Actor-Critic.

CDC (2021)

Abstract
Broader decision-making goals such as risk-sensitivity, exploration, and incorporating prior experience motivate the study of cooperative multi-agent reinforcement learning (MARL) problems in which the objective is any nonlinear function of the team's long-term state-action occupancy measure, i.e., a general utility, which subsumes the aforementioned goals. Existing decentralized actor-critic algorithms for this problem require extensive message passing per policy update, which may be impractical. Thus, we put forth Communication-Efficient Decentralized Shadow Reward Actor-Critic (CE-DSAC), which can operate with time-varying or event-triggered network connectivity. The scheme has agents alternate between policy evaluation (critic), weighted averaging with neighbors (information mixing), and local gradient updates of their policy parameters (actor). CE-DSAC differs from the usual critic update in its local occupancy-measure estimation step, which is needed to estimate the derivative of each agent's local utility with respect to its occupancy measure, i.e., the "shadow reward," and in the number of local weighted-averaging steps executed by agents. This scheme improves existing tradeoffs between communication and convergence: to obtain ε-stationarity, we require O(1/ε^2.5) (Theorem IV.6) or, faster, O(1/ε^2) (Corollary IV.8) steps with high probability. Experiments demonstrate the merits of this approach for multiple RL agents solving cooperative navigation tasks with intermittent communications.
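
The abstract above outlines the alternating structure of CE-DSAC. Below is a minimal tabular sketch, in Python, of what one such iteration could look like: occupancy-measure estimation, shadow-reward computation, a TD(0) critic step, weighted averaging with neighbors, and a local actor step. All names (estimate_occupancy, shadow_reward, softmax_policy), the entropy-style utility, and the step sizes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of one CE-DSAC-style iteration on a tabular MDP.
# The entropy-style utility below is a stand-in for a "general utility".

rng = np.random.default_rng(0)
S, A, N = 5, 3, 4          # states, actions, number of agents
GAMMA, T = 0.9, 200        # discount factor, trajectory length

def softmax_policy(theta):
    """Row-wise softmax over actions for a tabular policy."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def sample_trajectory(pi, P, T):
    """Roll out a trajectory of (s_t, a_t, s_{t+1}) tuples under policy pi."""
    traj, s = [], int(rng.integers(S))
    for _ in range(T):
        a = rng.choice(A, p=pi[s])
        s_next = rng.choice(S, p=P[s, a])
        traj.append((s, a, s_next))
        s = s_next
    return traj

def estimate_occupancy(traj):
    """Empirical discounted state-action occupancy measure from one trajectory."""
    lam = np.zeros((S, A))
    for t, (s, a, _) in enumerate(traj):
        lam[s, a] += (1 - GAMMA) * GAMMA ** t
    return lam / max(lam.sum(), 1e-12)

def shadow_reward(lam):
    """Derivative of an entropy-style local utility F(lam) = -sum lam*log(lam)."""
    return -(np.log(lam + 1e-8) + 1.0)

# Shared random dynamics and per-agent critic/actor parameters.
P = rng.dirichlet(np.ones(S), size=(S, A))            # transition kernel P[s, a] -> distribution over s'
agents = [dict(theta=np.zeros((S, A)), V=np.zeros(S)) for _ in range(N)]
W = np.full((N, N), 1.0 / N)                          # doubly stochastic mixing matrix

for agent in agents:
    pi = softmax_policy(agent["theta"])
    traj = sample_trajectory(pi, P, T)

    # (i) Local occupancy estimation and shadow reward (input to the critic).
    lam = estimate_occupancy(traj)
    r_shadow = shadow_reward(lam)

    # (ii) Critic: TD(0) policy evaluation under the shadow reward.
    for s, a, s_next in traj:
        td = r_shadow[s, a] + GAMMA * agent["V"][s_next] - agent["V"][s]
        agent["V"][s] += 0.1 * td

    agent["traj"], agent["r_shadow"] = traj, r_shadow

# (iii) Information mixing: weighted averaging of critic parameters with neighbors.
V_mixed = W @ np.stack([agent["V"] for agent in agents])
for agent, V_new in zip(agents, V_mixed):
    agent["V"] = V_new

# (iv) Actor: local policy-gradient step using the mixed critic.
for agent in agents:
    pi = softmax_policy(agent["theta"])
    grad = np.zeros((S, A))
    for t, (s, a, s_next) in enumerate(agent["traj"]):
        adv = agent["r_shadow"][s, a] + GAMMA * agent["V"][s_next] - agent["V"][s]
        grad_log = -pi[s]
        grad_log[a] += 1.0                 # gradient of log softmax at (s, a)
        grad[s] += (GAMMA ** t) * adv * grad_log
    agent["theta"] += 0.01 * grad
```

In this sketch the mixing step stands in for the paper's intermittent communication: agents average critic parameters with neighbors for only a limited number of rounds per iteration instead of exchanging messages at every policy update.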
Keywords
CE-DSAC, communication-efficient decentralized shadow reward actor-critic, convergence, cooperative multiagent reinforcement learning problems, cooperative navigation tasks, decision-making goals, event-triggered network connectivities, intermittent communications, local gradient updates, local occupancy measure estimation step, local utility, local weighted averaging steps, message passing, multiple RL agents, nonlinear function, policy update, risk-sensitivity, shadow reward