Safe Reinforcement Learning With Linear Function Approximation

INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139(2021)

Abstract
Safety in reinforcement learning has become increasingly important in recent years. Yet, existing solutions either fail to strictly avoid choosing unsafe actions, which may lead to catastrophic results in safety-critical systems, or fail to provide regret guarantees for settings where safety constraints need to be learned. In this paper, we address both problems by first modeling safety as an unknown linear cost function of states and actions, which must always fall below a certain threshold. We then present algorithms, termed SLUCB-QVI and RSLUCB-QVI, for finite-horizon Markov decision processes (MDPs) with linear function approximation. We show that SLUCB-QVI and RSLUCB-QVI, while incurring no safety violation, achieve an Õ(κ√(d³H³T)) regret, nearly matching that of state-of-the-art unsafe algorithms, where H is the duration of each episode, d is the dimension of the feature mapping, κ is a constant characterizing the safety constraints, and T is the total number of actions played. We further present numerical simulations that corroborate our theoretical findings.
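The core safety mechanism described above — an unknown linear cost of state-action features that must stay below a threshold — can be illustrated with a minimal sketch. This is not the paper's SLUCB-QVI algorithm; it only shows the generic "pessimistic certification" idea for a linear safety constraint: given an estimate of the cost vector and a confidence radius, an action is certified safe only if even the worst-case cost inside the confidence set stays below the threshold. All names and numbers (`w_hat`, `beta`, `tau`, the example actions) are illustrative assumptions.

```python
import math

# Illustrative setup (values are assumptions, not from the paper):
# linear safety cost c(a) = <w, phi(a)> must always stay below tau.
tau = 1.0                  # safety threshold
w_hat = [0.6, 0.2]         # current estimate of the unknown cost vector w
beta = 0.3                 # confidence radius (shrinks as data accumulates)

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def certified_safe(phi, w_hat, beta, tau):
    """Pessimistic check: certify an action only if the worst-case cost
    over the confidence set {w : ||w - w_hat|| <= beta} is below tau,
    i.e. <w_hat, phi> + beta * ||phi|| <= tau."""
    return dot(w_hat, phi) + beta * norm(phi) <= tau

# Candidate actions with their feature vectors phi(a)
actions = {"a1": [1.0, 0.0], "a2": [0.5, 0.5], "a3": [1.2, 0.9]}
safe = {name for name, phi in actions.items()
        if certified_safe(phi, w_hat, beta, tau)}
# The learner would then optimize its value estimates over `safe` only.
```

Restricting optimization to this pessimistically certified set is what allows zero safety violations during learning; the price is the constant κ in the regret bound, which reflects how much slack the safety constraint leaves.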