A Simple Optimal Algorithm for the 2-Arm Bandit Problem.

SOSA (2023)

Abstract
In the bandit problem, an agent faces a slot machine with two arms; at each round they choose one to pull and receive a reward sampled randomly according to the distribution of that arm. Their goal is to maximise the total reward. In the analysis, one typically looks at the missed reward, called the regret. It is not difficult to see that the expected regret is at least $\Omega(\sqrt{T})$, where $T$ is the number of rounds played. Using Azuma's inequality one can easily create an algorithm achieving regret $O(\sqrt{T \log T})$; the additional log-factor was removed by Audibert and Bubeck (JMLR'10) by using a cleverly adapted multiplicative-weight approach.

In this paper we consider the non-stationary version, in which the underlying reward distributions may change arbitrarily. The known lower bound for this problem is $\Omega(\sqrt{LT})$, where $L$ denotes the total number of changes until time $T$. Using a multiplicative-weight approach, Auer et al. (SICOMP'02) presented an algorithm achieving regret $O(\sqrt{LT \log T})$ when the agent knows the value of $L$ (but not when the changes happen). An algorithm with a similar regret bound but based on Azuma's inequality was later presented by Garivier and Moulines (ALT'11).

We present a new algorithm using random walks with asymmetric stopping boundaries. The analysis is simple and shows that our algorithm achieves regret $O(\sqrt{LT})$, thus matching the lower bound. This is the first optimal algorithm for the bandit problem with changes.
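To make the abstract's core idea concrete, here is a minimal, hypothetical Python sketch of comparing two arms via a random walk with asymmetric stopping boundaries. It is not the paper's algorithm: the environment (`pull`), the exploration rate `explore_p`, and the boundary values `near` and `far` are all illustrative assumptions, chosen only to show how a nearby boundary can trigger a quick switch while a farther boundary keeps the walk alive so the agent stays sensitive to later distribution changes.

```python
import random

def pull(arm, t):
    """Hypothetical non-stationary environment: Bernoulli arms whose means
    swap at an illustrative change point. Stands in for the unknown rewards."""
    means = (0.6, 0.4) if t < 5000 else (0.3, 0.7)
    return 1.0 if random.random() < means[arm] else 0.0

def asymmetric_walk_bandit(T, near=-15.0, far=45.0, explore_p=0.1):
    """Illustrative sketch, NOT the paper's algorithm: compare the two arms
    via the random walk S = sum of (leader's reward - other arm's reward),
    sampled on occasional paired pulls. The stopping boundaries are
    asymmetric: hitting the near boundary (evidence against the leader)
    triggers a switch, while the far boundary merely resets the walk."""
    leader, S, total, t = 0, 0.0, 0.0, 0
    while t < T:
        if random.random() < explore_p and t + 1 < T:
            # Paired exploration pull: advance the comparison walk.
            r_lead = pull(leader, t)
            r_other = pull(1 - leader, t + 1)
            total += r_lead + r_other
            t += 2
            S += r_lead - r_other
            if S <= near:        # strong evidence the other arm is now better
                leader, S = 1 - leader, 0.0
            elif S >= far:       # leader reconfirmed; reset and keep watching
                S = 0.0
        else:
            total += pull(leader, t)  # exploit the current leader
            t += 1
    return total

if __name__ == "__main__":
    random.seed(0)
    print(f"total reward over 10000 rounds: {asymmetric_walk_bandit(10000):.0f}")
```

The asymmetry is the point of the sketch: a symmetric boundary pair would either switch too slowly or forget the leader too quickly, whereas placing the switch boundary close and the reset boundary far trades off fast commitment against continued change detection. The actual boundary choices and the resulting $O(\sqrt{LT})$ analysis are given in the paper.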
Keywords
simple optimal algorithm