Distributionally Robust Policy Evaluation under General Covariate Shift in Contextual Bandits
CoRR (2024)
Abstract
We introduce a distributionally robust approach that enhances the reliability
of offline policy evaluation in contextual bandits under general covariate
shifts. Our method aims to deliver robust policy evaluation results in the
presence of discrepancies in both the context and policy distributions between
the logging and target data. Central to our methodology is the application of
robust regression, a distributionally robust technique tailored here to improve
the estimation of the conditional reward distribution from logging data. Using
the reward model obtained from robust regression, we develop a comprehensive
suite of policy value estimators by integrating our reward model into
established evaluation frameworks, namely direct methods and doubly robust
methods.
methods. Through theoretical analysis, we further establish that the proposed
policy value estimators offer a finite sample upper bound for the bias,
providing a clear advantage over traditional methods, especially when the shift
is large. Finally, we designed an extensive range of policy evaluation
scenarios, covering diverse magnitudes of shifts and a spectrum of logging and
target policies. Our empirical results indicate that our approach significantly
outperforms baseline methods, most notably in 90
shift-only settings and 72
settings.
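As background on the estimator families the abstract names, below is a minimal sketch of a doubly robust off-policy value estimator that combines a reward model (the direct-method term) with an importance-weighted correction. This is generic illustration, not the paper's robust-regression variant; the function name, data layout, and the toy inputs in the usage example are all assumptions.

```python
import numpy as np

def doubly_robust_value(actions, rewards, logging_probs,
                        target_probs, reward_model):
    """Generic doubly robust off-policy value estimate (illustrative).

    actions       : (n,) logged action index per sample
    rewards       : (n,) observed reward per sample
    logging_probs : (n,) logging-policy probability of the logged action
    target_probs  : (n, K) target-policy probabilities over all K actions
    reward_model  : (n, K) model-predicted reward for every (context, action)
    """
    idx = np.arange(len(actions))
    # Importance weight pi_target(a|x) / pi_logging(a|x) for the logged action.
    w = target_probs[idx, actions] / logging_probs
    # Direct-method term: expected model reward under the target policy.
    dm = (target_probs * reward_model).sum(axis=1)
    # Correction term: importance-weighted residual on the logged actions;
    # it vanishes when the reward model is exact on the logged data.
    residual = rewards - reward_model[idx, actions]
    return float(np.mean(dm + w * residual))

# Usage on toy data (two actions, three logged samples; values are made up).
actions = np.array([0, 1, 0])
rewards = np.array([1.0, 0.5, 0.0])
logging_probs = np.array([0.5, 0.5, 0.5])
target_probs = np.array([[0.8, 0.2], [0.8, 0.2], [0.8, 0.2]])
reward_model = np.array([[1.0, 0.0], [0.0, 0.5], [0.0, 0.0]])
value = doubly_robust_value(actions, rewards, logging_probs,
                            target_probs, reward_model)
```

Setting `reward_model` to zero recovers the plain importance-weighted estimator, while setting the residual term to zero recovers the direct method; the paper's contribution concerns how the reward model itself is fit under shift.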