Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation
arXiv (2023)
Abstract
Off-Policy Evaluation (OPE) aims to assess the effectiveness of
counterfactual policies using only offline logged data and is often used to
identify the top-k promising policies for deployment in online A/B tests.
Existing evaluation metrics for OPE estimators primarily focus on the
"accuracy" of OPE or that of downstream policy selection, neglecting
the risk-return tradeoff in the subsequent online policy deployment. To address
this issue, we draw inspiration from portfolio evaluation in finance and
develop a new metric, called SharpeRatio@k, which measures the risk-return
tradeoff of policy portfolios formed by an OPE estimator under varying online
evaluation budgets (k). We validate our metric in two example scenarios,
demonstrating its ability to effectively distinguish between low-risk and
high-risk estimators and to accurately identify the most efficient one.
Efficiency of an estimator is characterized by its capability to form the most
advantageous policy portfolios, maximizing returns while minimizing risks
during online deployment, a nuance that existing metrics typically overlook. To
facilitate a quick, accurate, and consistent evaluation of OPE via
SharpeRatio@k, we have also integrated this metric into the open-source
software package SCOPE-RL (https://github.com/hakuhodo-technologies/scope-rl).
Employing SharpeRatio@k and SCOPE-RL, we conduct comprehensive benchmarking
experiments on various estimators and RL tasks, focusing on their risk-return
tradeoff. These experiments offer several interesting directions and
suggestions for future OPE research.
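
The abstract does not spell out the formula, so the following is only a minimal sketch of how a Sharpe-ratio-style metric over a top-k policy portfolio could be computed. It assumes the return is the best true policy value among the k policies ranked highest by the OPE estimator, measured against the value of the behavior (logging) policy as a baseline, and that the risk is the standard deviation of the true values within that portfolio. The function name, its arguments, and the choice of population standard deviation are illustrative assumptions, not the SCOPE-RL API.

```python
import numpy as np


def sharpe_ratio_at_k(estimated_values, true_values, behavior_value, k):
    """Hypothetical helper: risk-return tradeoff of a top-k policy portfolio.

    estimated_values: OPE estimates, one per candidate policy.
    true_values: ground-truth (online) values of the same policies.
    behavior_value: value of the behavior policy, used as the baseline (assumed).
    k: online evaluation budget, i.e. the number of policies deployed.
    """
    estimated_values = np.asarray(estimated_values, dtype=float)
    true_values = np.asarray(true_values, dtype=float)

    # Form the portfolio: the k policies ranked highest by the OPE estimate.
    top_k = np.argsort(-estimated_values, kind="stable")[:k]
    portfolio = true_values[top_k]

    # Return: best true value found in the portfolio, relative to the baseline.
    excess_return = portfolio.max() - behavior_value

    # Risk: spread of true values inside the portfolio (population std, assumed).
    risk = portfolio.std()
    if risk == 0.0:
        # Undefined when the portfolio has no spread, e.g. k = 1.
        return float("nan")
    return excess_return / risk


# Toy data (made up for illustration): the estimator misranks the policies.
estimates = [0.9, 0.5, 0.7, 0.1]
truth = [1.0, 3.0, 2.0, 0.0]
result = sharpe_ratio_at_k(estimates, truth, behavior_value=1.0, k=2)
single = sharpe_ratio_at_k(estimates, truth, behavior_value=1.0, k=1)
```

In this toy case the top-2 portfolio holds the policies with true values 1.0 and 2.0, so the excess return is 1.0 and the spread is 0.5. An estimator that reaches the same best policy with a tighter portfolio would score higher under this sketch, which is the low-risk versus high-risk distinction the abstract describes.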