Off-Policy Evaluation for Large Action Spaces via Policy Convolution

Developing accurate off-policy estimators is crucial for both evaluating and optimizing new policies. The main challenge in off-policy estimation is the distribution shift between the logging policy that generates data and the target policy that we aim to evaluate. Typically, techniques for correcting distribution shift involve some form of importance sampling. This approach yields unbiased value estimates but often at the cost of high variance, even in the simpler case of one-step contextual bandits. Furthermore, importance sampling relies on the common support assumption, which becomes impractical when the action space is large. To address these challenges, we introduce the Policy Convolution (PC) family of estimators. These methods leverage latent structure within actions, made available through action embeddings, to strategically convolve the logging and target policies. This convolution introduces a unique bias-variance trade-off, which can be controlled by adjusting the amount of convolution.

Introduction. Off-policy estimation (OPE) is a fundamental problem in reinforcement learning and decision making under uncertainty. It involves estimating the expected value of a target policy, given access to only an offline dataset logged by deploying a different policy, often referred to as the logging policy (see [46] for a comprehensive survey). This decoupling between data collection and policy evaluation is crucial in many real-world applications, as it allows new policies to be assessed using historical data without deploying them in the environment, which can be costly, risky, or both. In this paper, we focus on OPE for the one-step contextual bandit setting, i.e., we perform decision making with only an observed context that is assumed to be independently sampled (e.g., a user coming to a website), and do not consider any recurrent dependencies in the context transitions, as is the case in the general formulation of reinforcement learning.
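To make the setting concrete, the following is a minimal sketch of the standard importance-sampling (inverse propensity scoring) estimator for one-step contextual bandits. The function name `ips_estimate`, the layout of the log tuples, and the toy three-action policies are assumptions made for illustration; they are not taken from the paper.

```python
def ips_estimate(logs, target_policy):
    """Importance-sampling estimate of the target policy's value.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability the logging policy assigned to
          the logged action in that context (must be > 0).
    target_policy: callable mapping a context to a list of action
          probabilities (hypothetical interface for this sketch).
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to pick the logged action.
        weight = target_policy(context)[action] / logging_prob
        total += weight * reward
    return total / len(logs)


# Toy example with made-up numbers: three actions, context ignored.
def toy_target(context):
    return [0.8, 0.1, 0.1]


toy_rewards = [1.0, 0.0, 0.5]
# One logged sample per action, as if drawn from a uniform logging policy.
toy_logs = [(None, a, toy_rewards[a], 1.0 / 3.0) for a in range(3)]
toy_value = ips_estimate(toy_logs, toy_target)
```

In this toy case the weight on the first action is 0.8 / (1/3) = 2.4. As the action space grows, logging probabilities shrink and such weights can become very large, which is where the high variance comes from. An action the logging policy never takes does not appear in the logs at all, so its contribution to the target policy's value is silently missed; this is the common support problem.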

Discussion / Conclusion. In this paper, we proposed the Policy Convolution (PC) family of estimators, which leverage latent action structure specified via action embeddings to perform off-policy evaluation in large action spaces. More specifically, PC convolves both the target and logging policies according to an action-action convolution function, which induces a new kind of bias-variance trade-off controlled by the amount of convolution. In an empirical evaluation over a diverse set of off-policy estimation scenarios, we observe that estimators from the PC framework improve on existing baseline estimators by up to 5 orders of magnitude in terms of MSE, especially when (1) the action space is large, (2) the policy mismatch between logging and target policies is high, or (3) the common support assumption for importance sampling is violated. We believe that our findings can expand the potential use of off-policy estimators into new and practical scenarios, and also encourage further exploration into the use of additional structure for efficient OPE. We also discuss limitations and unexplored directions in this paper that we believe are promising for future work.
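The idea of convolving both policies before forming the importance weight can be sketched as follows. This is only one possible reading of the description above, not the authors' implementation: the Gaussian kernel over embedding distance, the `bandwidth` parameter standing in for the "amount of convolution", and all function names are assumptions introduced for illustration.

```python
import math


def convolution_matrix(embeddings, bandwidth):
    """Row-normalized Gaussian kernel between action embeddings.

    matrix[src][dst] is the share of action src's probability mass that
    is moved to action dst. The kernel choice is an assumption.
    """
    matrix = []
    for src in embeddings:
        row = []
        for dst in embeddings:
            dist_sq = sum((u - v) ** 2 for u, v in zip(src, dst))
            row.append(math.exp(-dist_sq / (2.0 * bandwidth ** 2)))
        total = sum(row)
        matrix.append([w / total for w in row])
    return matrix


def convolve_policy(probs, matrix):
    """Spread each action's probability over nearby actions."""
    n = len(probs)
    return [sum(probs[src] * matrix[src][dst] for src in range(n))
            for dst in range(n)]


def pc_weight(action, target_probs, logging_probs, matrix):
    """Ratio of convolved target to convolved logging probability."""
    smoothed_target = convolve_policy(target_probs, matrix)
    smoothed_logging = convolve_policy(logging_probs, matrix)
    return smoothed_target[action] / smoothed_logging[action]


# Toy example with made-up one-dimensional embeddings.
toy_embeddings = [[0.0], [1.0], [5.0]]
toy_target_probs = [0.7, 0.2, 0.1]
toy_logging_probs = [0.2, 0.3, 0.5]

sharp = convolution_matrix(toy_embeddings, bandwidth=1e-3)
flat = convolution_matrix(toy_embeddings, bandwidth=1e3)
weight_sharp = pc_weight(0, toy_target_probs, toy_logging_probs, sharp)
weight_flat = pc_weight(0, toy_target_probs, toy_logging_probs, flat)
```

In this sketch, a very small bandwidth makes the matrix effectively the identity, so the weight reduces to the ordinary importance weight. A very large bandwidth pushes both convolved policies toward uniform and the weight toward 1, which lowers variance at the price of bias. Because convolution spreads probability mass to neighboring actions, the convolved logging policy can be non-zero for actions the original logging policy never takes.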