Off-Policy Evaluation for Large Action Spaces via Policy Convolution

Developing accurate off-policy estimators is crucial for both evaluating and optimizing for new policies. The main challenge in off-policy estimation is the distribution shift between the logging policy that generates data and the target policy that we aim to evaluate. Typically, techniques for correcting distribution shift involve some form of importance sampling. This approach results in unbiased value estimation but often comes with the trade-off of high variance, even in the simpler case of one-step contextual bandits. Furthermore, importance sampling relies on the common support assumption, which becomes impractical when the action space is large. To address these challenges, we introduce the Policy Convolution (PC) family of estimators. These methods leverage latent structure within actions—made available through action embeddings—to strategically convolve the logging and target policies. This convolution introduces a unique bias-variance trade-off, which can be controlled by adjusting the amount of convolution.

Introduction. Off-policy estimation (OPE) is a fundamental problem in reinforcement learning and decision making under uncertainty. It involves estimating the expected value of a target policy, given access to only an offline dataset logged by deploying a different policy, often referred to as the logging policy (see [46] for a comprehensive survey). This decoupling between data collection and policy evaluation is crucial in many real-world applications, as it allows new policies to be assessed using historical data without deploying them in the environment, which can be costly, risky, or both. In this paper, we focus on OPE for the one-step contextual bandit setting, i.e., we perform decision making with only an observed context that is assumed to be independently sampled (e.g., a user coming to a website), and do not consider any recurrent dependencies in the context transitions, as is the case in the general formulation of reinforcement learning.
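To make the setting concrete, the following is a minimal sketch of the standard importance-sampling (inverse propensity scoring) estimator for a one-step contextual bandit. The toy contexts, policies, reward function, and all function names are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import random

def ips_estimate(logs, target_policy):
    """Average of reward * pi(a|x) / mu(a|x) over logged (x, a, r, mu(a|x)) tuples."""
    total = 0.0
    for context, action, reward, logging_prob in logs:
        total += reward * target_policy(context)[action] / logging_prob
    return total / len(logs)

def collect_logs(n, logging_policy, reward_fn, contexts, rng):
    """Simulate deploying the logging policy on independently sampled contexts."""
    logs = []
    for _ in range(n):
        x = rng.choice(contexts)
        probs = logging_policy(x)
        a = rng.choices(range(len(probs)), weights=probs)[0]
        logs.append((x, a, reward_fn(x, a), probs[a]))
    return logs

# Toy problem (all values hypothetical): 2 contexts, 3 actions.
def logging_policy(x):
    # Uniform logging policy: every action has support.
    return [1 / 3, 1 / 3, 1 / 3]

def target_policy(x):
    # Prefers the action whose index matches the context.
    return [0.8 if a == x else 0.1 for a in range(3)]

def reward_fn(x, a):
    # Reward 1 when the action matches the context, else 0.
    return 1.0 if a == x else 0.0

rng = random.Random(0)
logs = collect_logs(20000, logging_policy, reward_fn, [0, 1], rng)
estimate = ips_estimate(logs, target_policy)
print(f"IPS estimate: {estimate:.3f} (true value of the toy target policy: 0.8)")
```

In this toy problem the logging policy is uniform over three actions, so the importance weights stay small. With many actions, or with a logging policy that rarely takes the actions the target policy prefers, the same weights can become very large, which is the variance problem the paper sets out to address.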

Discussion / Conclusion. In this paper, we proposed the Policy Convolution (PC) family of estimators, which leverage latent action structure specified via action embeddings to perform off-policy evaluation in large action spaces. More specifically, PC convolves both the target and logging policies according to an action-action convolution function, which introduces a new kind of bias-variance trade-off controlled by the amount of convolution. Conducting empirical evaluation over a diverse set of off-policy estimation scenarios, we observe that the estimators from the PC framework enjoy up to 5 orders of magnitude improvement over existing baseline estimators in terms of MSE, especially when (1) the action space is large, (2) the policy mismatch between logging and target policies is high, or (3) the common support assumption for importance sampling is violated. We believe that our findings can expand the potential use of off-policy estimators into new and practical scenarios, and also encourage further exploration into the use of additional structure for efficient OPE. We also discuss limitations and unexplored directions that we believe are promising for future work.
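The idea of convolving both policies over action embeddings can be illustrated with a small sketch. It assumes a Gaussian kernel over one-dimensional embeddings as the action-action convolution function and uses the same bandwidth for both policies; the kernel choice, the bandwidth values, the toy data, and every function name here are assumptions made for illustration and should not be read as the paper's exact definition.

```python
import math

def spread_matrix(embeddings, bandwidth):
    """K[b][a]: share of action b's probability mass moved to action a.

    Assumed Gaussian kernel on embedding distance; each row sums to 1.
    """
    n = len(embeddings)
    K = []
    for a in range(n):
        row = []
        for b in range(n):
            sq_dist = sum((u - v) ** 2 for u, v in zip(embeddings[a], embeddings[b]))
            row.append(math.exp(-sq_dist / (2 * bandwidth ** 2)))
        row_sum = sum(row)
        K.append([w / row_sum for w in row])
    return K

def convolve(probs, K):
    """Smooth a policy over similar actions; the result is still a distribution."""
    n = len(probs)
    return [sum(probs[b] * K[b][a] for b in range(n)) for a in range(n)]

def pc_ips_estimate(logs, target_policy, logging_policy, K):
    """Importance-sampling estimate using the convolved policies in the weight."""
    total = 0.0
    for context, action, reward in logs:
        smoothed_target = convolve(target_policy(context), K)
        smoothed_logging = convolve(logging_policy(context), K)
        total += reward * smoothed_target[action] / smoothed_logging[action]
    return total / len(logs)

# Toy data (hypothetical): 4 actions in two embedding clusters, {0, 1} and {2, 3}.
embeddings = [[0.0], [0.1], [1.0], [1.1]]

def logging_policy(x):
    return [0.5, 0.0, 0.5, 0.0]   # never logs actions 1 and 3

def target_policy(x):
    return [0.0, 0.5, 0.0, 0.5]   # only takes actions 1 and 3: no common support

# Rewards depend on the cluster: 1.0 for actions 0 and 1, 0.0 for actions 2 and 3.
logs = [(0, 0, 1.0), (0, 2, 0.0)] * 50

# A near-zero bandwidth leaves the policies unchanged, i.e. plain importance sampling.
ips_like = pc_ips_estimate(logs, target_policy, logging_policy,
                           spread_matrix(embeddings, 1e-3))
pc_estimate = pc_ips_estimate(logs, target_policy, logging_policy,
                              spread_matrix(embeddings, 0.1))
print(f"no convolution: {ips_like:.3f}, with convolution: {pc_estimate:.3f}")
```

In this toy case the true value of the target policy is 0.5. Without convolution the estimate collapses to zero because the target policy puts no mass on any logged action. With convolution the estimate becomes nonzero by borrowing information from neighboring actions, though it remains biased. Increasing the bandwidth pools more actions together, which is one way to picture the bias-variance trade-off described above.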
