Efficient Reinforcement Learning via Large Language Model-based Search

Paper · arXiv 2405.15194 · Published May 24, 2024
Reinforcement Learning · Reward Models

Reinforcement Learning (RL) suffers from sample inefficiency in sparse reward domains, and the problem is more pronounced when transitions are stochastic. To improve sample efficiency, reward shaping is a well-studied approach that introduces intrinsic rewards to help the RL agent converge to an optimal policy faster. However, designing a useful reward shaping function specific to each problem is challenging, even for domain experts: they must either rely on task-specific domain knowledge or provide an expert demonstration independently for each task. Given that Large Language Models (LLMs) have rapidly gained prominence across a multitude of natural language tasks, we aim to answer the following question: Can we leverage LLMs to construct a reward shaping function that can boost the sample efficiency of an RL agent? In this work, we aim to leverage off-the-shelf LLMs to generate a guide policy by solving a simpler deterministic abstraction of the original problem; this guide policy is then used to construct the reward shaping function for the downstream RL agent.

Introduction. Sample inefficiency when training Reinforcement Learning (RL) agents in sparse reward domains has been a long-standing challenge (Ng et al. [1999], Laud and DeJong [2003], Marthi [2007], Grzes and Kudenko [2008], Devlin and Kudenko [2011]). The number of environment interactions required increases further if the domain also has stochastic transitions (Grzes [2017], Ben-Porat et al. [2024]). In an effort to improve this sample complexity, reward shaping has proven effective; it provides intrinsic rewards as a better training signal than just the sparse extrinsic (environment)

Discussion / Conclusion. Constructing task-specific reward shaping functions that yield useful intrinsic rewards for training Reinforcement Learning agents requires expert domain knowledge and engineering effort. Obtaining reward shaping functions from domain experts can also inject cognitive bias, which can influence the downstream RL training. Lately, Large Language Models have shown remarkable success in a variety of natural language tasks, while also showing limitations when prompted in planning and reasoning domains. Keeping these limitations in mind, we leveraged LLMs in this work to guide RL training in sparse reward tasks across a variety of environments from the BabyAI suite. Instead of relying only on prompting off-the-shelf LLMs or on fine-tuning them, we proposed an approach to build Model-based critiques that can be augmented with LLMs to verify their guessed outputs and provide feedback in an automated fashion. Using the plans generated this way on the deterministic abstraction of the original problem, we constructed a Potential-based Reward Shaping function. We observed a boost in the sample efficiency of downstream RL training on the original sparse reward stochastic problem using the PPO and A2C algorithms. From these results, we believe that our work can
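
To make the Potential-based Reward Shaping step concrete, here is a minimal sketch of how a plan over abstract states could be turned into a shaping term of the standard form F(s, s') = gamma * Phi(s') - Phi(s) (Ng et al. [1999]). It is an illustration under stated assumptions, not the paper's implementation: the names `potential_from_plan` and `shaping_reward`, the choice of potential (normalized progress along the plan), the zero potential for off-plan states, and the zero potential at episode end are all hypothetical choices made for this example.

```python
from typing import Dict, Hashable, Sequence

State = Hashable


def potential_from_plan(plan: Sequence[State]) -> Dict[State, float]:
    """Assign each state on the guide plan a potential in [0, 1].

    Assumption for this sketch: potential is the normalized index of the
    state along the plan, so states closer to the goal score higher.
    """
    last = max(len(plan) - 1, 1)  # avoid division by zero for one-step plans
    potential: Dict[State, float] = {}
    for i, state in enumerate(plan):
        # If a state appears more than once, keep its furthest progress.
        potential[state] = max(potential.get(state, 0.0), i / last)
    return potential


def shaping_reward(
    potential: Dict[State, float],
    state: State,
    next_state: State,
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Potential-based shaping term F = gamma * Phi(s') - Phi(s).

    States that are not on the plan get potential 0 (an assumption).
    The potential after the final transition of an episode is set to 0,
    a common convention for episodic tasks.
    """
    phi_s = potential.get(state, 0.0)
    phi_next = 0.0 if done else potential.get(next_state, 0.0)
    return gamma * phi_next - phi_s


if __name__ == "__main__":
    # Hypothetical plan over grid cells, e.g. from a deterministic abstraction.
    plan = [(0, 0), (0, 1), (1, 1)]
    phi = potential_from_plan(plan)
    extrinsic = 0.0  # sparse environment reward for a non-goal step
    shaped = extrinsic + shaping_reward(phi, (0, 0), (0, 1), gamma=0.99)
    print(f"shaped reward for a step along the plan: {shaped:.3f}")
```

In use, the shaping term would be added to the environment reward at every step before the transition is passed to the learner (for example PPO or A2C). Steps that advance along the plan receive a positive bonus and steps that move backward receive a penalty, while the potential-based form is what keeps the optimal policy of the original task unchanged.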