Chain-of-thought Reasoning Is A Policy Improvement Operator

Paper · arXiv 2309.08589 · Published September 15, 2023
Chain-of-Thought and Reasoning Methods · Reward Models

Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on large amounts of human-generated training data. We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can teach themselves new skills using chain-of-thought reasoning. During the self-learning loop, SECToR asks models to solve addition problems using chain-of-thought reasoning before training the next version of the model to solve those same problems directly without using such reasoning. This process often results in an improved model which can, when again augmented with chain-of-thought reasoning, solve even harder problems than the original model, allowing the self-learning loop to continue. Language models trained via SECToR autonomously learn to add up to 29-digit numbers without access to any ground truth examples beyond an initial supervised fine-tuning phase consisting only of numbers with 6 or fewer digits. Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, similarly to how Monte-Carlo Tree Search is used in AlphaZero (Silver et al., 2017).
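The self-learning loop described in the abstract can be illustrated with a deliberately simplified toy. This is a sketch under stated assumptions, not the SECToR implementation: `ToyModel`, `cot_add`, and `self_learning_loop` are hypothetical names, the "model" is an ordinary Python object whose skill is a single `max_digits` number, the reasoning step (peel off the last digit and solve the shorter problem directly) is one plausible decomposition chosen for illustration, and "training" is replaced by simply raising `max_digits`. It only shows the shape of the idea: reasoning on top of the current model produces labels for problems the model cannot yet solve directly, and the next model is trained to produce those labels without reasoning.

```python
import random


class ToyModel:
    """Hypothetical stand-in for a language model with a fixed skill level."""

    def __init__(self, max_digits):
        self.max_digits = max_digits

    def direct_add(self, a, b):
        # "Fast" answer without reasoning; returns None beyond the current skill.
        if max(len(str(a)), len(str(b))) > self.max_digits:
            return None
        return a + b


def cot_add(model, a, b):
    """One reasoning step: split off the last digit, solve the rest directly."""
    low = a % 10 + b % 10                      # sum of the last digits (0..18)
    high = model.direct_add(a // 10, b // 10)  # sub-problem is one digit shorter
    if high is None:
        return None
    return high * 10 + low                     # recombine; any carry is absorbed here


def self_learning_loop(model, rounds, samples=20, seed=0):
    """Repeatedly label harder problems with reasoning, then 'train' on them."""
    rng = random.Random(seed)
    for _ in range(rounds):
        n = model.max_digits + 1               # one digit beyond direct skill
        problems = [
            (rng.randrange(10 ** (n - 1), 10 ** n),
             rng.randrange(10 ** (n - 1), 10 ** n))
            for _ in range(samples)
        ]
        labels = [cot_add(model, a, b) for a, b in problems]
        if any(label is None for label in labels):
            break                              # reasoning no longer helps; stop
        # Assumption: training on self-generated labels always succeeds.
        model.max_digits = n
    return model


model = self_learning_loop(ToyModel(max_digits=6), rounds=3)
print(model.max_digits)  # 9 in this toy, because its "training" never fails
```

Because the toy's training step is perfect, it improves for as many rounds as requested. Real models make errors both when reasoning and when distilling, which is why filtering self-generated labels matters in practice.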

Introduction. Large language models are currently trained on vast corpora of human-generated data (Vaswani et al., 2023; Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020; Chowdhery et al., 2022; Touvron et al., 2023). While large language models have demonstrated many surprising capabilities, the possibility of reaching superhuman performance is a challenging proposition when training solely on existing data. In this paper, we ask whether large language models can autonomously teach themselves new skills rather than solely depending on the availability of suitable data. A positive answer to this question would open the door to a tantalizing possibility. Although the discovery of scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) for language models has created much excitement for training increasingly larger language models, these models have already consumed a significant fraction of the high-quality (textual) data on the internet.

Discussion / Conclusion. We demonstrate that chain-of-thought reasoning can serve as a policy improvement operator and present a proof-of-concept demonstration that language models can teach themselves addition. In contrast to prior efforts in which self-improvement fails after only a few steps at most, models trained with SECToR manage to stay on pace for over twenty steps. Nevertheless, numerous avenues remain unexplored in the context of self-learning with language models.

Limitations. While SECToR demonstrates the possibility of self-learning in addition with language models, it is far from showing that models can self-learn in general. A natural question is whether methods like SECToR can generalize to more complex tasks, such as multiplication or perhaps even general mathematics or programming. Secondly, models trained with SECToR do not improve forever. We speculate that a larger model, or a stronger consistency check, might allow the models to continue improving beyond 29 digits.