Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought
Large language models (LLMs) have shown remarkable reasoning capabilities given chain-of-thought prompts (examples with intermediate reasoning steps). Existing benchmarks measure reasoning ability indirectly, by evaluating accuracy on downstream tasks such as mathematical reasoning. However, it is unclear how these models obtain the answers and whether they rely on simple heuristics rather than the generated chain-of-thought. To enable systematic exploration of the reasoning ability of LLMs, we present a new synthetic question-answering dataset called PRONTOQA, where each example is generated from a synthetic world model represented in first-order logic. This allows us to parse the generated chain-of-thought into symbolic proofs for formal analysis. Our analysis on INSTRUCTGPT and GPT-3 shows that LLMs are quite capable of making correct individual deduction steps, and so are generally capable of reasoning, even in fictional contexts. However, they have difficulty with proof planning: When multiple valid deduction steps are available, they are not able to systematically explore the different options.
Introduction. The ability to reason (drawing new conclusions from provided facts) is a hallmark of human intelligence. Recently, chain-of-thought (CoT) prompting has enabled large language models (LLMs) to perform logical reasoning tasks with impressive accuracy (Wei et al., 2022; Chowdhery et al., 2022; Lewkowycz et al., 2022). In CoT prompting, each example consists of a question (e.g., “6/3 − 1?”), followed by a chain-of-thought (e.g., “6/3 is 2. 2 − 1 is 1.”), and a label (e.g., “1”). When prompted with a few CoT examples, the elicited reasoning allows LLMs to predict the label with much higher accuracy than standard question-answer prompting. However, it is unclear to what extent these models can reason due to several confounding factors. First, existing studies primarily rely on question-answering (QA) tasks from real-world settings such as math word problems (Cobbe et al., 2021; Han et al., 2022; Weston et al., 2016). It is likely that LLMs have already acquired the knowledge through pretraining and simply retrieve the answer rather than reason over it.
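As a rough illustration of the CoT example format described above, the following sketch assembles a few-shot prompt from examples that each carry a question, intermediate reasoning steps, and a label. The `CoTExample` class, the `build_prompt` helper, the "Q:"/"A:" layout, and the test question are all hypothetical choices made for this sketch; they are not the prompt format used in the experiments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CoTExample:
    """One in-context example (hypothetical container for this sketch)."""
    question: str
    chain_of_thought: str  # the intermediate reasoning steps
    label: str             # the final answer


def build_prompt(examples: List[CoTExample], test_question: str) -> str:
    """Join the worked examples, then append the unanswered test question.

    The "Q:"/"A:" layout is an assumption, not a format taken from the paper.
    """
    parts = [
        f"Q: {ex.question}\nA: {ex.chain_of_thought} The answer is {ex.label}."
        for ex in examples
    ]
    # The model is expected to continue after the final "A:" with its own
    # reasoning steps followed by a label.
    parts.append(f"Q: {test_question}\nA:")
    return "\n\n".join(parts)


# Illustrative arithmetic example and test question.
examples = [CoTExample("6/3 - 1?", "6/3 is 2. 2 - 1 is 1.", "1")]
prompt = build_prompt(examples, "8/4 + 1?")
print(prompt)
```

Standard question-answer prompting would correspond to leaving `chain_of_thought` out of each example, so that only the question and the label appear in the prompt.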
Discussion / Conclusion. INSTRUCTGPT will select the incorrect direction with some frequency and is then not able to return to the correct path. Therefore, it seems that while LLMs are able to produce valid proof steps with high probability, they have difficulty with proof planning/strategizing. We were curious whether this relationship held in smaller models. We see in figure 5 that smaller models are more prone to making invalid or non-atomic steps as their first non-canonical step. But as model size increases, these types of steps become rarer and are instead superseded by misleading steps. Looking again at figure 4, we note that many correct proofs also contain misleading steps, and so it must be the case that INSTRUCTGPT sometimes returns to the correct proof path at some point after making a misleading step. To investigate this behavior more closely, we count the number of steps that the model takes after making a misleading step until it produces a step in the gold proof and plot the histogram in figure 10 in the appendix.
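The recovery count described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code behind the figures: each predicted proof is assumed to be already parsed into per-step labels ("gold", "misleading", "other", a hypothetical labeling scheme), only the first misleading step in each proof is considered, and the count includes the gold step that ends the detour. The toy proofs are made up for illustration and are not data from the experiments.

```python
from collections import Counter
from typing import List, Optional


def steps_to_recover(labels: List[str]) -> Optional[int]:
    """Steps taken after the first misleading step until a gold-proof step.

    Returns None if the proof has no misleading step or never returns
    to the gold proof afterwards.
    """
    try:
        start = labels.index("misleading")
    except ValueError:
        return None  # no misleading step in this proof
    for offset, label in enumerate(labels[start + 1:], start=1):
        if label == "gold":
            return offset
    return None  # the model never returned to the gold proof


# Toy labeled proofs (hypothetical).
proofs = [
    ["gold", "misleading", "gold", "gold"],    # back on the gold path after 1 step
    ["misleading", "other", "other", "gold"],  # back on the gold path after 3 steps
    ["gold", "misleading", "other"],           # never recovers
    ["gold", "gold"],                          # no misleading step
]

# Histogram: number of steps to recover -> number of proofs.
histogram = Counter(
    n for n in (steps_to_recover(p) for p in proofs) if n is not None
)
print(dict(histogram))
```

Proofs that never recover are dropped from the histogram here; tracking them in a separate bucket would be an equally reasonable choice.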