Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
The capabilities and limitations of Large Language Models (LLMs) have been sketched out in great detail in recent years, providing an intriguing yet conflicting picture. On the one hand, LLMs demonstrate a general ability to solve problems. On the other hand, they show surprising reasoning gaps when compared to humans, casting doubt on the robustness of their generalisation strategies. The sheer volume of data used in the design of LLMs has precluded us from applying the method traditionally used to measure generalisation: train-test set separation. To overcome this, we study what kind of generalisation strategies LLMs employ when performing reasoning tasks by investigating the pretraining data they rely on. For two models of different sizes (7B and 35B) and 2.5B of their pretraining tokens, we identify which documents influence the model outputs for three simple mathematical reasoning tasks and contrast these with the data that are influential for answering factual questions. We find that, while the models rely on mostly distinct sets of data for each factual question, a document often has a similar influence across different reasoning questions within the same task, indicating the presence of procedural knowledge.
Introduction. Current advancements in artificial intelligence are characterised by the increasing scale of datasets, computational power, and model size (Kaplan et al., 2020; Hoffmann et al., 2022). While one of the manifestations of this approach, Large Language Models (LLMs), is rapidly saturating benchmarks measuring reasoning capabilities (Cobbe et al., 2021; Hendrycks et al., 2021, inter alia), the debate over whether they exhibit ‘genuine understanding’ is ongoing (as reviewed by Mitchell & Krakauer, 2023). The well-documented robust and versatile reasoning abilities (Webb et al., 2023; 2024; McLeish et al., 2024, inter alia) sharply contrast with the line of work highlighting the brittleness of LLM reasoning (Razeghi et al., 2022; McCoy et al., 2023; Ullman, 2023; Wu et al., 2024; Mahowald et al., 2024). A finding common to these works is that LLM reasoning depends on the frequency of similar problems in the training data. A key reason why benchmark saturation cannot be taken at face value is the issue of data contamination: benchmark data often appear in the pretraining set.
Discussion / Conclusion. In this work, we investigate what kind of generalisation strategy two LLMs (7B and 35B) employ when reasoning, and contrast it with the strategy used for a task that requires retrieving factual parametric knowledge. By creating rankings for 200 such questions over 5 million pretraining documents based on their influence on the likelihood of the completions, we conclude that the generalisation strategy for reasoning is unlike retrieval. More often than not, even if the answer to a reasoning question is part of the set of pretraining documents we look at, it does not show up as highly influential, whereas the answers to factual questions do. We find that instead, the positively influential documents often contain procedural knowledge on how to get to a solution. Further, the models rely less on individual documents when reasoning than when answering factual questions, and the set of documents they rely on is more general.
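As an illustration of the two quantities discussed above (per-question rankings of documents by influence, and how similar a document's influence is across questions), the following is a minimal sketch in Python. It assumes influence scores have already been computed and are available as one list of per-document scores per question; it does not compute influence itself. The function names (pearson, mean_pairwise_correlation, top_k_documents) and the synthetic scores are hypothetical and are not taken from the paper.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_pairwise_correlation(score_rows):
    """Average correlation over all pairs of distinct questions.

    score_rows[q][d] is the (assumed, precomputed) influence of document d
    on the completion for question q.
    """
    n = len(score_rows)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    total = sum(pearson(score_rows[i], score_rows[j]) for i, j in pairs)
    return total / len(pairs)


def top_k_documents(scores, k):
    """Indices of the k most positively influential documents for one question."""
    order = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
    return order[:k]


# Synthetic, purely illustrative scores (not real influence values).
rng = random.Random(0)
n_questions, n_docs = 5, 1000

# "Reasoning-like" questions: every question shares a common per-document
# component, mimicking documents that matter across a whole task.
shared = [rng.gauss(0.0, 1.0) for _ in range(n_docs)]
reasoning_scores = [
    [s + 0.5 * rng.gauss(0.0, 1.0) for s in shared] for _ in range(n_questions)
]

# "Factual-like" questions: independent scores, mimicking mostly distinct
# sets of influential documents per question.
factual_scores = [
    [rng.gauss(0.0, 1.0) for _ in range(n_docs)] for _ in range(n_questions)
]

reasoning_corr = mean_pairwise_correlation(reasoning_scores)
factual_corr = mean_pairwise_correlation(factual_scores)
print(f"mean pairwise correlation, reasoning-like: {reasoning_corr:.2f}")
print(f"mean pairwise correlation, factual-like:   {factual_corr:.2f}")
print("top 5 documents for the first reasoning-like question:",
      top_k_documents(reasoning_scores[0], 5))
```

In this toy setup, a high mean pairwise correlation corresponds to documents having a similar influence across questions within a task, while a correlation near zero corresponds to each question relying on its own documents. The synthetic data is constructed to show that contrast and says nothing about the magnitudes observed for real models.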