Provable Benefits of In-Tool Learning for Large Language Models
Tool-augmented language models, equipped with retrieval, memory, or external APIs, are reshaping AI, yet their theoretical advantages remain underexplored. In this paper, we address this gap by demonstrating the benefits of in-tool learning (external retrieval) over in-weight learning (memorization) for factual recall. We show that the number of facts a model can memorize solely in its weights is fundamentally limited by its parameter count. In contrast, we prove that tool-use enables unbounded factual recall via a simple and efficient circuit construction. These results are validated in controlled experiments, where tool-using models consistently outperform memorizing ones. We further show that for pretrained large language models, teaching tool-use and general rules is more effective than finetuning facts into memory. Our work provides both a theoretical and an empirical foundation, establishing why tool-augmented workflows are not just practical, but provably more scalable.
Introduction. Large Language Models (LLMs) have revolutionized artificial intelligence, redefining how machines understand and generate human language. Beyond this, LLMs are rapidly evolving from static predictors into dynamic, context-aware systems capable of reasoning, adapting, and acting over time. Coding assistants are accelerating software development and lowering the barrier to entry for programming, hinting at the emergence of highly automated, agentic workflows. This transformation is driven by advances in architecture and interaction design. Retrieval-augmented generation (RAG, Lewis et al., 2020b) enables models to access external knowledge in real time, grounding their responses in contextually relevant information. In parallel, memory augmentation, through the use of scratchpads or memory modules, empowers models to organize their reasoning, break down problems, iteratively refine outputs, and maintain coherence across extended sequences.
Discussion / Conclusion. This work provides a unified theoretical and empirical account of the tradeoff between in-weight and in-tool learning. Theoretically, we establish that the number of facts a model can store in its parameters is bounded by its size, implying that scaling knowledge capacity through model enlargement is inherently inefficient. In contrast, models that learn to interact with external tools can access unbounded factual knowledge without increasing their parameter count. Controlled experiments confirm this sharp divide: while in-weight models require ever-larger architectures to memorize growing datasets, tool-augmented models exhibit a phase transition, rapidly shifting to rule-based querying once sufficient diversity is observed. This decouples memory capacity from model size. Large-scale experiments extend these findings to pretrained models, showing that in-weight finetuning for factual recall degrades general capabilities, a consequence of limited capacity forcing new information to overwrite prior knowledge. Tool-based approaches, by externalizing factual storage, preserve core skills, reduce training costs, and introduce minimal behavioral drift.
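The contrast between the two regimes can be made concrete with a toy Python sketch. It is an illustration under stated assumptions, not the paper's construction or its exact bound. The helper `max_memorizable_facts` applies a generic counting argument: a model with a fixed number of parameters, each stored in a fixed number of bits, cannot encode more independent facts than its total bit budget allows. The `ToolUser` class stands in for a fixed-size model that has learned only the rule for forming a query. All names, and the choices of 16 bits per parameter and 1024 possible answers per fact, are hypothetical.

```python
import math


def max_memorizable_facts(num_params: int, bits_per_param: int,
                          num_answer_values: int) -> int:
    """Counting-argument upper bound on facts storable in weights.

    Assumption (for illustration only): each fact independently takes one of
    `num_answer_values` values, so it costs log2(num_answer_values) bits.
    """
    total_bits = num_params * bits_per_param
    bits_per_fact = math.log2(num_answer_values)
    return math.floor(total_bits / bits_per_fact)


class ToolUser:
    """Toy fixed-size 'model' that only knows how to form a query.

    The external database is not counted as model parameters, so the number
    of answerable facts grows with the database, not with the model.
    """

    def __init__(self, database: dict):
        self.database = database

    def answer(self, entity: str, attribute: str):
        query = (entity, attribute)      # the learned "rule": build a key
        return self.database.get(query)  # the tool call: external lookup


# In-weight capacity grows only linearly with parameter count.
small_capacity = max_memorizable_facts(1_000, 16, 1024)
large_capacity = max_memorizable_facts(1_000_000, 16, 1024)

# In-tool recall: the same ToolUser serves any database size.
facts = {("Alice", "birthplace"): "Paris", ("Bob", "birthplace"): "Lima"}
tool_model = ToolUser(facts)
recalled = tool_model.answer("Bob", "birthplace")
```

Under these assumptions the memorizing model's capacity rises only in proportion to its parameter count, while the tool-using model can answer any query the database covers without any change to the model itself.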