Understanding Tool-Integrated Reasoning

Paper · arXiv 2508.19201 · Published August 26, 2025
Reinforcement Learning

Abstract: We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable. While LLMs integrated with tools like Python code interpreters show great promise, a principled theory explaining why this paradigm is effective has been missing. This work provides the first formal proof that TIR fundamentally expands an LLM’s capabilities. We demonstrate that tools enable a strict expansion of the model’s empirical and feasible support, breaking the capability ceiling of pure-text models by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. To guide model behavior without compromising training stability and performance, we also introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that directly modifies the advantage function to guide policy behavior. We conduct comprehensive experiments on challenging mathematical benchmarks, using a Python interpreter as the external tool. Our results show that the TIR model decisively outperforms its pure-text counterpart on the pass@k metric. Crucially, this advantage is not confined to computationally intensive problems but extends to those requiring significant abstract insight.
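For readers unfamiliar with the pass@k metric mentioned in the abstract, the following is a minimal sketch of how it is commonly estimated. It assumes the standard unbiased estimator computed from n sampled generations per problem, of which c are correct; the abstract does not state which estimator the authors use, and the function name `pass_at_k` is purely illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of generations sampled for the problem
    c: number of those generations that are correct
    k: budget of attempts being evaluated (1 <= k <= n)

    Returns the probability that at least one of k generations, drawn
    without replacement from the n samples, is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the draws of k samples that are all incorrect;
    # it is 0 whenever fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score is the mean over problems, for example `mean_pass_at_k([(4, 1), (4, 0)], k=2)`. Large values of k probe how much of the answer space a model can reach at all, which is why the metric suits a comparison of capability ceilings.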

Introduction. Large language models (LLMs) have rapidly progressed from fluent generators to general-purpose problem solvers. Nevertheless, purely text-based reasoning often struggles with tasks that demand precise calculation, long-horizon search, faithful verification, or access to information beyond a model’s parametric memory. Tool-Integrated Reasoning (TIR) [3, 8] has emerged as a powerful and empirically successful paradigm for addressing these limitations. Systems equipped with external tools have consistently and significantly outperformed their pure-text counterparts [9, 10, 18]. However, despite the widespread recognition of TIR’s effectiveness, a principled account of its fundamental mechanisms, specifically why and when it helps, is still missing. Existing research has largely focused on demonstrating empirical success, leaving a crucial gap: a formal framework that can explain the origins of TIR’s benefits and define its capability boundaries. To build such a framework, we first turn to reinforcement learning (RL) [6, 13], the predominant paradigm for enhancing LLM reasoning.

Discussion / Conclusion. In this work, we presented a comprehensive investigation into the foundational mechanisms of Tool-Integrated Reasoning (TIR). We moved beyond empirical demonstrations to establish a formal theoretical framework explaining its effectiveness. Our core theoretical contribution is the proof that TIR enables a strict expansion of both the empirical and feasible support of an LLM, breaking the “invisible leash” that constrains pure-text models and making complex algorithmic strategies practically achievable within finite token budgets. On the algorithmic front, we identified the instability of reward shaping for guiding model behavior in TIR systems and proposed Advantage Shaping Policy Optimization (ASPO), a stable and effective alternative that directly modifies the advantage function. Our experiments provided strong empirical validation for these claims.
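The contrast between shaping the reward and shaping the advantage can be illustrated with a toy example. The sketch below is not the ASPO algorithm as specified by the authors. It assumes a group-normalized advantage (as in GRPO-style training), a hypothetical auxiliary behavior score `aux`, and hypothetical `weight` and `clip` parameters. It shows only one plausible source of instability: when every response in a group is correct, normalizing a shaped reward rescales the small auxiliary term to unit variance, whereas a bounded term added after normalization stays small.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-6):
    """Standardize values within one group of sampled responses."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def reward_shaping(correct, aux, weight=0.1):
    """Add the auxiliary score to the reward, then normalize (illustrative)."""
    shaped = [c + weight * a for c, a in zip(correct, aux)]
    return group_normalize(shaped)


def advantage_shaping(correct, aux, weight=0.1, clip=0.5):
    """Normalize the correctness reward first, then add a clipped,
    mean-centered auxiliary term directly to the advantage (illustrative)."""
    base = group_normalize(correct)
    aux_mean = mean(aux)
    shaped = []
    for b, a in zip(base, aux):
        bonus = max(-clip, min(clip, weight * (a - aux_mean)))
        shaped.append(b + bonus)
    return shaped


# A group in which every response is correct, so correctness carries no signal.
correct = [1, 1, 1, 1]
aux = [0.0, 0.1, 0.2, 0.3]  # hypothetical behavior score, e.g. tool-use preference

print(reward_shaping(correct, aux))     # auxiliary term inflated to unit scale
print(advantage_shaping(correct, aux))  # auxiliary term stays bounded by `clip`
```

In this toy group, the reward-shaped advantages have magnitude above 1 even though the responses differ only in the auxiliary score, while the advantage-shaped values remain within the chosen bound. The bound makes the strength of the behavioral nudge an explicit design choice rather than a by-product of normalization.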