SYNTHESIS NOTE

Do tools actually expand what language models can reason about?

Explores whether tool access fundamentally breaks through the reasoning limits of pure-text models or merely optimizes existing capabilities. The distinction clarifies whether tools are a luxury feature or a necessity for genuine capability growth.

Synthesis note · 2026-06-03 · sourced from Reinforcement Learning

Tool-Integrated Reasoning (TIR), in which a model calls a Python interpreter or another external tool mid-reasoning, reliably outperforms pure-text reasoning, but the field has shown this only empirically, without a principled account of why and when it helps. The source paper supplies that account as a proof: TIR strictly expands both the model's empirical support and its feasible support, breaking the "invisible leash" that constrains pure-text models. Tools make complex algorithmic strategies practically achievable within a finite token budget, including strategies that are impossible or intractably verbose to express in text alone. Crucially, the advantage is not confined to compute-heavy arithmetic; it extends to problems requiring abstract insight.
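The token-budget point can be made concrete with a toy comparison. The sketch below is an illustration only, not taken from the paper: it uses Collatz step counting as a stand-in algorithmic task, counts characters as a crude proxy for tokens, and uses a hypothetical `run_tool` helper in place of a real interpreter tool. The program's size stays essentially fixed, while a written-out trace grows with the number of steps.

```python
# Illustrative sketch: characters stand in for tokens (an assumption, not a tokenizer).
# PROGRAM_TEMPLATE, run_tool, text_trace and costs are hypothetical names.

PROGRAM_TEMPLATE = (
    "n, result = {n}, 0\n"
    "while n != 1:\n"
    "    n = n // 2 if n % 2 == 0 else 3 * n + 1\n"
    "    result += 1\n"
)


def run_tool(program: str) -> int:
    """Stand-in for an interpreter tool: execute the code and return `result`."""
    scope = {}
    exec(program, scope)
    return scope["result"]


def text_trace(n: int) -> str:
    """Spell out every intermediate value, as a pure-text derivation would."""
    lines = []
    while n != 1:
        nxt = n // 2 if n % 2 == 0 else 3 * n + 1
        lines.append(f"{n} -> {nxt}")
        n = nxt
    return "\n".join(lines)


def costs(n: int) -> tuple[int, int]:
    """Return (program length, text-trace length) in characters."""
    return len(PROGRAM_TEMPLATE.format(n=n)), len(text_trace(n))


for start in (6, 27, 97):
    program_cost, trace_cost = costs(start)
    steps = run_tool(PROGRAM_TEMPLATE.format(n=start))
    print(f"n={start}: steps={steps}, program={program_cost}, trace={trace_cost}")
```

For very short computations the written trace is actually cheaper than the program; the gap opens only as the step count grows, which is the regime the finite-budget argument is about.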

On the training side, the paper finds that reward shaping for TIR destabilizes training and proposes Advantage Shaping Policy Optimization (ASPO), which modifies the advantage function directly, rather than the reward, to guide behavior while keeping training stable.
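The difference between shaping the reward and shaping the advantage can be sketched with a small example. This note does not give ASPO's formula, so the code below is not the paper's method: it assumes a group-normalized advantage (in the style of GRPO), a binary correctness reward, and a hypothetical "early tool use" flag, and every function name and parameter (`delta`, `k`, `weight`) is invented for illustration. It shows one way a bonus folded into the reward can distort the correctness signal after normalization, while a clipped bonus added to the advantage cannot flip its sign.

```python
from statistics import mean, pstdev

EPS = 1e-8  # avoids division by zero when all values in a group are equal


def group_normalize(values):
    """Group-normalized advantage: (v - mean) / std (assumed GRPO-style)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def reward_shaped_advantages(correct, early_tool, weight=1.0):
    """Reward shaping: add the behavior bonus to the reward, then normalize."""
    shaped_rewards = [c + weight * e for c, e in zip(correct, early_tool)]
    return group_normalize(shaped_rewards)


def advantage_shaped(correct, early_tool, delta=0.5, k=0.5):
    """Advantage shaping (illustrative): normalize correctness alone, then add
    a bonus to correct responses only, capped at k * |base advantage| so the
    sign set by correctness is preserved (requires k < 1)."""
    base = group_normalize(correct)
    shaped = []
    for c, e, a in zip(correct, early_tool, base):
        bonus = min(delta * e, k * abs(a)) if c > 0 else 0.0
        shaped.append(a + bonus)
    return shaped


correct = [1, 1, 1, 0]      # three correct responses, one incorrect
early_tool = [0, 0, 1, 1]   # which responses called the tool early

print("reward shaping:   ", reward_shaped_advantages(correct, early_tool))
print("advantage shaping:", advantage_shaped(correct, early_tool))
```

In this group, reward shaping gives the two correct responses that skipped the tool a negative advantage, the same value as the incorrect response. The advantage-shaped version keeps every correct response positive and still ranks the tool-using one highest.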

This is the reasoning-side companion to "Can models store unlimited facts without growing larger?": one proof concerns factual capacity, this one concerns reasoning reach. Together they give a formal foundation for why agentic harnesses beat bigger models, and they sharpen "Does the reasoning cliff depend on how we test models?", which observed empirically that tool access dissolves apparent reasoning ceilings.

Inquiring lines that read this note 11

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Do language models learn genuine linguistic structure or just surface patterns?

How does tool integration leverage comprehension without demanding perfect generation?

Do harness improvements transfer across model scales or memorize shortcuts?

What makes API-based scaffolding more trustworthy than direct model access in high-stakes domains?

How effectively do deterministic tools improve language model reasoning on formal tasks?

Why does finetuning cause catastrophic forgetting of model capabilities?

Why does tool use decouple factual capacity from model parameter count?

Why do benchmark improvements fail to reflect actual reasoning quality?

How does evaluation setting affect measured reasoning capabilities in language models?

Which computational strategies best support reasoning in language models?

Can text-space optimization and audit governance coexist in a single skill lifecycle?

Why does verification consistently lag behind AI generation?

What role does verifier design play in reasoning capability gains?

What memory architectures best support persistent reasoning across extended interactions?

What capacity limits does the memory model face as corpus grows?

How does objective evolution guide discovery better than fixed planning?

How does compiling natural language goals into executable code enable objective evolution?

How can AI agents autonomously learn and transfer skills across tasks?

Can skill repositories evolve toward execution-oriented refinement over time?

Related concepts in this collection 3


Can models store unlimited facts without growing larger? Does external tool use let language models recall facts without being constrained by parameter count? This matters because it could reshape how we scale knowledge capacity beyond architectural limits.
companion proof on the factual-capacity axis
Does the reasoning cliff depend on how we test models? If language models hit a capability wall in text-only reasoning tasks, does that wall disappear when they can use tools? What does this reveal about what we're actually measuring?
the empirical observation this proof explains
Can modular cognitive tools unlock reasoning without training? Can reasoning capabilities be elicited by structuring LLM calls as isolated cognitive operations—understanding, recalling, examining, and backtracking—rather than through reinforcement learning?
a concrete instantiation of tool-augmented reasoning expanding what the base model can do
