Does text-only evaluation hide reasoning collapse that tool use could repair?
This explores whether the dramatic 'reasoning collapses' we see in language models are real limits of thinking — or just artifacts of testing them in text alone, where giving them tools to actually run the steps would close the gap.
The question has two parts: does text-only testing manufacture a fake 'reasoning cliff', and is tool use the repair? The corpus answers fairly directly: yes to the first, and largely yes to the second. Two notes argue the collapse is misdiagnosed. The headline claim is that what looks like a reasoning failure is really an execution failure: models often *know* the algorithm but can't carry out a long multi-step procedure by hand in text, the way you might know how to do long division but lose track doing a 40-digit problem in your head (Are reasoning model collapses really failures of reasoning?). The companion finding sharpens it: the cliff moves depending on how you test. Give models tool access and the catastrophic drop-off disappears, which means text-only benchmarks systematically *underestimate* what models can do in the real world (Does the reasoning cliff depend on how we test models?).
Why would text itself be the bottleneck? One note reframes it philosophically: text-only models are 'Plato's cave' systems. Language is a lossy compression of reality that strips out physics, geometry, and causality, so the model manipulates symbols with no grounding in the dynamics they came from (Are text-only language models fundamentally limited by abstraction?). Tools (a calculator, a code interpreter, a simulator) hand back the grounded execution that text drops. That's the mechanism by which tool use 'repairs' the collapse: it doesn't make the model smarter, it offloads the bookkeeping that text generation does badly.
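The long-division analogy can be made concrete. The sketch below is an illustration, not code from the cited notes: `long_division` and `flawless_trace_probability` are hypothetical names, and the assumption that every step fails independently with the same probability is a deliberate simplification. It shows that the algorithm stays a few lines long while the number of intermediate states to track grows with the input, which is the bookkeeping a tool absorbs.

```python
def long_division(dividend: str, divisor: int):
    """Schoolbook long division on a decimal string.

    Returns (quotient, remainder, steps), where steps counts the
    bring-down/divide operations a person would do by hand.
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    quotient_digits = []
    remainder = 0
    steps = 0
    for ch in dividend:
        remainder = remainder * 10 + int(ch)  # bring down the next digit
        quotient_digits.append(str(remainder // divisor))
        remainder %= divisor
        steps += 1
    quotient = "".join(quotient_digits).lstrip("0") or "0"
    return quotient, remainder, steps


def flawless_trace_probability(per_step_success: float, steps: int) -> float:
    """Chance of an error-free trace, ASSUMING independent per-step errors."""
    return per_step_success ** steps


for n_digits in (4, 40, 400):
    dividend = "7" * n_digits
    q, r, steps = long_division(dividend, 13)
    # The "tool" is exact at every length.
    assert int(q) == int(dividend) // 13 and r == int(dividend) % 13
    print(n_digits, steps, round(flawless_trace_probability(0.99, steps), 3))
```

Under that independence assumption, a 99% per-step success rate leaves roughly a 67% chance of a clean 40-step trace and under 2% for 400 steps, while the executed function is exact at any length. The numbers are illustrative, not measurements of any model.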
But the corpus also complicates the optimistic read, and this is where it gets interesting. If failures were purely about execution bandwidth, tools would fix everything. Yet another line of work finds reasoning breaks down at *instance-level unfamiliarity*, not task complexity: models fit patterns from training instances rather than learning a general algorithm, so a problem fails when it's novel, regardless of length (Do language models fail at reasoning due to complexity or novelty?). A related critique argues chain-of-thought is 'constrained imitation', reproducing familiar reasoning *forms* rather than performing genuine inference, which is why it degrades under distribution shift (Does chain-of-thought reasoning reveal genuine inference or pattern matching?; Why does chain-of-thought reasoning fail in predictable ways?). Tools repair the collapse that is procedural; they don't repair the collapse that comes from the model never having generalized the algorithm in the first place. So the honest answer is: text-only evaluation hides *a* collapse that tools repair, while sitting on top of a deeper one they don't.
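The difference between a complexity threshold and a novelty boundary can be shown with a deliberately crude toy. The sketch below is not a model of any language model: `InstanceMemorizer` is a hypothetical lookup table standing in for instance-based fitting, and string reversal is an arbitrary stand-in task. It only shows that a solver of this kind can succeed on a long, seen instance and fail on a short, unseen one.

```python
def reverse_string(s: str) -> str:
    """General algorithm: works for any input, seen or unseen."""
    return s[::-1]


class InstanceMemorizer:
    """Toy stand-in for instance-based fitting: answers only what it has seen."""

    def __init__(self, training_inputs):
        self.table = {x: reverse_string(x) for x in training_inputs}

    def solve(self, x: str):
        return self.table.get(x)  # None for any unseen instance


seen_long = "31415926535897932384"  # long, but in the training set
novel_short = "907"                 # short, but never seen

memorizer = InstanceMemorizer([seen_long, "12", "4821"])
for x in (seen_long, novel_short):
    correct = memorizer.solve(x) == reverse_string(x)
    print(f"length={len(x):2d} correct={correct}")
```

Here failure tracks whether the instance was seen, not how long it is. Handing the memorizer a faster executor would not help, because what is missing is the rule, not the capacity to run it.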
There's a second twist worth knowing about: text may also hide reasoning the model *did* do. Logit-lens work shows that models trained with hidden chain-of-thought tokens can compute the correct answer in early layers (layers 1-3 in the cited analysis), then suppress it in the final layers in favor of format-compliant filler, so the visible text understates the internal computation (Do transformers hide reasoning before producing filler tokens?). Other work shows test-time compute can scale through hidden-state iteration, with no verbalized steps, suggesting the written chain is partly a training artifact rather than the reasoning itself (Can models reason without generating visible thinking tokens?). Read together, the text channel is unreliable in both directions: it makes capable models look like they failed (execution), and it makes their actual computation invisible (overwriting and latent reasoning).
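For readers unfamiliar with the technique, a logit lens reads an intermediate hidden state as if it were the final one, by projecting it onto the vocabulary and ranking the tokens. The sketch below is a minimal pure-Python illustration with hand-picked toy numbers: `VOCAB`, `UNEMBED`, and `HIDDEN_BY_LAYER` are all hypothetical, and the final normalization step that real implementations usually apply before the projection is omitted. It is not a reproduction of the cited analysis.

```python
VOCAB = ["7", "3", "."]  # "7" plays the correct answer, "." the filler token

# Toy unembedding matrix: one row of weights per vocabulary token.
UNEMBED = [
    [1.0, 0.0, 0.0],  # "7"
    [0.0, 1.0, 0.0],  # "3"
    [0.0, 0.0, 1.0],  # "."
]

# Hand-picked hidden states at one position, one per layer (not real data).
HIDDEN_BY_LAYER = [
    [0.9, 0.2, 0.1],  # layer 1: answer direction dominates
    [1.2, 0.1, 0.3],  # layer 2: answer still on top
    [0.8, 0.1, 1.5],  # final layer: filler direction overtakes the answer
]


def logit_lens(hidden):
    """Project a hidden state onto the vocabulary; return tokens by rank."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in UNEMBED]
    ranked = sorted(zip(VOCAB, logits), key=lambda pair: pair[1], reverse=True)
    return [token for token, _ in ranked]


for layer, hidden in enumerate(HIDDEN_BY_LAYER, start=1):
    print(layer, logit_lens(hidden))
```

In this toy, the answer token is top-ranked in the early layers and drops to second place behind the filler at the final layer, which mirrors the pattern the note describes: the emitted token hides an answer that is still readable from lower-ranked predictions.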
The practical upshot, and the thing you may not have known you wanted: this is really an argument about *evaluation design*, not model capability. If a benchmark forbids tools, it isn't measuring reasoning; it's measuring reasoning plus mental-arithmetic stamina, and reporting the sum as the former. That's why the corpus is converging on richer evaluation. For instance, agentic judges that gather evidence dynamically, rather than scoring a single text blob, are reported to reach 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks, a gap of roughly two orders of magnitude, though the same work found that errors cascaded through the memory module, so the gains depend on error isolation (Can agents evaluate AI outputs more reliably than language models?). The lesson generalizes past the cliff: how you let a model work during a test silently decides what the test appears to prove.
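The size of the agentic-judge improvement can be checked from the two figures in the source note (0.27% and 31% judge shift). The snippet below is only that arithmetic; the variable names are illustrative, and treating judge shift as a proxy for evaluation error is an interpretive assumption.

```python
import math

agentic_shift = 0.27    # percent judge shift, as reported in the source note
llm_judge_shift = 31.0  # percent judge shift, as reported in the source note

ratio = llm_judge_shift / agentic_shift
orders_of_magnitude = math.log10(ratio)

print(f"ratio: {ratio:.1f}x")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
```

The ratio comes out a little above 100, so "about two orders of magnitude" is the defensible reading of the reported figures.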
Sources (9 notes)
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Language models show catastrophic failure in text-only reasoning benchmarks but maintain scaling when given tool access. The cliff reflects execution constraints, not reasoning capability, making text-only evaluations systematically underestimate real-world performance.
Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.