What is the relationship between reasoning depth and verbalization requirements?
This explores whether reasoning that goes 'deeper' actually has to be spelled out in words — or whether the visible chain of thought is separate from the thinking itself, and how much of it you really need.
This question asks how reasoning depth relates to how much a model has to say out loud, and the corpus pulls apart what most people assume is one thing into two. The surprising thread running through several notes is that verbalization is largely *decoupled* from reasoning depth. Multiple architectures show that models can scale up their thinking entirely in hidden states, never emitting a single 'thinking' token: depth-recurrent models, Heima, and Coconut all add reasoning depth by iterating internal representations, which suggests that writing out steps is a training habit, not a requirement for going deeper (see "Can models reason without generating visible thinking tokens?"). Meta's Large Concept Model pushes the same idea up a level, reasoning over whole-sentence embeddings in a language-agnostic space before any words get decoded (see "Can reasoning happen at the sentence level instead of tokens?").
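The hidden-state iteration described above can be pictured with a toy sketch. This is not the architecture of any of the named models; the function `latent_reason`, the weight matrix `W`, the bias `b`, and the `tanh` update are all assumptions chosen only to show depth being added by reusing one block on an internal state while no tokens are produced.

```python
import numpy as np

def latent_reason(h0, W, b, num_iters):
    """Apply the same block to a hidden state num_iters times.

    Illustrative assumption: one "unit of depth" is a single tanh update.
    Nothing is decoded to text inside this loop.
    """
    h = h0
    for _ in range(num_iters):
        h = np.tanh(W @ h + b)  # deeper processing, still in hidden space
    return h

# Synthetic weights and state, for illustration only.
rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.5, size=(d, d))
b = rng.normal(scale=0.1, size=d)
h0 = rng.normal(size=d)

shallow = latent_reason(h0, W, b, num_iters=2)   # little test-time compute
deep = latent_reason(h0, W, b, num_iters=16)     # more compute, same output size
```

Both calls return a vector of the same size; the only thing that changes is how many times the block ran, which is the sense in which depth here is independent of how much text is emitted.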
So if verbalization isn't *required* for depth, what does the verbalized chain actually buy you? Here the picture the corpus paints is sharply non-monotonic. Accuracy doesn't rise with length; it follows an inverted U. Optimal chain-of-thought length grows with task difficulty but *shrinks* as the model gets more capable, and RL training naturally drifts toward shorter chains as models improve (see "Why does chain of thought accuracy eventually decline with length?"). That means that past a certain point, more verbalization hurts. You can even treat verbosity as a single steerable direction in activation space and cut chain length by two-thirds with no accuracy loss (see "Can we steer reasoning toward brevity without retraining?"). Verbalized length, in other words, is a knob, not a measure of how hard the model is thinking.
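To make the idea of a single steerable direction concrete, here is a minimal sketch under stated assumptions. It uses a difference of means over paired activations, which is one common way to build a steering vector; the notes do not confirm that this is the exact procedure used, and the activations below are synthetic. The names `steering_vector`, `steer`, and `alpha` are hypothetical.

```python
import numpy as np

def steering_vector(verbose_acts, concise_acts):
    """Mean difference over paired activations (assumed construction).

    The result points from the verbose examples toward the concise ones.
    """
    return (concise_acts - verbose_acts).mean(axis=0)

def steer(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering direction with strength alpha."""
    return hidden + alpha * vector

# Synthetic stand-ins for activations from 50 verbose/concise pairs.
rng = np.random.default_rng(0)
n_pairs, d = 50, 16
true_direction = np.zeros(d)
true_direction[0] = 1.0  # pretend brevity lives on axis 0
verbose = rng.normal(size=(n_pairs, d))
concise = verbose + 2.0 * true_direction + rng.normal(scale=0.1, size=(n_pairs, d))

v = steering_vector(verbose, concise)
h = rng.normal(size=d)
h_steered = steer(h, v, alpha=1.0)
```

In this toy setup the recovered vector is dominated by axis 0, the direction that was planted in the synthetic data. In a real model the vector would be added to a chosen layer's activations at inference time, with `alpha` tuned so that chains shorten without hurting accuracy.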
The darker side: longer verbalized chains aren't free, because each spoken step is a place where things can go wrong. Reasoning LLMs tend to *wander* unsystematically rather than search, so success drops exponentially as problem depth grows (see "Why do reasoning LLMs fail at deeper problem solving?"). And every extra verbalized step is an attack surface: manipulative multi-turn prompts knock 25–29% off reasoning-model accuracy precisely because extended chains create more intervention points where one corrupted step propagates (see "Why do reasoning models fail under manipulative prompts?"). More words to reason through can mean more ways to derail.
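The exponential drop mentioned above follows from simple arithmetic if one assumes, purely for illustration, that every step succeeds independently with the same probability. The notes do not state this model; `chain_success` and the value 0.95 are hypothetical.

```python
def chain_success(p_step, depth):
    """Probability that all `depth` steps succeed.

    Assumption: steps are independent and each succeeds with p_step.
    """
    return p_step ** depth

# With a 95% per-step success rate, deeper problems decay quickly.
shallow = chain_success(0.95, 5)    # about 0.774
medium = chain_success(0.95, 20)    # about 0.358
deep = chain_success(0.95, 50)      # about 0.077
```

Even a per-step reliability that looks high leaves a 50-step chain succeeding less than one time in ten under this assumption, which is why each additional verbalized step carries a cost.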
There's a real tension worth noticing, though. Verbalization isn't only overhead; it's also where *grounding* happens. ReAct shows that interleaving spoken reasoning with real-world tool queries injects feedback at each step and prevents the error propagation that pure internal reasoning can't catch, outperforming pure chain-of-thought and reinforcement-learning baselines by 10–34% absolute accuracy on knowledge-intensive and interactive tasks (see "Can interleaving reasoning with real-world feedback prevent hallucination?"). So the verbalized step earns its cost when it connects to something external. And several apparent 'depth' failures turn out not to be reasoning failures at all: models often know the algorithm but can't *execute* it across enough text-bound steps, and giving them tools dissolves the supposed reasoning cliff (see "Are reasoning model collapses really failures of reasoning?").
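The interleaving of spoken reasoning and tool queries can be sketched as a small loop. This is a toy illustration, not the ReAct implementation: `scripted_policy` is a hard-coded stand-in for a model, `lookup` is a hypothetical tool, and the facts in `TOY_FACTS` are invented placeholders.

```python
# Invented placeholder facts standing in for an external knowledge source.
TOY_FACTS = {
    "capital of Freedonia": "Fredville",
    "population of Fredville": "1234",
}

def lookup(query):
    """Hypothetical tool: returns a stored fact or a miss marker."""
    return TOY_FACTS.get(query, "no result")

def scripted_policy(question, trace):
    """Stand-in for a model: returns (thought, action, argument)."""
    observations = [obs for _, _, _, obs in trace]
    if not observations:
        return ("I need the capital first.", "lookup", "capital of Freedonia")
    if len(observations) == 1:
        city = observations[0]
        return (f"The capital is {city}; now its population.",
                "lookup", f"population of {city}")
    return ("I have what I need.", "finish", observations[-1])

def react_loop(question, policy, tools, max_steps=5):
    """Alternate thought, action, and observation until the policy finishes."""
    trace = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, trace)
        if action == "finish":
            return arg, trace
        observation = tools[action](arg)  # external feedback enters here
        trace.append((thought, action, arg, observation))
    return None, trace

answer, trace = react_loop(
    "What is the population of the capital of Freedonia?",
    scripted_policy,
    {"lookup": lookup},
)
```

The second thought is built from the first observation rather than from the model's own guess, which is the point at which grounding replaces an unverified intermediate step.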
The thing you didn't know you wanted to know: depth and verbalization point in *opposite* directions for capable models. Raw reasoning capacity already sits latent in base models waiting to be elicited (see "Do base models already contain hidden reasoning ability?"), and better models reach further with *fewer* spoken steps. The frontier question becomes not 'how much should the model write out' but 'what kind of structure does the depth need'. For instance, allocating compute to diverse abstractions forces a breadth-first search that beats simply thinking longer down one chain (see "Can abstractions guide exploration better than depth alone?").
Sources (10 notes)
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that brevity emerges from the reward signal rather than from explicitly training for it.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.
ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.