Does reasoning require verbalization to be trainable and controllable?
This explores whether reasoning has to be spelled out in words (chain-of-thought) for us to train it and steer it — or whether reasoning can live in a model's hidden states and still be shaped and controlled.
The corpus's answer is a fairly emphatic no. Several architectures show models reasoning entirely in hidden state, with no visible thinking tokens at all: depth-recurrent models, Heima, and Coconut scale test-time compute by iterating internally rather than emitting steps (see "Can models reason without generating visible thinking tokens?"), and a 27M-parameter recurrent model solved Sudoku-Extreme and 30×30 mazes perfectly while chain-of-thought methods scored zero (see "Can models reason without generating visible thinking steps?"). The throughline: verbalization looks like a training artifact, not a requirement of reasoning itself.
That reframes where the reasoning actually comes from. A separate line of work argues the capability is already latent in base models and that training merely elicits it: RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR all surface reasoning that was already present in base-model activations (see "Do base models already contain hidden reasoning ability?"). Gains can even be unlocked with no training at all, by wrapping reasoning operations in modular, sandboxed tool calls (see "Can modular cognitive tools unlock reasoning without training?"). If reasoning is something you select rather than create, then verbalized traces are one elicitation channel among several, not the substrate.
On controllability, the interesting finding is that reasoning can be steered without verbose traces and without retraining. Verbose and concise chain-of-thought occupy distinct, linearly separable regions of activation space, so a single steering vector extracted from 50 paired examples cuts CoT length by 67% while maintaining accuracy (see "Can we steer reasoning toward brevity without retraining?"). Models can also be trained to route between extended thinking and quick direct answers, learning *when* to reason rather than always narrating it (see "Can models learn when to think versus respond quickly?"). Control, in other words, operates on internal representations and policies, not on the presence of words.
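To make the steering-vector idea concrete, here is a minimal sketch using a difference of class means, which is one common way to extract a steering direction from paired examples. Whether Activation-Steered Compression computes its vector exactly this way is an assumption, and every name below (`steering_vector`, `steer`, `alpha`, the toy 3-d activations) is hypothetical; real activations would be read from a model's hidden layers.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(verbose_acts, concise_acts):
    """Difference of class means, pointing from verbose toward concise."""
    mu_verbose = mean_vector(verbose_acts)
    mu_concise = mean_vector(concise_acts)
    return [c - v for c, v in zip(mu_concise, mu_verbose)]


def steer(hidden, vector, alpha=1.0):
    """Shift one hidden state along the steering direction by strength alpha."""
    return [h + alpha * s for h, s in zip(hidden, vector)]


def projection(hidden, vector):
    """Scalar projection of a hidden state onto the steering direction."""
    norm = sum(s * s for s in vector) ** 0.5
    return sum(h * s for h, s in zip(hidden, vector)) / norm


# Toy activations (assumed data): the two styles differ along the first axis only.
verbose_acts = [[2.0, 0.0, 1.0], [2.2, 0.2, 0.8], [1.8, -0.2, 1.2]]
concise_acts = [[0.0, 0.0, 1.0], [0.2, 0.2, 0.8], [-0.2, -0.2, 1.2]]

v = steering_vector(verbose_acts, concise_acts)
h = [2.1, 0.1, 0.9]            # a hidden state from a verbose trace
h_steered = steer(h, v, alpha=1.0)
```

The point of the sketch is that the intervention touches only the hidden state: no weights change and no tokens are added or removed by hand. In practice `alpha` would need tuning, since too large a shift can push activations out of distribution.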
Where words still earn their keep is in training signal. Approaches like RLP treat chain-of-thought as an exploratory action during pretraining, rewarded by how much it improves next-token prediction (see "Can chain-of-thought reasoning be learned during pretraining itself?"), and Quiet-STaR generates token-level rationales on arbitrary text, judging them purely by predictive accuracy (see "Can models learn reasoning from predicting any text?"). Here verbalized intermediate steps are useful *scaffolding* for learning, a way to plant procedural knowledge, which is what actually drives reasoning generalization (see "Does procedural knowledge drive reasoning more than factual retrieval?"), but the scaffolding can be removed at inference.
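The "rewarded by how much it improves next-token prediction" idea can be sketched as a log-likelihood difference: score the observed next token once with the thought in context and once without, and reward the thought by the gain. This is an illustrative sketch, not RLP's actual implementation; how the no-thought baseline is produced is not specified here, and the function name and toy distributions are hypothetical.

```python
import math


def thought_reward(p_with_thought, p_without_thought, next_token):
    """Log-likelihood improvement on the observed next token.

    Positive when the thought made the true token more likely,
    negative when it made the true token less likely.
    """
    return math.log(p_with_thought[next_token]) - math.log(
        p_without_thought[next_token]
    )


# Toy next-token distributions (assumed data) for a context ending in "2 + 2 =".
baseline = {"4": 0.25, "5": 0.75}   # model without any thought
helpful = {"4": 0.5, "5": 0.5}      # after a thought that helps
harmful = {"4": 0.125, "5": 0.875}  # after a thought that misleads

observed = "4"
r_helpful = thought_reward(helpful, baseline, observed)
r_harmful = thought_reward(harmful, baseline, observed)
```

No external verifier or labeled answer is involved: the only supervision is the text that actually came next, which is why a signal of this kind can be computed on ordinary pretraining data.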
The sharpest caveat is that visible reasoning may not even be the real reasoning. Chain-of-thought can be constrained imitation of reasoning *form*: it reproduces familiar schemata from training rather than performing genuine inference, and it degrades predictably under distribution shift (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"). So the thing you'd be tempted to treat as the trainable, controllable object, the words, may be partly a performance laid over computation happening elsewhere. The unexpected takeaway: verbalization is best understood as a useful interface to reasoning, not its seat. It is convenient for training signal and human inspection, but it is neither necessary for the capability nor where the real control levers live.
Sources (10 notes)
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Depth-recurrent and compressed-token architectures solve reasoning tasks through hidden computation rather than output tokens. A 27M-parameter model solved Sudoku-Extreme and 30×30 mazes perfectly while CoT methods scored zero.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.