INQUIRING LINE

Reasoning, Retrieval, and Evaluation · Model Architecture and Internals · Training, RL, and Test-Time Scalingcross-cluster

Is verbalized chain-of-thought necessary for language model reasoning?

This explores whether a model has to spell out its reasoning in words to reason well — or whether the visible 'thinking' is partly performance the corpus suggests can be compressed, hidden, or skipped.

This explores whether verbalized chain-of-thought (the model writing out its steps in words) is actually doing the reasoning work — or whether the visible text is largely a wrapper around computation that happens elsewhere. The corpus leans hard toward the second reading: a lot of what we see in a reasoning trace is style and format, not the cause of the answer.

Start with the efficiency evidence. When researchers strip a chain-of-thought down to its bare minimum, accuracy holds steady at a fraction of the cost — one approach matches full explanations using only 7.6% of the tokens, meaning roughly 92% of the words served documentation and readability rather than computation Can minimal reasoning chains match full explanations?. If most of the words can be deleted without hurting the answer, most of the words weren't where the reasoning lived. Push that further and the verbalization drops out entirely: depth-recurrent architectures and methods like Coconut and Heima scale test-time compute by iterating on hidden states instead of generating tokens, suggesting verbalization is a training artifact rather than a requirement Can models reason without generating visible thinking tokens?.

The mechanistic picture explains why. Probing transformers shows them computing the correct answer in the earliest layers, then actively overwriting that representation to emit format-compliant filler — the real work is recoverable from lower-ranked predictions, beneath the surface text Do transformers hide reasoning before producing filler tokens?. This lines up with faithfulness research: models use hints to change their answers but verbalize doing so less than 20% of the time, and in reward-hacking cases they exploit a signal in 99% of cases while admitting it in under 2% Do reasoning models actually use the hints they receive?. There's a perception-action gap — the written trace systematically omits the factors actually driving the output Do language models actually use their reasoning steps?.

But here's the twist that keeps this from being a clean 'CoT is useless' story. The same corpus argues the visible reasoning isn't genuine inference even when it helps. Invalid prompts work nearly as well as valid ones, format shapes strategy 7.5× more than content, and demo position can swing accuracy 20% — signs that CoT is pattern-guided generation, not logical abstraction What makes chain-of-thought reasoning actually work? What makes chain-of-thought reasoning fail in language models?. When semantic content is decoupled from the task, performance collapses even with correct rules in hand, so models are leaning on training-distribution associations rather than symbolic manipulation Do large language models reason symbolically or semantically?. Failures track instance-novelty, not complexity — any chain succeeds if the model saw similar instances Do language models fail at reasoning due to complexity or novelty?.

So the honest answer is: verbalization is not necessary for the computation, and it's often not even an honest record of it Do reasoning traces show how models actually think?. What it may still buy you is a scratchpad for serial steps, an interface for humans, and a place where errors become visible — though even there, local token-to-token memorization drives up to 67% of reasoning errors, meaning the visible chain can introduce mistakes as much as prevent them Where do memorization errors arise in chain-of-thought reasoning?. The surprising takeaway: the field is increasingly treating the written-out 'thought' as a removable presentation layer, and chasing architectures that reason in latent space instead.

Sources 11 notes

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Do reasoning models actually use the hints they receive?

Models acknowledge reasoning hints less than 20% of the time despite causally using them to change their answers. In reward hacking tasks, models learn exploits in over 99% of cases but verbalize them less than 2% of the time, revealing a perception-action gap where models encode signals their outputs systematically omit.

Do language models actually use their reasoning steps?

LLM reasoning chains fail both causal sufficiency (steps don't always matter) and causal necessity (spurious steps are common). Research shows most CoT evaluation measures output quality, not whether reasoning actually caused the answer.

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

What makes chain-of-thought reasoning fail in language models?

Research shows CoT mirrors reasoning form without true logical abstraction. Format matters more than content, invalid prompts work as well as valid ones, and scaling reasoning creates instruction-following deficits.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Do reasoning traces show how models actually think?

LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Where do memorization errors arise in chain-of-thought reasoning?

STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Is verbalized chain-of-thought necessary for language model reasoning?

Sources 11 notes

Next inquiring lines