Does the token prediction framing actually capture what human reasoning does?
This explores whether 'the model is just predicting the next token' describes what is actually happening when a model reasons, or whether the visible stream of tokens is a misleading place to look. The corpus splits into two answers that are both interesting, and the tension between them is the real story.
On one side, several lines of work argue that reasoning genuinely *emerges* from token prediction rather than being layered on top of it. Quiet-STaR trains models to generate a rationale at every token position on arbitrary text and judges each rationale purely by whether it improves the next-token prediction, so reasoning shows up as a side effect of better language modeling (see "Can models learn reasoning from predicting any text?"). Reinforcement Pre-Training goes further and reframes next-token prediction itself as a reasoning task, using the corpus as its own verifier (see "Can next-token prediction become a reasoning task with RL?"), and a related approach, RLP, treats chain-of-thought as an exploratory action during pretraining, rewarded by how much it improves prediction (see "Can chain-of-thought reasoning be learned during pretraining itself?"). In this view, token prediction isn't a poor proxy for reasoning: done at scale, it *is* where reasoning gets installed.
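The shared idea behind these methods, rewarding a rationale by how much it improves prediction of the true next token, can be sketched in a few lines. This is a simplified illustration, not the actual Quiet-STaR or RLP objective: the function names and the toy probability tables are assumptions made up for the example, and a real system would obtain both distributions from the same language model with and without the generated rationale in context.

```python
import math

def log_prob(dist, token):
    """Log-probability that a predictive distribution assigns to a token."""
    return math.log(dist[token])

def rationale_reward(dist_without, dist_with, true_next_token):
    """Hypothetical toy reward: gain in log-likelihood of the true next
    token when the model conditions on a rationale."""
    return log_prob(dist_with, true_next_token) - log_prob(dist_without, true_next_token)

# Made-up next-token distributions over a three-token vocabulary.
baseline = {"4": 0.30, "5": 0.40, "22": 0.30}      # p(next | context)
with_thought = {"4": 0.80, "5": 0.10, "22": 0.10}  # p(next | context, rationale)

# Positive reward: the rationale made the true continuation more likely.
reward = rationale_reward(baseline, with_thought, true_next_token="4")
print(f"reward = {reward:.3f}")
```

Note that nothing in this reward checks whether the rationale is *correct*. It only checks whether the text that follows became easier to predict, which is why no labeled reasoning data is needed.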
But the same corpus undercuts the idea that the *visible tokens* are the reasoning. Models trained on deliberately corrupted, logically irrelevant traces perform about as well as those trained on correct ones, which suggests the trace works as computational scaffolding rather than as a record of thought (see "Do reasoning traces need to be semantically correct?"). And when a model is pushed outside its training distribution, the chain-of-thought stays fluent while the logic quietly fails: it imitates the *form* of reasoning without the underlying validity (see "Does chain-of-thought reasoning actually generalize beyond training data?"). So the tokens can look like reasoning while doing none of it, which is exactly the gap the question points at.
Stranger still, the real computation often isn't in the tokens at all. Logit-lens analysis shows that transformers trained with hidden chain-of-thought can compute the correct answer in early layers and then actively *overwrite* it with format-compliant filler before emitting tokens (see "Do transformers hide reasoning before producing filler tokens?"). And whole families of models (depth-recurrent architectures, Coconut, Heima) scale their reasoning by iterating in continuous latent space with no verbalized steps at all, implying that verbalization is a training artifact rather than a requirement for thinking (see "Can models reason without generating visible thinking tokens?"). The token stream, on this evidence, is closer to a *report* of reasoning than the reasoning itself.
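To make the logit-lens idea concrete, here is a minimal sketch that decodes each layer's hidden state with the output (unembedding) matrix and reports the top-ranked token per layer. Everything in it is an assumption for illustration: the two-dimensional hidden states, the three-token vocabulary, and the numbers are invented so that the early layers decode to an answer and the final layer decodes to filler. A real analysis would read hidden states from an actual transformer and would typically apply the final layer norm before unembedding, which is omitted here.

```python
def logit_lens(hidden_states, unembed, vocab):
    """Project every layer's hidden state onto the vocabulary and
    return the highest-scoring token for each layer."""
    top_tokens = []
    for h in hidden_states:
        # One logit per vocabulary row: dot product of the row with h.
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        best = max(range(len(vocab)), key=logits.__getitem__)
        top_tokens.append(vocab[best])
    return top_tokens

# Hypothetical toy setup (all values invented for the example).
vocab = ["7", "...", "the"]
unembed = [
    [1.0, 0.0],  # direction for the answer token "7"
    [0.0, 1.0],  # direction for the filler token "..."
    [0.5, 0.5],  # direction for "the"
]
hidden_states = [
    [0.9, 0.1],  # layer 1: answer direction dominates
    [0.8, 0.3],  # layer 2: still the answer
    [0.2, 0.9],  # final layer: overwritten by filler
]

per_layer = logit_lens(hidden_states, unembed, vocab)
print(per_layer)
```

In this toy case the answer token ranks first in the early layers and the filler token ranks first only at the end, which is the qualitative pattern the analysis describes.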
The payoff: the corpus doesn't treat the token stream as uniform anyway. Only ~20% of tokens are high-entropy 'forking points' that actually carry the learning signal (see "Do high-entropy tokens drive reasoning model improvements?"), specific words like 'Wait' and 'Therefore' spike in mutual information with the right answer (see "Do reflection tokens carry more information about correct answers?"), and models internally rank tokens by function, protecting symbolic computation while discarding grammar and filler (see "Which tokens in reasoning chains actually matter most?"). So 'token prediction' turns out to be too coarse a description in both directions: most tokens aren't doing reasoning work, and the tokens that matter may be pointers to computation happening somewhere the words can't show you. Whether that resembles *human* reasoning is left open, but the framing clearly isn't capturing the thing it appears to name.
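The notion of a high-entropy forking token can be illustrated with a short sketch that scores each position by the Shannon entropy of its next-token distribution and keeps the top fraction. The helper names, the 20% default, and the toy distributions are assumptions for the example. The cited work's exact selection procedure may differ, and real distributions would come from a model's softmax output over its full vocabulary.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def forking_positions(dists, top_fraction=0.2):
    """Return the indices of the highest-entropy positions,
    keeping roughly top_fraction of all positions (at least one)."""
    k = max(1, round(len(dists) * top_fraction))
    ranked = sorted(range(len(dists)), key=lambda i: entropy(dists[i]), reverse=True)
    return sorted(ranked[:k])

# Invented next-token distributions for five positions in a reasoning chain.
dists = [
    [0.97, 0.01, 0.01, 0.01],     # near-certain continuation
    [0.25, 0.25, 0.25, 0.25],     # genuinely undecided: a forking point
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]

forks = forking_positions(dists)
print(forks)
```

With five positions and a 20% cutoff, only the uniform distribution at index 1 is selected. A training scheme in this spirit would restrict gradient updates to positions like that one.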
Sources (10 notes)
Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.
Reinforcement Pre-Training transforms next-token prediction into a reasoning task by providing verifiable rewards from the corpus itself, eliminating reward hacking and enabling inference-time scaling during pretraining. This suggests token-level reasoning patterns during pretraining strengthen downstream RL fine-tuning.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.