INQUIRING LINE

Is 'it just predicts the next word' a real explanation of AI reasoning — or a phrase that hides something more interesting?

Does the token prediction framing actually capture what human reasoning does?

This explores whether 'just predicting the next token' captures reasoning, or whether the visible stream of tokens is a misleading place to look. The corpus splits into two answers that are both interesting — and the tension between them is the real story.

On one side, several lines argue that reasoning genuinely *emerges* from token prediction rather than being layered on top of it. Quiet-STaR trains models to generate a rationale at every token position on arbitrary text, judging each rationale purely by whether it improves the next-token prediction, so reasoning shows up as a side effect of better language modeling (see "Can models learn reasoning from predicting any text?"). Reinforcement Pre-Training goes further and reframes next-token prediction itself as a reasoning task, using the corpus as its own verifier (see "Can next-token prediction become a reasoning task with RL?"), and a related approach, RLP, treats chain-of-thought as an exploratory action during pretraining, rewarded by how much it improves prediction (see "Can chain-of-thought reasoning be learned during pretraining itself?"). In this view, token prediction isn't a poor proxy for reasoning: done at scale, it *is* where reasoning gets installed.
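The shared idea in these methods, rewarding a thought by how much it improves prediction of the true next token, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the training code of Quiet-STaR or RLP: the function names and the hand-written probability tables are hypothetical stand-ins for a real model's output distributions with and without a generated rationale.

```python
import math

def log_prob(dist, token):
    """Log-probability that a distribution (dict: token -> prob) assigns to `token`."""
    return math.log(dist[token])

def prediction_gain(p_without_thought, p_with_thought, true_next_token):
    """Reward for a thought: log p(next | context, thought) - log p(next | context).

    Positive when conditioning on the thought makes the true next token
    more likely, negative when the thought hurts the prediction.
    """
    return (log_prob(p_with_thought, true_next_token)
            - log_prob(p_without_thought, true_next_token))

# Hypothetical next-token distributions for the context "2 + 2 =".
baseline = {"4": 0.25, "5": 0.50, "22": 0.25}        # no rationale
with_thought = {"4": 0.80, "5": 0.10, "22": 0.10}    # after a useful rationale
with_distractor = {"4": 0.10, "5": 0.60, "22": 0.30} # after an unhelpful one

reward = prediction_gain(baseline, with_thought, "4")      # positive: the thought helped
penalty = prediction_gain(baseline, with_distractor, "4")  # negative: the thought hurt
print(f"helpful thought: {reward:+.3f}, unhelpful thought: {penalty:+.3f}")
```

Note that no labeled "correct reasoning" appears anywhere in the signal: the only supervision is the text that actually comes next, which is what lets the corpus act as its own verifier.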

But the same corpus undercuts the idea that the *visible tokens* are the reasoning. Models trained on deliberately corrupted, logically irrelevant traces perform about as well as those trained on correct ones, which suggests the trace works as computational scaffolding, not as a record of thought (see "Do reasoning traces need to be semantically correct?"). And when you push a model outside its training distribution, the chain-of-thought stays fluent while the logic quietly fails: it imitates the *form* of reasoning without the underlying validity (see "Does chain-of-thought reasoning actually generalize beyond training data?"). So the tokens can look like reasoning while doing none of it, which is exactly the gap the question is pointing at.

Stranger still, the real computation often isn't in the tokens at all. Logit-lens analysis of models trained with hidden chain-of-thought shows they can compute the correct answer in early layers and then actively *overwrite* it with format-compliant filler before emitting tokens (see "Do transformers hide reasoning before producing filler tokens?"). And whole families of models (depth-recurrent architectures, Coconut, Heima) scale their reasoning by iterating in continuous latent space with no verbalized steps at all, which suggests that verbalization is a training artifact rather than a requirement for thinking (see "Can models reason without generating visible thinking tokens?"). The token stream, on this evidence, is closer to a *report* of reasoning than the reasoning itself.
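The logit-lens readout mentioned here amounts to decoding each layer's hidden state through the model's output (unembedding) matrix and asking which token that layer already favors. A minimal sketch follows, assuming a two-token vocabulary and hand-picked hidden states chosen to mimic the reported pattern; none of the numbers come from the cited analysis, and the final layer normalization that real implementations usually apply before unembedding is omitted.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def logit_lens(hidden_states, unembed, vocab):
    """Return, for each layer, the token its hidden state scores highest.

    hidden_states: one hidden vector per layer (toy values here).
    unembed: one row per vocabulary token, mapping hidden state -> logit.
    """
    readout = []
    for hidden in hidden_states:
        logits = matvec(unembed, hidden)
        best = max(range(len(vocab)), key=lambda i: logits[i])
        readout.append(vocab[best])
    return readout

# Hypothetical setup: "7" is the correct answer, "." is a filler token.
vocab = ["7", "."]
unembed = [[1.0, 0.0],   # logit for "7"
           [0.0, 1.0]]   # logit for "."
hidden_states = [
    [0.9, 0.1],  # early layer: answer direction dominates
    [0.8, 0.3],  # middle layer: answer still on top
    [0.2, 0.9],  # final layer: filler direction takes over
]

per_layer = logit_lens(hidden_states, unembed, vocab)
print(per_layer)
```

In this toy case the answer token leads in the early layers and the filler token wins only at the last one, which is the shape of result the logit-lens finding describes: the emitted token alone would hide what the earlier layers had computed.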

The payoff: the corpus doesn't treat the token stream as uniform anyway. Only ~20% of tokens are high-entropy 'forking points' that actually carry the learning signal (see "Do high-entropy tokens drive reasoning model improvements?"), specific words like 'Wait' and 'Therefore' spike in mutual information with the right answer (see "Do reflection tokens carry more information about correct answers?"), and models internally rank tokens by function, protecting symbolic computation while discarding grammar and filler (see "Which tokens in reasoning chains actually matter most?"). So 'token prediction' turns out to be too coarse a description in both directions: most tokens aren't doing reasoning work, and the tokens that matter may be pointers to computation happening somewhere the words can't show you. Whether that resembles *human* reasoning is left open, but the framing clearly isn't capturing the thing it appears to name.
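To make the 'forking point' idea concrete, the sketch below computes the Shannon entropy of the next-token distribution at each position and keeps the highest-entropy fifth. It is an illustration under assumptions, not the cited method's code: the distributions are invented, the helper names are hypothetical, and the 20% cutoff is simply taken from the figure quoted above.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a probability distribution given as a list."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def forking_positions(dists, fraction=0.2):
    """Indices of the top `fraction` of positions ranked by entropy."""
    entropies = [entropy(d) for d in dists]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical next-token distributions at five positions of a reasoning trace.
dists = [
    [0.97, 0.01, 0.01, 0.01],   # near-certain continuation
    [0.25, 0.25, 0.25, 0.25],   # genuine fork: four equally likely branches
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.005, 0.0],
    [0.85, 0.05, 0.05, 0.05],
]

forks = forking_positions(dists)
print(forks)
```

Here only the uniform distribution at index 1 is selected. In a training setup along the lines described above, a mask built from such indices would restrict gradient updates to the forking tokens and leave the low-entropy majority untouched.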

Sources (10 notes)

Can models learn reasoning from predicting any text?

Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.

Can next-token prediction become a reasoning task with RL?

Reinforcement Pre-Training transforms next-token prediction into a reasoning task by providing verifiable rewards from the corpus itself, eliminating reward hacking and enabling inference-time scaling during pretraining. This suggests token-level reasoning patterns during pretraining strengthen downstream RL fine-tuning.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do reflection tokens carry more information about correct answers?

Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning, while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens (4.27 match)
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models (3.41 match)
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (2.56 match)
Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs (2.56 match)
Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity (2.54 match)
RLP: Reinforcement as a Pretraining Objective (1.82 match)
Base Models Know How to Reason, Thinking Models Learn When (1.76 match)
Eliciting Reasoning in Language Models with Cognitive Tools (1.76 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tasked with re-evaluating whether next-token prediction genuinely captures reasoning, or whether the visible token stream is an artifact of a deeper computation. This question remains open despite recent advances.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026 and include:
- Token prediction at scale *installs* reasoning as an emergent byproduct; chain-of-thought acts as exploratory action during pretraining, rewarded by prediction improvement (~2024–2025).
- Deliberately corrupted reasoning traces perform nearly as well as correct ones, suggesting tokens are computational scaffolding rather than a faithful thought record (~2025).
- Transformers compute correct answers in early layers, then actively *overwrite* them with format-compliant output before emitting tokens; latent reasoning in continuous space (depth-recurrent, Coconut, Heima) scales without verbalized steps (~2025–2026).
- Only ~20% of tokens are high-entropy 'forking points' that drive learning; 'thinking tokens' spike in mutual information with ground truth; models rank tokens by functional importance (~2025–2026).
- Chain-of-thought reasoning is distribution-bounded — it imitates form while logic silently fails outside training distribution (~2025).

Anchor papers (verify; mind their dates):
- Quiet-STaR (2024-03, arXiv:2403.09629)
- Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens (2025-05, arXiv:2505.13775)
- Is Chain-of-Thought Reasoning of LLMs a Mirage? (2025-08, arXiv:2508.01191)
- Do LLMs Encode Functional Importance of Reasoning Tokens? (2026-01, arXiv:2601.03066)

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, assess whether newer training regimes (constitutional AI, process supervision, outcome-only RL), architectural innovations (MoE routing, attention factorization), or evaluation methods (formal verification, human reasoning traces) have since *relaxed* the apparent limits. Separate the durable question (do tokens capture reasoning?) from perishable limitations (e.g., 'only 20% of tokens matter' — does this still hold with adaptive token allocation?). Cite what resolved or reinforced each constraint.
(2) Surface the strongest *contradicting* or *superseding* work from the last ~6 months. Does any recent paper argue that token-level prediction *does* faithfully capture human reasoning, or that the latent-space view misses something crucial about how reasoning scales?
(3) Propose 2 research questions that *assume the regime has shifted*: (a) If tokens are a low-fidelity report of reasoning, what intervention (training objective, loss term, architecture) best aligns the visible stream with the hidden computation? (b) Can you design a model that *verifiably* reasons in the token stream rather than beyond it — and if so, does it lose efficiency or capability?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
