INQUIRING LINE

Can standard next-token prediction capture complex multi-step human reasoning directly?

This explores whether plain next-token prediction — the basic 'guess the next word' objective — can on its own produce genuine multi-step reasoning, or whether reasoning has to be added through some other mechanism.


The corpus gives a split verdict: the raw objective seems insufficient, but it is a richer foundation than it first appears, and most of the action is in how you shape what gets predicted. A striking finding is that transformers can already do hidden reasoning: a logit-lens analysis of models trained with hidden chain-of-thought (filler) tokens shows them computing correct answers in their earliest layers and then actively overwriting those representations with format-compliant filler tokens, so the reasoning is present but suppressed by the surface objective (Do transformers hide reasoning before producing filler tokens?). That hints the prediction objective is less the bottleneck than what the model is rewarded to surface.

Several threads argue you can coax reasoning out of pure prediction without changing the architecture at all. Quiet-STaR trains a model to generate a private rationale at every token position on ordinary internet text, judging each rationale by whether it improves the next-token prediction, so reasoning emerges as a side effect of better language modeling (Can models learn reasoning from predicting any text?). Reinforcement Pre-Training reframes next-token prediction as a reasoning task by treating the corpus itself as a verifiable reward signal (Can next-token prediction become a reasoning task with RL?), and RLP plants chain-of-thought during pretraining using the model's own log-likelihood gain as a verifier-free reward (Can chain-of-thought reasoning be learned during pretraining itself?). Even just seeding training data with 'lookahead' tokens that smuggle in future information lets a vanilla model learn planning without any architectural change (Can embedding future information in training data improve planning?). The common move: keep next-token prediction, but enrich the target so the gradient flows toward reasoning.
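The idea shared by Quiet-STaR and RLP above is to score a rationale by how much it improves prediction of the token that actually comes next. A minimal sketch of that reward, assuming the model's probability for the observed token is already available with and without the rationale. The function name and the example probabilities are hypothetical and not taken from either paper.

```python
import math

def log_likelihood_gain(p_with_rationale: float, p_without_rationale: float) -> float:
    """Reward = log p(next | context, rationale) - log p(next | context).

    Positive when the rationale made the observed next token more likely.
    Both arguments are illustrative probabilities in (0, 1].
    """
    for p in (p_with_rationale, p_without_rationale):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must be in (0, 1]")
    return math.log(p_with_rationale) - math.log(p_without_rationale)

# Made-up numbers: one rationale doubles the probability of the true
# next token, the other cuts it to a third.
helpful = log_likelihood_gain(0.6, 0.3)    # positive reward
unhelpful = log_likelihood_gain(0.1, 0.3)  # negative reward
```

No external verifier or labeled answer appears anywhere in this computation; the text itself supplies the target, which is what lets the reward run on ordinary pretraining data.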

The skeptical camp says the surface form fools us. Chain-of-thought, on this view, is largely constrained imitation of reasoning patterns seen in training rather than genuine inference: performance degrades predictably the moment you shift task, length, or format away from the training distribution (Does chain-of-thought reasoning reveal genuine inference or pattern matching?; Does chain-of-thought reasoning actually generalize beyond training data?). Probing further, when you strip familiar meaning out of a problem and leave only the logical structure, models collapse; they reason through semantic association, not symbolic manipulation (Do large language models reason symbolically or semantically?). And reasoning quietly falls apart with longer inputs well before the context window fills, dropping from 92% to 68% accuracy with about 3,000 tokens of padding (Does reasoning ability actually degrade with longer inputs?). So 'direct' multi-step reasoning from prediction is partly real, partly mimicry that breaks under stress.

What sharpens the picture is that the reasoning signal lives in a small minority of tokens. Only about 20% of tokens are high-entropy 'forking points' where the model genuinely decides, and training on just those matches or exceeds full-gradient performance (Do high-entropy tokens drive reasoning model improvements?). Pruning experiments also show tokens being ranked by function, with symbolic-computation tokens preferentially preserved while grammar and filler are discarded first (Which tokens in reasoning chains actually matter most?). This reframes the whole question: next-token prediction treats every token equally, but reasoning concentrates in a handful of pivotal choices, which is why merely predicting text well doesn't automatically yield reliable reasoning.
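To make the 'forking point' idea concrete, here is a small sketch that computes the entropy of each next-token distribution and marks the top 20% of positions by entropy. The helper names and the toy distributions are hypothetical; a real setup would read these distributions from a model and use the mask to restrict which positions contribute to the update.

```python
import math
from typing import List

def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_mask(distributions: List[List[float]], keep_fraction: float = 0.2) -> List[bool]:
    """Mark the top `keep_fraction` of positions, ranked by entropy, as forking tokens."""
    entropies = [token_entropy(d) for d in distributions]
    n_keep = max(1, round(keep_fraction * len(entropies)))
    # Position indices sorted from highest to lowest entropy.
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(entropies))]

# Toy sequence: 5 positions over a 3-token vocabulary (made-up numbers).
dists = [
    [0.98, 0.01, 0.01],    # confident continuation
    [0.34, 0.33, 0.33],    # near-uniform: the model is genuinely deciding
    [0.90, 0.05, 0.05],
    [0.99, 0.005, 0.005],
    [0.80, 0.10, 0.10],
]
mask = forking_mask(dists)
```

With five positions and a 20% budget, only the near-uniform second position is selected; the confident continuations around it would contribute nothing to an update restricted to forking tokens.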

The takeaway you might not have expected: the debate isn't really 'prediction vs. reasoning' but how much structure you inject into the prediction target. Models trained on prediction can hold latent reasoning that they then hide, and training regime, not raw compute, decides whether that latent capacity becomes usable: non-reasoning models can't close the gap no matter how much inference budget you throw at them, because the reasoning protocol has to be instilled during training (Can non-reasoning models catch up with more compute?; Can chain-of-thought reasoning be learned during pretraining itself?). So the honest answer is: not cleanly on its own, but the gap is closed by reshaping what the model predicts, not by abandoning prediction.


Sources (12 notes)

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
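For readers unfamiliar with the technique, a logit lens projects each layer's hidden state through the output (unembedding) matrix to see which tokens that layer already favors. The toy below only illustrates the mechanics: the two-dimensional hidden states, the three-token vocabulary, and all numbers are invented for illustration, and the final layer normalization that real implementations usually apply is omitted.

```python
from typing import List

def logit_lens(hidden_by_layer: List[List[float]],
               unembed: List[List[float]]) -> List[List[int]]:
    """For each layer, project the hidden state onto the vocabulary and
    return token ids ranked by logit, highest first."""
    rankings = []
    for h in hidden_by_layer:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        rankings.append(sorted(range(len(logits)), key=lambda t: logits[t], reverse=True))
    return rankings

# Hypothetical vocabulary: 0 = answer token, 1 = filler token, 2 = unrelated.
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

# Made-up hidden states, one per layer.
hidden = [
    [2.0, 0.1],  # early layer: the answer direction dominates
    [2.0, 1.0],
    [1.5, 3.0],  # final layer: the filler direction dominates
]
ranks = logit_lens(hidden, unembed)
```

In this toy the answer token is top-ranked in the early layers and falls to second place at the final layer, which mirrors the note's point that the suppressed answer stays recoverable from lower-ranked predictions.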

Can models learn reasoning from predicting any text?

Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.

Can next-token prediction become a reasoning task with RL?

Reinforcement Pre-Training transforms next-token prediction into a reasoning task by providing verifiable rewards from the corpus itself, eliminating reward hacking and enabling inference-time scaling during pretraining. This suggests token-level reasoning patterns during pretraining strengthen downstream RL fine-tuning.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can embedding future information in training data improve planning?

TRELAWNEY augments training data with special tokens encapsulating future information, allowing models to learn goal-conditioned generation using standard infrastructure. Results show improved planning, algorithmic reasoning, and story generation without modifying architecture or training procedures.
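A rough sketch of this kind of data augmentation, assuming the future information is simply a span copied from later in the same sequence and wrapped in delimiter tokens. The delimiter strings, function name, and example sentence are hypothetical; the note does not specify how spans or insertion points are chosen.

```python
from typing import List

# Hypothetical delimiter strings; the method's actual special tokens may differ.
LOOKAHEAD_OPEN = "<T>"
LOOKAHEAD_CLOSE = "</T>"

def add_lookahead(tokens: List[str], insert_at: int,
                  future_start: int, future_len: int) -> List[str]:
    """Copy a span from later in the sequence and insert it, wrapped in
    delimiters, at an earlier position. The original tokens are kept intact."""
    if not (0 <= insert_at <= future_start and future_start + future_len <= len(tokens)):
        raise ValueError("future span must lie at or after the insertion point")
    future = tokens[future_start:future_start + future_len]
    return tokens[:insert_at] + [LOOKAHEAD_OPEN, *future, LOOKAHEAD_CLOSE] + tokens[insert_at:]

story = "the knight set out and finally reached the castle".split()
augmented = add_lookahead(story, insert_at=2, future_start=6, future_len=3)
```

The augmented sequence is still an ordinary token list, so it can go through an unmodified next-token training pipeline, which is the sense in which no architectural or procedural change is needed.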

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capabilities researcher re-evaluating whether next-token prediction alone can support genuine multi-step reasoning in LLMs — a question that remains open despite rapid progress. The constraint you're testing: does the raw supervised objective capture reasoning directly, or must reasoning be engineered into the training target?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as a snapshot at its date.
- Transformers trained with hidden CoT (filler) tokens compute correct answers in early layers but actively suppress them in final layers to emit filler tokens, implying the bottleneck is what gets rewarded to surface, not prediction itself (2024-12).
- Chain-of-thought emerges from pure next-token prediction when training includes private rationales scored by prediction gain, without architectural changes (2024-03).
- Chain-of-thought degrades predictably off-distribution (task shift, length shift, format shift), and reasoning breaks under input padding well before the context window fills, suggesting constrained imitation rather than abstract reasoning (2024-02, 2025-08).
- Only ~20% of tokens are high-entropy 'forking points' where genuine decisions occur; training on just those matches full-gradient performance (2025-06).
- Non-reasoning models cannot match reasoning-trained models even with unlimited inference budget; the reasoning protocol must be instilled at training time (2025-04).

Anchor papers (verify; mind their dates):
- arXiv:2403.09629 (Quiet-STaR, 2024-03) — rationale generation as token-level side effect
- arXiv:2412.04537 (Hidden Computations, 2024-12) — early-layer reasoning suppression
- arXiv:2508.01191 (CoT as Mirage, 2025-08) — distribution-bounded imitation
- arXiv:2506.01939 (High-Entropy Tokens, 2025-06) — functional concentration in reasoning

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1-family, r1-class reasoners, 2026+ releases), improved training (DPO, PPO variants on reasoning, curriculum learning), inference-time tools (process reward models, verification oracles), or orchestration (multi-agent debate, external symbolic engines) have since RELAXED or OVERTURNED it. Separate the durable question—*Can next-token prediction be the sole inductive principle for reasoning?*—from perishable limitations (e.g., *CoT breaks at 10K tokens*). Cite what resolved each constraint, and say plainly where it still holds.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months (2026-04 onward). Does any recent paper claim reasoning models are not fundamentally different from predictive models, or that pure prediction plus scale now suffices?

(3) Propose 2 research questions that ASSUME the training regime may have shifted since 2026-01:
   - How does the token-functional hierarchy change under test-time scaling (inference-time search vs. training-time reasoning injection)?
   - Can you measure whether the reasoning bottleneck has migrated from *which tokens matter* to *which supervision signal teaches token selection*?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
