INQUIRING LINE

How does predictive accuracy on future tokens differ from correctness on labeled answers?

This explores the gap between two things we might call 'right': a model predicting what token comes next in a corpus (self-supervised likelihood), versus a model landing on the answer a label says is correct — and what the corpus reveals about when those two come apart.


Predictive accuracy on future tokens measures how well a model guesses the next word from raw text; correctness on labeled answers depends on an external signal that says 'this is the right solution.' The two look similar (both reward 'getting it right'), but the corpus shows they pull on different parts of the model, and the most interesting work lives in the gap between them.
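As a rough illustration of the two notions of 'right', the sketch below scores two hypothetical models both ways: by the average log-probability they assign to held-out text, and by whether their final answer matches a label. All numbers, names, and answers here are made up for illustration; none come from the cited notes.

```python
import math

def mean_token_logprob(token_probs):
    """Token-level score: average log-probability assigned to the tokens that actually occurred."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def exact_match(predicted_answer, labeled_answer):
    """Answer-level score: 1.0 if the final answer equals the label, else 0.0."""
    return float(predicted_answer.strip() == labeled_answer.strip())

# Hypothetical data: probabilities each model gave to the true next tokens of the
# same held-out passage, plus each model's final answer to one labeled question.
fluent_model = {"token_probs": [0.9, 0.8, 0.95, 0.9], "answer": "42"}
halting_model = {"token_probs": [0.4, 0.3, 0.6, 0.5], "answer": "41"}
label = "41"

for name, out in [("fluent_model", fluent_model), ("halting_model", halting_model)]:
    likelihood = mean_token_logprob(out["token_probs"])
    correct = exact_match(out["answer"], label)
    print(f"{name}: mean log-prob={likelihood:.3f}, exact match={correct}")
```

In this toy setup the model with the better token-level score is the one that gets the labeled answer wrong, which is the kind of divergence the rest of this line of inquiry is about.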

The cleanest bridge between the two is Reinforcement Pre-Training (Can next-token prediction become a reasoning task with RL?), which turns next-token prediction itself into a verifiable task: the corpus *is* the label, so predicting the next token becomes a reasoning problem with a built-in correctness check. That reframing matters because it exposes what ordinary pretraining hides: not all tokens carry the same weight. Only about 20% of tokens are high-entropy 'forking points' where reasoning actually branches, and restricting RLVR updates to just those tokens matches or exceeds full-gradient performance (Do high-entropy tokens drive reasoning model improvements?). Specific tokens like 'Wait' and 'Therefore' spike in mutual information with the correct answer, and suppressing them hurts reasoning while suppressing an equal number of random tokens does not (Do reflection tokens carry more information about correct answers?). So token-level prediction and answer-level correctness aren't uniformly linked: a minority of tokens carries most of the signal that connects the two.
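One way to picture the forking-token idea is a small sketch that computes the entropy of each next-token distribution, marks the top 20% of positions as forking points, and averages a per-token loss over only those positions. This is an assumption-laden toy, not the method from the cited note: the function names, the 0.2 default, and the example distributions are all hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def forking_mask(dists, fraction=0.2):
    """Mark the `fraction` of positions with the highest entropy as forking tokens."""
    ents = [entropy(d) for d in dists]
    k = max(1, round(fraction * len(ents)))  # keep at least one position
    top = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)[:k]
    mask = [False] * len(ents)
    for i in top:
        mask[i] = True
    return mask

def masked_loss(token_losses, mask):
    """Average the per-token loss over forking positions only."""
    kept = [loss for loss, keep in zip(token_losses, mask) if keep]
    return sum(kept) / len(kept)

# Hypothetical next-token distributions at five positions (each sums to 1).
dists = [
    [0.97, 0.01, 0.01, 0.01],  # near-certain continuation
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain: a forking point
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.01, 0.00, 0.00],
    [0.80, 0.10, 0.05, 0.05],
]
mask = forking_mask(dists)
print(mask)
print(masked_loss([1.0, 2.0, 3.0, 4.0, 5.0], mask))
```

In a real training loop the mask would multiply the per-token loss or policy-gradient term before reduction; the plain-Python version here only shows the selection logic.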

The more surprising finding is how far the two can decouple. Models trained on *deliberately corrupted* reasoning traces (traces made systematically irrelevant to the problem) stay just as accurate on final answers, and sometimes generalize better out of distribution (Do reasoning traces need to be semantically correct?). In other words, the tokens a model emits don't have to be semantically true for the labeled answer to come out right. They function as computational scaffolding, not meaning. The flip side: under logit-lens analysis, models trained with hidden chain-of-thought tokens compute the correct answer in their first few layers and then actively *overwrite* it in the final layers with format-compliant filler before producing output (Do transformers hide reasoning before producing filler tokens?). Internal correctness and the tokens you actually predict can diverge inside a single forward pass.
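The logit-lens idea mentioned above can be sketched with toy numbers: project each layer's hidden state through the unembedding matrix and see which token ranks first at that depth. Everything here is hypothetical (a three-token vocabulary, two-dimensional hidden states, hand-picked weights), and a real logit lens would also apply the model's final normalization, which this sketch omits.

```python
def logit_lens(hidden_states, unembed, vocab):
    """For each layer, rank vocab tokens by the logits obtained from that layer's hidden state."""
    rankings = []
    for h in hidden_states:
        # One logit per vocab token: dot product of its unembedding row with the hidden state.
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        order = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
        rankings.append([vocab[i] for i in order])
    return rankings

# Hypothetical toy model: "7" is the correct answer, "..." is a filler token.
vocab = ["7", "...", "the"]
unembed = [
    [1.0, 0.0],    # row for "7"
    [0.0, 1.0],    # row for "..."
    [-1.0, -1.0],  # row for "the"
]
hidden_states = [
    [2.0, 0.1],  # early layer: the answer direction dominates
    [1.5, 0.5],  # middle layer
    [0.5, 2.0],  # final layer: the filler direction dominates
]

for layer, ranking in enumerate(logit_lens(hidden_states, unembed, vocab), start=1):
    print(f"layer {layer}: {ranking}")
```

In this toy the answer token is ranked first in the early layers and drops to second place at the final layer, mirroring the claim that the reasoning remains recoverable from lower-ranked predictions.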

This is why optimizing for one doesn't cleanly buy you the other. Longer chains of thought (more predicted tokens) help accuracy only up to a point, then hurt: an inverted-U whose peak moves to longer chains on harder tasks and to shorter chains for more capable models, with RL naturally driving toward shorter chains as models improve (Why does chain of thought accuracy eventually decline with length?). And a model's confidence in its own predicted tokens is a poor proxy for correctness: models systematically over-trust answers they generated themselves, because high-probability tokens simply *feel* right during self-evaluation (Why do models trust their own generated answers?). The fix isn't better prediction; it's calibration. Small models trained with uncertainty-aware objectives that let them *abstain* when unsure can match pre-trained models 10x their size on conversation forecasting (Can models learn to abstain when uncertain about predictions?).
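To make the confidence-versus-correctness point concrete, here is a minimal sketch that measures an overconfidence gap (mean stated confidence minus actual accuracy) and then reports coverage and accuracy when the model abstains below a confidence threshold. Note the substitution: the cited note describes models *trained* with uncertainty-aware objectives, whereas this sketch uses a simple post-hoc threshold. The data and function names are hypothetical.

```python
def overconfidence_gap(predictions):
    """Mean confidence minus accuracy; positive values mean the model over-trusts itself."""
    mean_conf = sum(conf for conf, _ in predictions) / len(predictions)
    accuracy = sum(ok for _, ok in predictions) / len(predictions)
    return mean_conf - accuracy

def selective_report(predictions, threshold):
    """Abstain whenever confidence is below `threshold`; return (coverage, accuracy on answered)."""
    answered = [(conf, ok) for conf, ok in predictions if conf >= threshold]
    coverage = len(answered) / len(predictions)
    accuracy = sum(ok for _, ok in answered) / len(answered) if answered else None
    return coverage, accuracy

# Hypothetical (confidence, is_correct) pairs for five answers.
preds = [(0.95, True), (0.90, False), (0.85, True), (0.60, False), (0.55, False)]

print("overconfidence gap:", overconfidence_gap(preds))
print("answer everything:", selective_report(preds, threshold=0.0))
print("abstain below 0.8:", selective_report(preds, threshold=0.8))
```

The design choice worth noticing is that abstention trades coverage for accuracy on what remains; whether that trade is worthwhile depends on how costly a wrong answer is relative to no answer.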

The thing worth taking away: predictive accuracy is about plausibility (does this token fit the distribution?), while labeled correctness is about truth (does this answer match reality?). The corpus suggests the real engineering lever is learning *where* in the token stream those two coincide, such as the forking tokens and the lookahead signals you can embed in training data (Can embedding future information in training data improve planning?), rather than assuming a fluent prediction and a correct answer are the same achievement.


Sources (9 notes)

Can next-token prediction become a reasoning task with RL?

Reinforcement Pre-Training transforms next-token prediction into a reasoning task by providing verifiable rewards from the corpus itself, eliminating reward hacking and enabling inference-time scaling during pretraining. This suggests token-level reasoning patterns during pretraining strengthen downstream RL fine-tuning.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do reflection tokens carry more information about correct answers?

Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing equal random tokens does not, and representation recycling improves accuracy 20%.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Can embedding future information in training data improve planning?

TRELAWNEY augments training data with special tokens encapsulating future information, allowing models to learn goal-conditioned generation using standard infrastructure. Results show improved planning, algorithmic reasoning, and story generation without modifying architecture or training procedures.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating claims about the gap between next-token prediction accuracy and labeled-answer correctness in LLMs. The question remains open: do these two objectives measure the same capability, or fundamentally different ones?

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–May 2026; treat as perishable.

• Only ~20% of tokens are high-entropy 'forking points' where reasoning branches; restricting RLVR updates to just those tokens matches or exceeds full-gradient performance (Jun 2025).
• Models trained on deliberately corrupted reasoning traces stay as accurate on final answers and sometimes generalize better — tokens function as computational scaffolding, not meaning (May 2025).
• Transformers compute correct answers in early layers, then *overwrite* them with format-compliant filler before output (Dec 2024).
• Optimal chain-of-thought length follows an inverted-U; RL naturally drives toward shorter chains as models improve (Feb 2025).
• Models systematically over-trust answers they generated themselves; small models trained to abstain when uncertain match 10× larger pre-trained models on conversation forecasting (Feb 2024).

Anchor papers (verify; mind their dates):
• arXiv:2506.09038 (AbstentionBench, Jun 2025)
• arXiv:2505.13775 (Reasonless Intermediate Tokens, May 2025)
• arXiv:2412.04537 (Hidden Computations, Dec 2024)
• arXiv:2510.01265 (RLP, Sep 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 20% high-entropy claim, corrupted-traces finding, and the inverted-U CoT length pattern: have newer model scaling laws, instruction-tuning variants, or multi-step verifiers since relaxed or overturned these? Separate the durable question ('are token prediction and answer correctness decoupled?') from perishable limits ('this decoupling is permanent'). Cite what resolved or confirmed each.

(2) Surface contradicting or superseding work from the last ~6 months showing either that token prediction and answer correctness *do* align tightly, or that the gap you found was an artifact of training method, not fundamental.

(3) Propose 2 research questions assuming the regime may have moved: (a) Does end-to-end RL on answer correctness gradually *re-couple* token prediction and truth? (b) Can interpretability methods localize where the overwrite happens and edit it?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
