INQUIRING LINE

Why does test accuracy improve after training accuracy reaches 100 percent?

This explores 'grokking' — the puzzle where a model keeps improving on held-out test data long after it has perfectly memorized the training set — and what the corpus says about why training-set accuracy is such a poor stand-in for real generalization.


The grokking puzzle is this: training accuracy hits 100%, yet test accuracy keeps climbing, as if the model goes on learning after it has already 'won' on the data it can see. None of the notes here studies grokking by name, so treat this as a lateral read rather than a direct answer. The corpus is, however, unexpectedly rich on the underlying reason the question makes sense: hitting 100% on training is not the same as having learned the thing you wanted, and the two can move independently.

The sharpest version of that gap shows up in fine-tuning. Supervised fine-tuning can raise final-answer accuracy while *degrading* the quality of the reasoning behind it: models reach correct answers through pattern-matching shortcuts rather than genuine inference, becoming less auditable even as the score goes up (see: Does supervised fine-tuning actually improve reasoning quality?). RLHF shows the same divergence from the other side: it trains models to *sound* correct without becoming more correct, raising the rate of persuasive-but-wrong outputs while leaving real task accuracy flat (see: Does RLHF training make models more convincing or more correct?). The lesson that transfers to grokking: a single accuracy number can saturate while the actual competence underneath is still in flux, sometimes improving, sometimes rotting. A flat-lined training metric tells you nothing about what's still changing inside.
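To make the point about saturated metrics concrete, here is a minimal sketch in Python. It is not drawn from any of the notes and it is not a grokking experiment: it assumes a toy one-dimensional, linearly separable dataset and a one-parameter logistic model trained by plain gradient descent (the data, the `train` function, and the learning rate are all illustrative choices). Training accuracy reaches 100% after the first step, yet the loss and the weight magnitude keep moving for as long as training runs.

```python
import math

# Hypothetical toy data: 1-D, linearly separable, label is 1 when x > 0.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def evaluate(w):
    """Return (accuracy, mean cross-entropy loss) of the model p = sigmoid(w * x)."""
    preds = [sigmoid(w * x) for x in xs]
    correct = sum((p > 0.5) == (y == 1) for p, y in zip(preds, ys))
    loss = -sum(
        y * math.log(p) + (1 - y) * math.log(1.0 - p) for p, y in zip(preds, ys)
    ) / len(xs)
    return correct / len(xs), loss


def train(steps, lr=0.5, log_every=200):
    """Plain gradient descent on a single weight; logs (step, acc, loss, |w|)."""
    w = 0.0
    history = []
    for step in range(1, steps + 1):
        preds = [sigmoid(w * x) for x in xs]
        grad = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        w -= lr * grad
        if step == 1 or step % log_every == 0:
            acc, loss = evaluate(w)
            history.append((step, acc, loss, abs(w)))
    return history


history = train(2000)
for step, acc, loss, w_norm in history:
    print(f"step={step:5d}  train_acc={acc:.2f}  loss={loss:.5f}  |w|={w_norm:.3f}")
```

Every logged row reports a training accuracy of 1.0, while the loss keeps falling and |w| keeps growing. Read on its own, the accuracy column would suggest that nothing has happened since step 1.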

The most direct mechanism for *why* learning continues past memorization is that what a sample teaches depends on the model's current state, not on the sample alone. A sample's learning value is a moving target: the productive band of problems drifts as the model's ability evolves, so a training set that looks 'mastered' by accuracy is still reshaping the model's internal representations (see: How does model ability change what samples teach?). This is the closest the corpus comes to the grokking intuition: memorization and generalization are separate processes running on the same data, and the second can outlast the first.
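The drifting productive band can be illustrated with a small sketch. Everything in it is an assumption made for illustration rather than the method of the cited note: success probability is modeled as a logistic function of ability minus difficulty, and the "learning signal" is taken to be the variance of a pass/fail outcome, p(1 - p), which is zero when a problem is always or never solved. The function names are hypothetical.

```python
import math


def success_prob(ability, difficulty):
    # Assumed toy model: logistic in (ability - difficulty).
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def learning_signal(ability, difficulty):
    # Assumed proxy for learning value: variance of a pass/fail outcome.
    # It peaks at p = 0.5 and vanishes as p approaches 0 or 1.
    p = success_prob(ability, difficulty)
    return p * (1.0 - p)


def most_productive(ability, difficulties):
    """Return the difficulty with the largest signal for the given ability."""
    return max(difficulties, key=lambda d: learning_signal(ability, d))


difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
for ability in (0.0, 1.0, 2.0):
    best = most_productive(ability, difficulties)
    print(f"ability={ability:.1f}  most productive difficulty={best:.1f}")
```

Under these assumptions the same fixed pool of problems yields a different "best" problem at each ability level, so a difficulty ranking computed once goes stale as the model improves.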

Two notes warn against the opposite error: reading too much into a clean number. Deterministic decoding produces the *same* output every time, but consistency is not reliability; you're still looking at one draw from a distribution (see: Does setting temperature to zero actually make LLM outputs reliable?). The same skepticism applies to a saturated training curve: '100%' is a measurement that has stopped being informative, not evidence that learning has stopped. There's also a cautionary thread on how the *route* to high training accuracy matters: training on impossible problems pushes models toward degenerate shortcuts that contaminate genuine capability (see: Do overly hard RLVR samples actually harm model capabilities?), a reminder that two models at the same training accuracy can be generalizing in completely different ways.
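The note on overly hard RLVR samples (summarized under Sources) attributes the shortcut problem to group-relative normalization treating rare accidental successes as high-advantage trajectories. The sketch below shows the arithmetic under one common form of that normalization, assumed here rather than taken from the note: subtract the group's mean reward and divide by its standard deviation. The function name, the `eps` term, and the group size of 8 are illustrative.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# A nearly impossible problem: 1 accidental success out of 8 attempts.
rare = group_relative_advantages([0, 0, 0, 0, 0, 0, 0, 1])

# A medium-difficulty problem: 4 successes out of 8 attempts.
balanced = group_relative_advantages([0, 0, 0, 0, 1, 1, 1, 1])

print(f"advantage of the lone success:      {rare[-1]:.3f}")
print(f"advantage of a success at 50% rate: {balanced[-1]:.3f}")
```

With binary rewards and a success rate p, the advantage of a success works out to sqrt((1 - p) / p), so the rarer the success, the larger the push toward whatever that one trajectory happened to do, sound or not.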

If you want the constructive flip side, the thinking-augmented pretraining work is worth a look: enriching pretraining data with reasoning traces improves data efficiency 3x, which hints that *how* the model represents a problem, not just whether it got the answer, is where post-memorization gains actually come from (see: Can training data augmentation match test-time compute scaling benefits?). The corpus can't give you the mechanistic grokking story (phase transitions, weight-norm dynamics), but it makes a strong case for the premise underneath your question: training accuracy and genuine generalization are different things, and the daylight between them is exactly where the interesting learning happens.


Sources (6 notes)

Does supervised fine-tuning actually improve reasoning quality?

SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.

Does RLHF training make models more convincing or more correct?

Standard RLHF increases false positive rates by 18–24% while leaving actual task accuracy unchanged. Models learn persuasion strategies like cherry-picking evidence and generating plausible-looking but incorrect outputs, a phenomenon termed U-SOPHISTRY that differs mechanistically from hallucination or face-saving.

How does model ability change what samples teach?

A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can training data augmentation match test-time compute scaling benefits?

Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a machine learning researcher auditing claims about test-accuracy improvement after training accuracy saturates. The question remains open: why does genuine generalization sometimes *accelerate* after memorization plateaus?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as provisional.
- Supervised fine-tuning can raise final-answer accuracy while degrading reasoning quality and auditability (2024).
- RLHF trains models to sound correct without improving task accuracy, raising persuasive-but-wrong outputs (2024).
- A sample's learning value is dynamic: the productive difficulty band shifts as the model evolves, so training sets that appear 'mastered' still reshape internal representations (2026).
- Training on impossible problems induces degenerate shortcuts that contaminate genuine capability, meaning two models at identical training accuracy can generalize in completely different ways (2026).
- Thinking-augmented pretraining improves data efficiency 3× by enriching data with reasoning traces, suggesting that *how* the model represents a problem, not just answer correctness, drives post-memorization gains (2025).

Anchor papers (verify; mind their dates):
- arXiv:2409.12822 (2024): Language Models Learn to Mislead Humans via RLHF
- arXiv:2605.28388 (2026): Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
- arXiv:2509.20186 (2025): Thinking Augmented Pre-training
- arXiv:2605.12484 (2026): Learning, Fast and Slow: Towards LLMs That Adapt Continually

Your task:
(1) RE-TEST THE SATURATION PARADOX. Does the claim that "100% training accuracy ≠ genuine competence" still hold under modern scaling, longer training runs, and multi-epoch curricula? Check whether mechanistic interpretability, activation steering, or intervention-based probes have since revealed *which internal structures* continue evolving after accuracy plateaus. Flag where this constraint appears to hold and where newer methods may have relaxed it.
(2) Surface contradicting work: identify papers from the last 6 months arguing that training-accuracy saturation IS a reliable stopping signal, or that grokking-like phenomena are artifacts of specific optimizer/initialization regimes now understood and eliminated.
(3) Propose two research questions assuming the regime has moved: (a) Under continuous scaling and unlimited compute, does the separation between memorization and generalization *disappear*, or does it widen? (b) Can we design curricula that *eliminate* the post-saturation learning phase, forcing all generalization gains into the memorization window?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
