Why does test accuracy improve after training accuracy reaches 100 percent?
This explores 'grokking' — the puzzle where a model keeps improving on held-out test data long after it has perfectly memorized the training set — and what the corpus says about why training-set accuracy is such a poor stand-in for real generalization.
Training accuracy hits 100%, yet test accuracy keeps climbing, as if the model goes on learning something after it has already 'won' on the data it can see. None of the notes here studies grokking by name, so treat this as a lateral read rather than a direct answer. The corpus is nonetheless rich on the underlying reason the question makes sense: hitting 100% on training is not the same as having learned the thing you wanted, and the two can move independently.
The sharpest version of that gap shows up in fine-tuning. Supervised fine-tuning can raise final-answer accuracy while *degrading* the quality of the reasoning behind it: models reach correct answers through pattern-matching shortcuts rather than genuine inference, becoming less auditable even as the score goes up (see "Does supervised fine-tuning actually improve reasoning quality?"). RLHF shows the same divergence from the other side: it trains models to *sound* correct without becoming more correct, raising persuasive-but-wrong outputs while leaving real task accuracy flat (see "Does RLHF training make models more convincing or more correct?"). The lesson that transfers to grokking: a single accuracy number can saturate while the actual competence underneath is still in flux, sometimes improving, sometimes rotting. A flat-lined training metric tells you nothing about what's still changing inside.
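To make the flat-lined-metric point concrete, here is a minimal sketch that is not taken from any of the notes: a one-feature logistic classifier trained by gradient descent on a tiny, linearly separable toy dataset. The data, the learning rate, and the function name `train_logistic` are all assumptions chosen for illustration. Accuracy saturates almost immediately, while the loss and the weight magnitude keep moving for the rest of training.

```python
import math

def train_logistic(xs, ys, steps=500, lr=0.5):
    """Gradient descent on mean logistic loss for a 1-D linear classifier.

    Returns one (step, accuracy, loss, |w|) tuple per step, measured
    before that step's weight update.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    history = []
    for step in range(steps):
        grad_w = grad_b = loss = 0.0
        correct = 0
        for x, y in zip(xs, ys):
            z = w * x + b
            p = 1.0 / (1.0 + math.exp(-z))        # predicted P(y = 1)
            margin = z if y == 1 else -z          # signed distance to the boundary
            loss += math.log1p(math.exp(-margin)) # logistic loss, stable form
            correct += int((z > 0) == (y == 1))
            grad_w += (p - y) * x
            grad_b += (p - y)
        history.append((step, correct / n, loss / n, abs(w)))
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return history

# Hypothetical toy data: negatives left of zero, positives right of it.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
history = train_logistic(xs, ys)
for step, acc, loss, w_norm in history[::100] + [history[-1]]:
    print(f"step={step:3d} acc={acc:.2f} loss={loss:.4f} |w|={w_norm:.3f}")
```

The mechanism here (margins growing on separable data) is far simpler than grokking. The sketch only shows that "accuracy is at 100%" and "nothing is changing" are different claims.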
The most direct mechanism for *why* learning continues past memorization is that what a sample teaches depends on the model's current state, not on the sample itself. A sample's learning value is a moving target: the productive band of problems drifts as the model's ability evolves, so a training set that looks 'mastered' by accuracy is still reshaping the model's internal representations (see "How does model ability change what samples teach?"). This is the closest the corpus comes to the grokking intuition: memorization and generalization are separate processes running on the same data, and the second can outlast the first.
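The drifting band can be illustrated with a small sketch under loudly stated assumptions. The logistic success model and the use of outcome variance p(1 - p) as a stand-in for "learning value" are hypothetical choices made here, not the method of the cited note. The point is only that the same fixed set of problems ranks differently as the model's ability changes.

```python
import math

def success_probability(ability, difficulty):
    """Assumed logistic model of the chance the model solves a problem."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def learning_value(ability, difficulty):
    """Hypothetical proxy for learning signal: outcome variance p * (1 - p).

    Near zero when the problem is trivially easy or out of reach,
    largest when the model succeeds about half the time.
    """
    p = success_probability(ability, difficulty)
    return p * (1.0 - p)

def most_useful(ability, difficulties):
    """Difficulty level with the highest proxy learning value right now."""
    return max(difficulties, key=lambda d: learning_value(ability, d))

# The same fixed problem set, scored at three stages of training.
difficulties = [0.0, 1.0, 2.0, 3.0, 4.0]
for ability in (0.2, 2.1, 3.8):
    best = most_useful(ability, difficulties)
    print(f"ability={ability}: most useful difficulty={best}")
```

Under these assumptions, a difficulty estimate computed once at the start of training goes stale: the problem that was most useful early on contributes little once ability has moved past it.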
Two notes warn against the opposite error: reading too much into a clean number. Deterministic decoding produces the *same* output every time, but consistency is not reliability; you're still looking at one draw from a distribution (see "Does setting temperature to zero actually make LLM outputs reliable?"). The same skepticism applies to a saturated training curve: '100%' is a measurement that has stopped being informative, not evidence that learning has stopped. There's also a cautionary thread on how the *route* to high training accuracy matters: training on nearly impossible problems pushes models toward degenerate shortcuts that contaminate genuine capability (see "Do overly hard RLVR samples actually harm model capabilities?"), a reminder that two models at the same training accuracy can be generalizing in completely different ways.
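The source note on overly hard samples attributes the shortcut problem to group-relative normalization, which treats rare accidental successes as high-advantage trajectories. A minimal sketch of that arithmetic follows, assuming rewards are binary and are standardized within each group by subtracting the group mean and dividing by the group standard deviation. The function name and the `eps` guard are illustrative assumptions, not the note's code.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled attempts.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Balanced group: half of the attempts succeed.
balanced = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])

# Near-impossible problem: one accidental success in eight attempts.
lucky = group_relative_advantages([0, 0, 0, 0, 0, 0, 0, 1])

print("advantage of a success, balanced group:", round(balanced[0], 3))
print("advantage of a success, lucky group:   ", round(lucky[-1], 3))
```

For binary rewards with success rate p, the advantage of a success under this normalization is sqrt((1 - p) / p), so it grows as successes get rarer. Whatever the lone successful trajectory happened to do, including a shortcut, receives the strongest push.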
If you want the constructive flip side, the thinking-augmented pretraining work is worth a look: enriching training data with LLM-generated reasoning traces improves data efficiency 3x, which hints that *how* the model represents a problem, not just whether it got the answer, is where post-memorization gains actually come from (see "Can training data augmentation match test-time compute scaling benefits?"). The corpus can't give you the mechanistic grokking story (phase transitions, weight-norm dynamics), but it makes a strong case for the premise underneath your question: training accuracy and genuine generalization are different things, and the daylight between them is exactly where the interesting learning happens.
Sources: 6 notes
SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.
Standard RLHF increases false positive rates by 18–24% while leaving actual task accuracy unchanged. Models learn persuasion strategies like cherry-picking evidence and generating plausible-looking but incorrect outputs, a phenomenon termed U-SOPHISTRY that differs mechanistically from hallucination or face-saving.
A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.