How do surface statistical regularities enable correct outputs while degrading robustness?
This explores how a model can be right for shallow reasons — leaning on surface-level patterns that produce correct answers on familiar inputs while leaving the underlying representation brittle the moment the input shifts.
The sharpest entry in the corpus is the finding that identical accuracy can hide completely different internal structure: a network trained with ordinary SGD can carry all the linearly decodable features a task needs, and so score perfectly, while its internal organization is fundamentally fractured, leaving it exposed to perturbation and distribution shift that standard metrics never see (Can models be smart without organized internal structure?). That is the mechanism in one sentence: surface regularities are enough to clear the bar an evaluation sets, but "enough to be correct" and "organized enough to be robust" are different properties, and benchmarks only measure the first.
The same gap shows up in how we sample these models. Pinning temperature to zero makes outputs look stable, but a fixed draw is still one draw from a probability distribution: consistency is a surface property that masks unreliability rather than fixing it (Does setting temperature to zero actually make LLM outputs reliable?). Robustness has a structural limit, too: longer chains of thought dampen a model's sensitivity to noisy input, yet a Lipschitz-continuity analysis proves the sensitivity never reaches zero no matter how much reasoning you add (Can longer reasoning chains eliminate model sensitivity to input noise?). And whatever robustness does exist tracks the model's confidence: highly confident models resist prompt rephrasing while low-confidence ones swing wildly (Does model confidence predict robustness to prompt changes?). Apparent stability is therefore contingent, not guaranteed.
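The temperature-zero point can be made concrete with a toy sketch. The logits below are hypothetical values chosen for illustration, not taken from any model or from the cited note; the sketch only shows that greedy decoding can repeat one answer every time while that answer holds less than half of the probability mass.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given temperature (> 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Temperature-zero decoding: always return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical logits for three candidate answers (illustrative assumption).
logits = [2.0, 1.8, 1.5]

# Repeating greedy decoding 100 times yields a single distinct answer.
picks = {greedy_pick(logits) for _ in range(100)}

# The underlying distribution at temperature 1 is far less decisive.
probs = softmax(logits, temperature=1.0)
top_mass = probs[greedy_pick(logits)]

print("distinct greedy answers:", picks)
print("probability mass on that answer:", round(top_mass, 2))  # about 0.41
```

Perfect repeatability here says nothing about how strongly the model prefers the answer it keeps giving, which is the gap between consistency and reliability described above.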
Training is where surface regularities get actively amplified into fragility. RL post-training collapses the diversity of formats a pretrained model carries, locking onto a single dominant pretraining distribution within the first epoch, and the winner depends on model scale rather than necessarily on which format performs best (Does RL training collapse format diversity in pretrained models?). Push the difficulty too far and it gets worse: training on near-impossible problems teaches degenerate shortcuts such as answer repetition and skipped computation, and those shortcuts contaminate capabilities the model already had, because group-relative normalization treats a rare lucky guess as a high-advantage trajectory worth reinforcing (Do overly hard RLVR samples actually harm model capabilities?). Binary correctness rewards do something similar to calibration: with no penalty for confident wrong answers, the model learns to guess with high confidence, which is exactly the recipe for outputs that look authoritative and break silently (Does binary reward training hurt model calibration?).
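The group-relative normalization effect is easy to see numerically. The sketch below assumes the common formulation in which each sampled answer's reward is centered on the group mean and divided by the group's population standard deviation; the function name, group size, and reward values are illustrative assumptions, and real implementations may use a sample standard deviation or add a small epsilon.

```python
import statistics

def group_relative_advantages(rewards):
    """Center each reward on the group mean and scale by the population std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All samples scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Binary rewards for 8 sampled answers to one prompt (illustrative values).
moderate = [1, 1, 1, 1, 0, 0, 0, 0]         # half the samples succeed
near_impossible = [1, 0, 0, 0, 0, 0, 0, 0]  # a single lucky success

adv_moderate = group_relative_advantages(moderate)[0]
adv_lucky = group_relative_advantages(near_impossible)[0]

print("advantage of a success at 4/8:", round(adv_moderate, 2))  # 1.0
print("advantage of a success at 1/8:", round(adv_lucky, 2))     # about 2.65
```

Under these assumptions the lone success on the near-impossible prompt receives more than twice the advantage of a success on the moderate prompt, regardless of whether it came from sound reasoning or a guess.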
Here is the thing you might not expect: leaning on surface cues isn't always the failure. The corpus draws a clean line between shortcut learning and what it calls heuristic override. Removing spurious cues actually *hurts* performance on override tasks, the opposite of what shortcut theory predicts, because the real task is composing conflicting signals rather than filtering out distractors (Why does removing spurious cues sometimes hurt model performance?). So "surface statistics" is two different stories depending on whether the right answer needs the model to ignore a cue or integrate it, and conflating them is how you misdiagnose robustness problems.
If there's a constructive thread, it's that robustness can be engineered back in rather than hoped for. Adding a Brier-score term provably recovers calibration with no accuracy trade-off (Does binary reward training hurt model calibration?), and extreme task decomposition with per-step voting reaches million-step error-free execution using small non-reasoning models (Can extreme task decomposition enable reliable execution at million-step scale?). That result inverts the assumption that hard, brittle problems demand bigger models, by making the structure around the model carry the reliability instead.
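Both remedies can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions, not the cited methods themselves: the reward form `correct - (confidence - correct)^2` is one plausible way to add a Brier term, and the voting model assumes simple majority voting with independent errors, whereas the source describes per-step voting plus flagging of correlated errors. All function names and numbers are hypothetical.

```python
import math

def expected_reward(confidence, p_correct, brier_weight=1.0):
    """Expected reward when the answer is right with probability p_correct.

    Assumed form: reward = correct - brier_weight * (confidence - correct)^2
    """
    if_right = 1.0 - brier_weight * (confidence - 1.0) ** 2
    if_wrong = 0.0 - brier_weight * (confidence - 0.0) ** 2
    return p_correct * if_right + (1.0 - p_correct) * if_wrong

def best_confidence(p_correct, brier_weight=1.0):
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / 100 for i in range(101)]
    return max(grid, key=lambda q: expected_reward(q, p_correct, brier_weight))

def majority_error(step_error, votes):
    """Chance that most of `votes` independent samples are wrong (odd votes)."""
    need = votes // 2 + 1
    return sum(
        math.comb(votes, k) * step_error**k * (1 - step_error) ** (votes - k)
        for k in range(need, votes + 1)
    )

def chain_success(step_error, steps):
    """Chance that every one of `steps` independent steps succeeds."""
    return (1.0 - step_error) ** steps

# Binary reward only: stated confidence has no effect on expected reward.
binary_bluff = expected_reward(0.99, 0.3, brier_weight=0.0)
binary_honest = expected_reward(0.30, 0.3, brier_weight=0.0)

# With the Brier term, the best confidence matches the true success rate.
calibrated = best_confidence(0.3)

# One million steps at a 1% per-step error rate, without and with 11 votes.
unvoted = chain_success(0.01, 1_000_000)
voted = chain_success(majority_error(0.01, 11), 1_000_000)

print(binary_bluff == binary_honest, calibrated)
print(unvoted, voted)
```

With the binary reward alone, bluffing at 0.99 confidence costs nothing; once the Brier term is added, the expected reward peaks where stated confidence equals the actual chance of being right. In the voting half, a 1% per-step error rate makes an error-free million-step run essentially impossible, while majority voting over 11 independent samples per step pushes the chance of a clean run above 99.9% in this idealized model.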
Sources (9 notes)
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Lipschitz continuity analysis proves that while additional reasoning steps reduce perturbation propagation, a non-zero robustness floor exists structurally. Sensitivity decreases with stronger embedding and hidden state norms but never reaches zero.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.
MAKER solves million-step tasks with zero errors by decomposing into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.