How should researchers evaluate whether correct model outputs reflect real structural learning?
This explores how to tell whether a model that produces correct answers has actually learned the underlying structure of a task, or is just hitting the right answer for the wrong reasons. The corpus is unusually pointed on this: its recurring message is that a correct output is weak evidence of real learning, because standard metrics cannot distinguish genuine reasoning from a heuristic that mimics it. The single sharpest demonstration is that models given logically *invalid* chain-of-thought exemplars perform nearly as well as those given valid ones: the form of reasoning carries the gain, not the inference itself (Does logical validity actually drive chain-of-thought gains?). The same lesson shows up in instruction tuning, where semantically empty or deliberately wrong instructions yield essentially the same accuracy as correct ones, suggesting that what transfers is knowledge of the output format, not task understanding (Does instruction tuning teach task understanding or output format?).
So the practical evaluation move is to design tests that *rule out* the surface route rather than reward the correct answer. In grammar, models can pass benchmarks by leaning on sentence length, word choice, and orthography instead of grammatical rules, and the corpus is explicit that standard benchmarks cannot separate the two unless the test is built to break the heuristic (Can models pass tests while missing the actual grammar?). A clean version of this stress test is structural complexity: real grammatical competence should hold as you add recursion and embedding, but LLM accuracy degrades predictably as syntactic depth grows, exposing surface heuristics that only worked on simple inputs (Does LLM grammatical performance decline with structural complexity?). The graph-learning case gives you an even more direct probe: shuffle the structure and see if it matters. When randomly scrambling a graph's topology barely changes performance, the model was recognizing 'graph' as a category, not using the connections (Can language models actually use graph structure information?).
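The shuffle probe can be sketched as a small harness. This is a minimal illustration, not the method used in the cited note: the reachability task, the example format, and the names `shuffle_topology`, `ablation_gap`, and `make_examples` are all hypothetical, and `predict` stands in for whatever model is being evaluated.

```python
import random
from typing import Callable, List, Tuple

Edge = Tuple[int, int]
# Hypothetical example format: (n_nodes, edges, src, dst, label),
# where label says whether src and dst are connected.
Example = Tuple[int, List[Edge], int, int, bool]
Predictor = Callable[[int, List[Edge], int, int], bool]


def shuffle_topology(n_nodes: int, edges: List[Edge], rng: random.Random) -> List[Edge]:
    """Keep the node set and edge count, but wire each edge between random nodes."""
    return [tuple(rng.sample(range(n_nodes), 2)) for _ in edges]


def reachable(edges: List[Edge], src: int, dst: int) -> bool:
    """Undirected reachability by depth-first search (a structure-using predictor)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, stack = {src}, [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def accuracy(predict: Predictor, examples: List[Example]) -> float:
    hits = sum(predict(n, e, s, d) == y for n, e, s, d, y in examples)
    return hits / len(examples)


def ablation_gap(predict: Predictor, examples: List[Example], seed: int = 0) -> float:
    """Accuracy on intact graphs minus accuracy on scrambled graphs (labels unchanged)."""
    rng = random.Random(seed)
    scrambled = [(n, shuffle_topology(n, e, rng), s, d, y) for n, e, s, d, y in examples]
    return accuracy(predict, examples) - accuracy(predict, scrambled)


def make_examples(n_examples: int, n_nodes: int = 8) -> List[Example]:
    """Toy data: even items are one chain (connected), odd items are two chains."""
    examples = []
    for i in range(n_examples):
        connected = i % 2 == 0
        cut = n_nodes // 2 - 1  # edge removed to split the chain in two
        edges = [(j, j + 1) for j in range(n_nodes - 1) if connected or j != cut]
        examples.append((n_nodes, edges, 0, n_nodes - 1, connected))
    return examples
```

A predictor that ignores the edges, such as `lambda n, e, s, d: True`, gets a gap of exactly zero, which is the signature the paragraph describes; a predictor that actually traverses the edges, such as `lambda n, e, s, d: reachable(e, s, d)`, loses accuracy once the topology is scrambled.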
The deeper warning is that even a model with perfect accuracy and all the right decodable features can have fundamentally broken internal organization: 'fractured' representations that look fine until perturbation or distribution shift hits them (Can models be smart without organized internal structure?). This reframes evaluation away from 'is the answer right' toward 'is the answer right for robust reasons,' which is why probes like out-of-distribution generalization and adversarial perturbation matter more than headline scores. Calibration is an underrated signal here too: binary correctness rewards actively push models toward confident guessing because nothing penalizes confident wrong answers, so measuring whether confidence tracks correctness (e.g. via Brier score) tells you something accuracy alone hides (Does binary reward training hurt model calibration?).
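To make the calibration point concrete, here is a minimal sketch of the Brier score as a measurement, assuming each prediction comes with a stated confidence that the answer is correct. The two confidence lists are made-up illustrative numbers, not results from the cited note.

```python
from typing import Sequence


def accuracy(correct: Sequence[int]) -> float:
    """Fraction of answers that were right (1 = correct, 0 = wrong)."""
    return sum(correct) / len(correct)


def brier_score(confidence: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome. Lower is better."""
    return sum((p - y) ** 2 for p, y in zip(confidence, correct)) / len(correct)


# Hypothetical data: both models get the same three of four answers right.
correct = [1, 1, 1, 0]
overconfident = [0.99, 0.99, 0.99, 0.99]  # equally sure about the wrong answer
hedging = [0.9, 0.9, 0.9, 0.3]            # less sure where it turns out to be wrong

print("accuracy:", accuracy(correct))
print("Brier, overconfident:", brier_score(overconfident, correct))
print("Brier, hedging:", brier_score(hedging, correct))
```

Both models score 0.75 on accuracy, so a headline metric cannot tell them apart. The Brier score can: the confidently wrong answer dominates the overconfident model's score (about 0.245) while the hedging model stays near 0.03.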
There's also a training-side blind spot the corpus flags: how you optimize can manufacture the illusion of learning. Reinforcement learning on verifiable rewards can degrade real capability when problems are too hard, because group-relative normalization treats rare accidental successes as high-value and reinforces shortcuts like answer-repetition and computation-skipping (Do overly hard RLVR samples actually harm model capabilities?). And RL post-training tends to collapse onto a single dominant format from pretraining within the first epoch, with the winning format determined by model scale rather than performance, so an evaluation that only sees the polished output can mistake format convergence for skill acquisition (Does RL training collapse format diversity in pretrained models?). The compatibility of training data matters as well: objectively higher-quality teacher refinements can hurt a student that can't actually absorb them, meaning 'better data, same or worse learning' is a real failure pattern to watch for (Does teacher-refined data always improve student model performance?).
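The arithmetic behind the rare-success effect can be worked through in a few lines. The sketch below assumes a GRPO-style rule (reward minus the group mean, divided by the group standard deviation); the source says only 'group-relative normalization', so the exact formula, the group size of 16, and the `eps` term are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Assumed GRPO-style normalization: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Near-impossible problem: 1 accidental success out of 16 samples.
hard_group = [1.0] + [0.0] * 15
# Moderately hard problem: 8 successes out of 16 samples.
medium_group = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard_group)[0]
medium_adv = group_relative_advantages(medium_group)[0]

print("advantage of the lone success:", round(hard_adv, 3))
print("advantage of a success at 50% solve rate:", round(medium_adv, 3))
```

With binary rewards and solve rate p, a successful sample's advantage under this rule is sqrt((1 - p) / p), so a lone success in 16 samples gets roughly 3.87 times the weight of a success on a problem solved half the time. If that lone success came from a shortcut, the shortcut is what gets amplified.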
Put together, the corpus points to a checklist that's more interesting than 'measure accuracy.' Ablate the supposed mechanism (invalid CoT, scrambled structure, empty instructions) and see if performance survives; if it does, the mechanism wasn't doing the work. Scale task complexity until heuristics break. Inspect representations and calibration, not just outputs. And be suspicious of self-confirming evidence: pure self-improvement loops stall, and the methods that do work smuggle in external anchors, precisely because a model grading its own work can't certify its own structural learning (Can models reliably improve themselves without external feedback?). The thing you didn't know you wanted to know: the most informative experiments are the ones that sabotage the structure a genuinely learning model would depend on. If performance collapses, you've learned the success was structural; if the model sails through your sabotage, the original correctness was probably surface all along.
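The 'scale task complexity until heuristics break' step can be sketched as an accuracy-by-depth curve. The record format, the 0.75 threshold, and the counts below are hypothetical placeholders rather than results from any cited note.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def accuracy_by_depth(records: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Group (depth, was_correct) records and return accuracy per structural depth."""
    totals = defaultdict(lambda: [0, 0])  # depth -> [n_correct, n_total]
    for depth, was_correct in records:
        totals[depth][0] += int(was_correct)
        totals[depth][1] += 1
    return {depth: hits / n for depth, (hits, n) in sorted(totals.items())}


def breaking_depth(curve: Dict[int, float], threshold: float) -> Optional[int]:
    """Shallowest depth at which accuracy drops below the threshold, if any."""
    for depth in sorted(curve):
        if curve[depth] < threshold:
            return depth
    return None


# Hypothetical results: 10 items per embedding depth, with (depth, n_correct) pairs.
records = []
for depth, n_correct in [(1, 10), (2, 9), (3, 6), (4, 3)]:
    records += [(depth, True)] * n_correct + [(depth, False)] * (10 - n_correct)

curve = accuracy_by_depth(records)
print(curve)
print("breaks at depth:", breaking_depth(curve, threshold=0.75))
```

Reporting the whole curve rather than one pooled number is the design choice that matters here: a pooled score over mostly shallow items would hide the collapse at depth 3 and beyond.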
Sources (11 notes)
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions (43% versus a 42.6% random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
BabyLM evaluations showed models can produce correct outputs by relying on sentence length, word choice, and orthography rather than grammatical structure. Standard benchmarks cannot distinguish these two generalization types without tests specifically designed to rule out surface heuristics.
LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.
LLMs develop attention shifts toward node tokens after training, but randomly shuffled topology barely affects performance. Models treat graph data as a category to recognize rather than as structured relationships to use.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.