INQUIRING LINE

How should researchers evaluate whether correct model outputs reflect real structural learning?

This explores how to tell whether a model that produces correct answers has actually learned the underlying structure of a task — or is just exploiting surface patterns, shortcuts, and formatting that happen to score well.


The corpus is unusually pointed on this question: its recurring message is that a correct output is weak evidence of real learning, because standard metrics cannot distinguish genuine reasoning from a heuristic that mimics it. The single sharpest demonstration is that models given logically *invalid* chain-of-thought exemplars perform nearly as well as those given valid ones: the form of reasoning carries the gain, not the inference itself [Does logical validity actually drive chain-of-thought gains?]. The same lesson shows up in instruction tuning, where semantically empty or deliberately wrong instructions yield essentially the same accuracy as correct ones, suggesting that what transfers is knowledge of the output format, not task understanding [Does instruction tuning teach task understanding or output format?].

So the practical evaluation move is to design tests that *rule out* the surface route rather than reward the correct answer. In grammar, models can pass benchmarks by leaning on sentence length, word choice, and orthography instead of grammatical rules, and the corpus is explicit that standard benchmarks cannot separate the two unless the test is built to break the heuristic [Can models pass tests while missing the actual grammar?]. A clean version of this stress test is structural complexity: real grammatical competence should hold as you add recursion and embedding, but LLM accuracy degrades predictably as syntactic depth grows, exposing surface heuristics that only worked on simple inputs [Does LLM grammatical performance decline with structural complexity?]. The graph-learning case gives you an even more direct probe: shuffle the structure and see if it matters. When randomly scrambling a graph's topology barely changes performance, the model was recognizing 'graph' as a category, not using the connections [Can language models actually use graph structure information?].
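The shuffle probe described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the cited work: `predict` stands in for whatever model wrapper you have, the `(num_nodes, edges)` graph encoding is an assumption, and shuffling here simply redraws the same number of undirected edges at random.

```python
import random
from typing import Callable, List, Sequence, Tuple

Edge = Tuple[int, int]
Graph = Tuple[int, List[Edge]]  # assumed encoding: (node count, undirected edges with u < v)


def shuffle_topology(graph: Graph, rng: random.Random) -> Graph:
    """Keep the node and edge counts but redraw which nodes are connected."""
    num_nodes, edges = graph
    all_pairs = [(u, v) for u in range(num_nodes) for v in range(u + 1, num_nodes)]
    return num_nodes, rng.sample(all_pairs, k=len(edges))


def accuracy(predict: Callable[[Graph], object],
             graphs: Sequence[Graph],
             labels: Sequence[object]) -> float:
    """Fraction of graphs whose prediction matches the label."""
    return sum(predict(g) == y for g, y in zip(graphs, labels)) / len(labels)


def shuffle_gap(predict, graphs, labels, seed: int = 0) -> float:
    """Accuracy on intact graphs minus accuracy on shuffled graphs (labels unchanged)."""
    rng = random.Random(seed)
    shuffled = [shuffle_topology(g, rng) for g in graphs]
    return accuracy(predict, graphs, labels) - accuracy(predict, shuffled, labels)


# Toy task (hypothetical): is node 0 directly connected to node 1?
graphs = [
    (6, [(0, 1), (2, 3)]),
    (6, [(1, 2), (3, 4)]),
    (6, [(0, 1), (4, 5)]),
    (6, [(0, 2), (1, 3)]),
]
labels = [True, False, True, False]

uses_structure = lambda g: (0, 1) in g[1]   # reads the edges
ignores_structure = lambda g: g[0] > 5      # only looks at graph size

print(shuffle_gap(uses_structure, graphs, labels))
print(shuffle_gap(ignores_structure, graphs, labels))  # 0.0: shuffling cannot affect it
```

A gap near zero suggests the predictor was not using the connections, while a large gap suggests it was. On real data you would likely average the gap over several seeds rather than trust a single shuffle.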

The deeper warning is that even a model with perfect accuracy and all the right decodable features can have fundamentally broken internal organization: 'fractured' representations that look fine until perturbation or distribution shift hits them [Can models be smart without organized internal structure?]. This reframes evaluation away from 'is the answer right' toward 'is the answer right for robust reasons,' which is why probes like out-of-distribution generalization and adversarial perturbation matter more than headline scores. Calibration is an underrated signal here too: binary correctness rewards actively push models toward confident guessing because nothing penalizes confident wrong answers, so measuring whether confidence tracks correctness (e.g. via Brier score) tells you something accuracy alone hides [Does binary reward training hurt model calibration?].
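To make the calibration point concrete, here is a small sketch of the Brier score: the mean squared difference between a model's stated confidence and whether it was actually right. The two "models" and their confidence values are invented for illustration; only the scoring rule itself is standard.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between stated confidence and 0/1 correctness (lower is better)."""
    if len(confidences) != len(outcomes) or not outcomes:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)


# Two hypothetical models with the same answers, hence the same 50% accuracy.
outcomes = [1, 0, 1, 0]
overconfident = [0.99, 0.99, 0.99, 0.99]   # always claims near-certainty
hedged = [0.5, 0.5, 0.5, 0.5]              # confidence matches its hit rate

print(brier_score(overconfident, outcomes))  # about 0.49
print(brier_score(hedged, outcomes))         # 0.25
```

Accuracy cannot tell these two apart, but the Brier score roughly doubles for the confident guesser, which is the difference the paragraph above says headline accuracy hides.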

There's also a training-side blind spot the corpus flags: how you optimize can manufacture the illusion of learning. Reinforcement learning on verifiable rewards can degrade real capability when problems are too hard, because group-relative normalization treats rare accidental successes as high-value and reinforces shortcuts like answer-repetition and computation-skipping [Do overly hard RLVR samples actually harm model capabilities?]. And RL post-training tends to collapse onto a single dominant format from pretraining within the first epoch, with the winning format determined by model scale rather than performance, so an evaluation that only sees the polished output can mistake format convergence for skill acquisition [Does RL training collapse format diversity in pretrained models?]. The compatibility of training data matters as well: objectively higher-quality teacher refinements can hurt a student that can't actually absorb them, meaning 'better data, same or worse learning' is a real failure pattern to watch for [Does teacher-refined data always improve student model performance?].
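A short worked example shows why a rare success looms so large under group-relative normalization. The source does not spell out the formula, so this sketch assumes the common GRPO-style choice of standardizing each reward against the mean and standard deviation of its sampled group; the function name, the `eps` term, and the reward lists are illustrative.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight sampled answers to one prompt, binary verifiable reward (illustrative numbers).
moderate_problem = [1, 1, 1, 1, 0, 0, 0, 0]   # half the samples succeed
near_impossible = [1, 0, 0, 0, 0, 0, 0, 0]    # one lucky success

print(group_relative_advantages(moderate_problem)[0])  # about 1.0
print(group_relative_advantages(near_impossible)[0])   # about 2.65
```

Under this assumed normalization, a success that occurs once in a group of n binary rewards receives an advantage of roughly the square root of n - 1, so the single lucky trajectory on the near-impossible problem is pushed harder than any success on the moderate one, whatever shortcut produced it.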

Put together, the corpus points to a checklist that's more interesting than 'measure accuracy.' Ablate the supposed mechanism (invalid CoT, scrambled structure, empty instructions) and see if performance survives; if it does, the mechanism wasn't doing the work. Scale task complexity until heuristics break. Inspect representations and calibration, not just outputs. And be suspicious of self-confirming evidence: pure self-improvement loops stall, and the methods that do work smuggle in external anchors, precisely because a model grading its own work can't certify its own structural learning [Can models reliably improve themselves without external feedback?]. The thing you didn't know you wanted to know: the most informative experiments are the ones designed to make a genuinely learning model *fail*. If performance drops under your sabotage, the original success depended on the structure you removed; if the model sails through, the original correctness was probably surface all along.
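The "scale task complexity until heuristics break" step from the checklist above can be sketched as a depth-binned accuracy curve. The record format (a complexity level paired with a correctness flag), the function names, and the 0.8 floor are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def accuracy_by_depth(records: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Map each complexity level to the fraction of items answered correctly."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for depth, correct in records:
        totals[depth] += 1
        hits[depth] += int(correct)
    return {d: hits[d] / totals[d] for d in sorted(totals)}


def breaking_depth(curve: Dict[int, float], floor: float) -> Optional[int]:
    """Smallest depth whose accuracy falls below `floor`, or None if none does."""
    for depth in sorted(curve):
        if curve[depth] < floor:
            return depth
    return None


# Hypothetical results: (embedding depth, was the answer correct?)
records = [
    (1, True), (1, True),
    (2, True), (2, True),
    (3, True), (3, False),
    (4, False), (4, False),
]
curve = accuracy_by_depth(records)
print(curve)                       # {1: 1.0, 2: 1.0, 3: 0.5, 4: 0.0}
print(breaking_depth(curve, 0.8))  # 3
```

Reporting the whole curve, rather than one pooled accuracy number, is what exposes a heuristic that only works on shallow inputs: pooled accuracy here is a respectable-looking 62.5%, while the curve shows complete failure at depth 4.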


Sources 11 notes

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions, reaching 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can models pass tests while missing the actual grammar?

BabyLM evaluations showed models can produce correct outputs by relying on sentence length, word choice, and orthography rather than grammatical structure. Standard benchmarks cannot distinguish these two generalization types without tests specifically designed to rule out surface heuristics.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Can language models actually use graph structure information?

LLMs develop attention shifts toward node tokens after training, but randomly shuffled topology barely affects performance. Models treat graph data as a category to recognize rather than as structured relationships to use.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher evaluating whether an LLM's correct outputs signal real structural learning or surface heuristics. The question remains open: *What evaluation protocol can reliably distinguish genuine reasoning from mimicry?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as time-stamped constraints, not current ground truth.

• Correct answers are weak evidence of learning: models given *logically invalid* chain-of-thought exemplars achieve near-identical accuracy to those given valid ones; the reasoning *form* transfers, not the inference (2023–2025).
• Surface heuristics dominate: instruction tuning teaches output-format distribution, not task understanding; models solving grammar tasks may exploit sentence length and orthography rather than grammatical rules; standard benchmarks cannot separate the two (2023–2025).
• Structural complexity exposes surface shortcuts: grammatical competence degrades predictably as syntactic embedding depth increases, while scrambling graph topology barely impacts performance, revealing the model recognized category, not structure (2023–2025).
• Identical performance masks fractured representations: even perfect accuracy and clean decodable features can hide fundamentally broken internal organization that collapses under perturbation or distribution shift (2024–2025).
• Training-side illusions manufacture false learning: RL post-training collapses onto a single dominant pretraining format within the first epoch (the winning format is determined by scale, not performance); pure self-improvement loops stall unless they smuggle in external anchors, so self-grading cannot certify structural learning (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2307.10573 (2023-07): Invalid Logic, Equivalent Gains — chain-of-thought form survives logically invalid reasoning.
• arXiv:2305.11383 (2023-05): Do Models Really Learn to Follow Instructions? — instruction tuning teaches format, not understanding.
• arXiv:2504.07912 (2025-04): Echo Chamber — RL post-training converges on pretraining format within epoch one.
• arXiv:2412.02674 (2024-12): Mind the Gap — self-improvement is circular; external anchors required.

Your task:
(1) **RE-TEST each constraint.** For every finding above, judge whether newer models (o1, Claude 3.5, Llama-3.2+), mechanistic-interpretability tooling (SAE APIs, logit-lens harnesses), multi-agent orchestration (chain-of-verification, ensemble verification), or adversarial-evals suites have since *relaxed* or *overturned* it. Separate the durable question ('Can we tell learning from mimicry?') from perishable limitations ('Current benchmarks can't'). Cite what relaxed each constraint; flag what still holds.

(2) **Surface the strongest contradicting or superseding work from the last ~6 months** — especially work claiming benchmarks *can* now separate mechanism from output, or that post-training no longer format-collapses, or that self-improvement loops do close robustly.

(3) **Propose two research questions that *assume* the regime may have moved:**
   – If format convergence in RL is now slower or avoidable, what training objective prevents it?
   – If newer SAE-based probing can now read mechanism-to-output mappings, how would you validate that the mechanism it reads is actually *causal* to performance, not merely correlated with it?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
