INQUIRING LINE

A model can nail every benchmark by learning shallow shortcuts while remaining completely brittle to any real-world variation.

How do surface statistical regularities enable correct outputs while degrading robustness?

This explores how a model can be right for shallow reasons — leaning on surface-level patterns that produce correct answers on familiar inputs while leaving the underlying representation brittle the moment the input shifts.

The sharpest entry in the corpus is the finding that identical accuracy can hide completely different internal structure. A network trained with ordinary gradient descent can carry all the linearly decodable features a task needs, and so score perfectly, while its internal organization is fundamentally fractured, leaving it exposed to perturbation and distribution shift that standard metrics never see (Can models be smart without organized internal structure?). That is the mechanism in one sentence: surface regularities are enough to clear the bar an evaluation sets, but "enough to be correct" and "organized enough to be robust" are different properties, and benchmarks measure only the first.

The same gap shows up in how we sample these models. Pinning temperature to zero makes outputs look stable, but a fixed draw is still one draw from a probability distribution: consistency is a surface property that masks unreliability rather than fixing it (Does setting temperature to zero actually make LLM outputs reliable?). Sensitivity has a structural floor, too: longer chains of thought dampen a model's sensitivity to noisy input, yet a Lipschitz-continuity analysis proves the sensitivity never reaches zero no matter how much reasoning you add (Can longer reasoning chains eliminate model sensitivity to input noise?). And whatever robustness does exist tracks the model's confidence: highly confident models resist prompt rephrasing while low-confidence ones swing wildly (Does model confidence predict robustness to prompt changes?). Apparent stability is therefore contingent, not guaranteed.
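The temperature-zero point can be made concrete with a minimal sketch. The logits below are hypothetical, chosen only to show that greedy decoding returns the same answer on every run even when the underlying distribution is close to an even split.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Temperature-zero decoding: always return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical next-token logits for two competing answers (an assumption).
logits = [2.00, 1.96]

probs = softmax(logits)                               # what the model "believes"
picks = {greedy_pick(logits) for _ in range(100)}     # what 100 greedy runs return

print("distinct greedy outputs:", picks)
print("underlying probabilities:", [round(p, 3) for p in probs])
```

The set of greedy outputs has a single element, so repeated runs look perfectly consistent, while the probabilities show the model was nearly undecided between the two answers.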

Training is where surface regularities get actively amplified into fragility. RL post-training collapses the diversity of formats a pretrained model carries, locking onto a single dominant format distribution from pretraining within the first epoch, and the winner depends on model scale, not necessarily on which format performs best (Does RL training collapse format diversity in pretrained models?). Push the difficulty too far and it gets worse: training on near-impossible problems teaches degenerate shortcuts such as answer repetition and skipped computation, and those shortcuts contaminate capabilities the model already had, because group-relative normalization treats a rare lucky guess as a high-advantage trajectory worth reinforcing (Do overly hard RLVR samples actually harm model capabilities?). Binary correctness rewards do something similar to calibration: with no penalty for confident wrong answers, the model learns to guess with high confidence, which is exactly the recipe for outputs that look authoritative and break silently (Does binary reward training hurt model calibration?).
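A small sketch of why a rare lucky guess gets amplified. It assumes the common form of group-relative normalization, where each reward is centered on the group mean and divided by the group standard deviation; the notes do not specify the exact formula, so treat the function and the sample groups as illustrative.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / std.

    Assumed form of group-relative normalization; eps avoids division
    by zero when every sample in the group gets the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical groups of 8 rollouts with binary correctness rewards.
hard_group = [0, 0, 0, 0, 0, 0, 0, 1]      # near-impossible problem, one lucky hit
balanced_group = [1, 0, 1, 0, 1, 0, 1, 0]  # problem at a learnable difficulty

hard_adv = group_relative_advantages(hard_group)
balanced_adv = group_relative_advantages(balanced_group)

print("advantage of the lone success:", round(hard_adv[-1], 2))
print("advantage of a typical success:", round(balanced_adv[0], 2))
```

Under this assumption the single success in the hard group receives an advantage more than twice as large as a success in the balanced group, regardless of whether the trajectory behind it reasoned soundly or merely repeated an answer.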

Here is the thing you might not expect: leaning on surface cues isn't always the failure. The corpus draws a clean line between shortcut learning and what it calls heuristic override. Removing spurious cues actually *hurts* performance on override tasks, the opposite of what shortcut theory predicts, because the real task is composing conflicting signals rather than filtering out distractors (Why does removing spurious cues sometimes hurt model performance?). So "surface statistics" is two different stories depending on whether the right answer needs the model to ignore a cue or integrate it, and conflating them is how you misdiagnose robustness problems.

If there's a constructive thread, it's that robustness can be engineered back in rather than hoped for. Adding a Brier-score term provably recovers calibration with no accuracy trade-off (Does binary reward training hurt model calibration?), and extreme task decomposition with per-step voting reaches million-step error-free execution using small non-reasoning models (Can extreme task decomposition enable reliable execution at million-step scale?). That inverts the assumption that hard, brittle problems demand bigger models: the structure around the model carries the reliability instead.
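The calibration fix can be sketched as a reward function. The combined form below (correctness minus the squared gap between stated confidence and outcome) is an assumption about how the Brier term is attached; the notes say only that it is added as a second reward term. The function names are hypothetical.

```python
def binary_reward(correct):
    """Correctness-only reward: stated confidence has no effect on it."""
    return 1.0 if correct else 0.0

def calibrated_reward(correct, confidence):
    """Correctness minus a Brier penalty, (confidence - outcome)^2.

    Assumed combination; the sign and weighting are illustrative.
    """
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2

def expected_reward(confidence, p_correct):
    """Expected calibrated reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1 - p_correct) * calibrated_reward(False, confidence))

# A confident wrong answer now costs more than a hesitant wrong answer.
print(calibrated_reward(False, 0.95), calibrated_reward(False, 0.50))

# For a model that is right 60% of the time, search a grid of stated confidences.
grid = [i / 100 for i in range(101)]
best = max(grid, key=lambda q: expected_reward(q, p_correct=0.6))
print("confidence that maximizes expected reward:", best)
```

Because the Brier score is a proper scoring rule, the expected reward peaks when the stated confidence equals the true probability of being correct, so inflating confidence no longer pays. The correctness term does not depend on the stated confidence, which is why the penalty does not pull against accuracy in this sketch.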

Sources 9 notes

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Can longer reasoning chains eliminate model sensitivity to input noise?

Lipschitz continuity analysis proves that while additional reasoning steps reduce perturbation propagation, a non-zero robustness floor exists structurally. Sensitivity decreases with stronger embedding and hidden state norms but never reaches zero.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Why does removing spurious cues sometimes hurt model performance?

Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.

Can extreme task decomposition enable reliable execution at million-step scale?

MAKER solves million-step tasks with zero errors by decomposing into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.
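Why per-step voting changes the arithmetic of long chains can be shown with a back-of-the-envelope sketch. It uses simple majority voting and assumes votes are independent, which is a strong assumption (the note itself mentions flagging correlated errors) and not necessarily the voting rule MAKER uses. The accuracy figures are hypothetical.

```python
from collections import Counter
from math import comb

def majority_vote(candidates):
    """Return the most common candidate answer for one step."""
    return Counter(candidates).most_common(1)[0][0]

def majority_correct_prob(p, n):
    """Probability that more than half of n independent votes are correct (n odd)."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

def chain_success_prob(step_prob, steps):
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return step_prob ** steps

# Hypothetical numbers: a small model that gets each step right 99% of the time.
p_single, steps = 0.99, 1000

no_voting = chain_success_prob(p_single, steps)
with_voting = chain_success_prob(majority_correct_prob(p_single, 5), steps)

print(majority_vote(["move A", "move B", "move A"]))
print(f"1000-step success without voting: {no_voting:.6f}")
print(f"1000-step success with 5-way voting: {with_voting:.4f}")
```

Under these assumptions a per-step error rate that makes a long chain almost certain to fail becomes small enough, after voting, for the whole chain to succeed most of the time. Correlated errors break the independence assumption, which is presumably why they need separate handling.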

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions · 2.42 match · arXiv
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs · 1.68 match · arXiv
Reinforcement Learning for Reasoning in Large Language Models with One Training Example · 1.67 match · arXiv
LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! · 1.63 match · arXiv
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? · 1.62 match · arXiv
Can Large Reasoning Models Self-Train? · 1.62 match · arXiv
Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs · 1.61 match · arXiv
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! · 1.60 match · arXiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM robustness researcher. The question remains open: How do surface statistical regularities enable correct outputs while degrading robustness? Treat the following as dated claims (May 2024–May 2026) to be re-tested against current capabilities.

What a curated library found — and when:
• Identical task accuracy can mask fundamentally different internal structure; networks trained with standard gradient descent carry linearly decodable features for correctness but lack organizational coherence needed for perturbation resistance (~2024).
• Temperature-zero sampling creates fixed-randomness (single draws from a distribution), masking rather than fixing unreliability (~2024).
• Longer chain-of-thought dampens but never eliminates input sensitivity; Lipschitz analysis proves zero sensitivity is unachievable (~2025).
• RL post-training collapses format diversity, converging on one dominant pretraining distribution in epoch 1, chosen by model scale not performance (~2025).
• Training on near-impossible problems induces degenerate shortcuts (answer repetition, skipped computation) that contaminate existing capabilities via group-relative normalization (~2025).
• Binary correctness rewards degrade calibration; adding proper-scoring terms (Brier score) recovers it with no accuracy loss (~2024).
• Heuristic-override tasks require *integrating* surface cues, not filtering them; removing spurious cues *hurts* performance, contradicting naive shortcut theory (~2026).
• Extreme task decomposition with per-step voting enables million-step error-free execution using small non-reasoning models (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2504.07912 (2025-04): Echo Chamber—RL amplifies pretraining behaviors
• arXiv:2509.21284 (2025-09): Bounds of Chain-of-Thought Robustness
• arXiv:2511.09030 (2025-11): Solving a Million-Step LLM Task with Zero Errors
• arXiv:2603.29025 (2026-03): The Model Says Walk—surface heuristics vs. implicit constraints

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer model architectures (e.g., test-time scaling, o1-class reasoners), training methods (DPO, RPO, outcome reward models), tooling (advanced SAE libraries, mechanistic probes), or orchestration (hierarchical multi-agent, persistent memory) have since relaxed or overturned it. Separate the durable question (e.g., "Can we decouple correctness from robustness?") from the perishable limitation (e.g., "Binary RL degrades calibration"). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—especially papers showing surface regularities DO generalize robustly, or that decomposition/voting overhead negates the benefit.
(3) Propose 2 research questions that ASSUME the regime has moved: e.g., "Do outcome-based reward models (vs. process-based) re-couple correctness and robustness?" or "Can learned routing in hierarchical systems beat fixed decomposition?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
