INQUIRING LINE

Why do internal representations differ when external performance matches?

This explores how two models can score the same on benchmarks while organizing knowledge completely differently inside — and why that internal divergence matters even when the scoreboard doesn't show it.


The sharpest version of the puzzle comes from the "fractured entangled representation" work: networks trained with stochastic gradient descent (SGD) can reproduce the target outputs perfectly, yet their internal representations are tangled and brittle compared to networks evolved toward the same behavior. Perturb a weight and the fracture shows: the model can't transfer to new contexts or recombine ideas creatively, even though every test it was given came back clean [Can identical outputs hide broken internal representations?]. The unsettling implication is that a system can pass every exam and still 'understand' nothing structurally coherent [Can AI pass every test while understanding nothing?].
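A toy sketch can make the "same outputs, different internals" point concrete. It is not a reproduction of the fractured entangled representation experiments: the two networks, their weights, and the helper names below are all made up for illustration. Both compute |x| exactly on the test inputs, so any output-only benchmark scores them identically, but nudging one weight by the same amount moves their outputs by very different amounts.

```python
def relu(z):
    return z if z > 0 else 0.0

def forward(x, hidden_w, out_w):
    """One-hidden-layer ReLU network with scalar input and no biases."""
    return sum(o * relu(h * x) for h, o in zip(hidden_w, out_w))

# Two hypothetical parameterizations of the same function, |x|.
net_a = {"hidden": [1.0, -1.0], "out": [1.0, 1.0]}
net_b = {"hidden": [8.0, -8.0], "out": [0.125, 0.125]}

inputs = [-2.0, -1.0, 0.0, 0.5, 3.0]

def outputs(net):
    return [forward(x, net["hidden"], net["out"]) for x in inputs]

def max_error_after_perturbation(net, delta=0.5):
    """Nudge the first output weight by delta; return the worst-case output change."""
    perturbed_out = [net["out"][0] + delta] + net["out"][1:]
    before = outputs(net)
    after = [forward(x, net["hidden"], perturbed_out) for x in inputs]
    return max(abs(a - b) for a, b in zip(after, before))

same_scores = outputs(net_a) == outputs(net_b)   # the "benchmark" cannot tell them apart
err_a = max_error_after_perturbation(net_a)
err_b = max_error_after_perturbation(net_b)
print(same_scores, err_a, err_b)
```

The second network is far more sensitive to the same nudge, a difference that only shows up once you look at, or poke, the weights.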

Why does this happen? Partly because benchmarks measure the output, not the path to it, and many internal arrangements can land on the same answer. There's also a trade-off hidden underneath: improving one capability (helpfulness, accuracy) tends to silently degrade others, so identical headline scores can sit on top of very different balances of faithfulness, calibration, and diversity [What really happens inside a language model?]. Training dynamics make this worse: reinforcement learning post-training tends to amplify a single dominant format from pretraining within the first epoch and suppress the alternatives. Which format 'wins' depends on model scale, not necessarily on which format performs best, so two equally-scoring models may have quietly thrown away different parts of their range [Does RL training collapse format diversity in pretrained models?].

The representations also differ because they're shaped by what the model has seen, not by the task you test it on. Networks build dense activations for familiar data and fall back to sparse ones for unfamiliar inputs, a structure learned through exposure during pretraining [Is representational sparsity learned or intrinsic to neural networks?]. That's why apparent zero-shot 'generalization' often turns out to be interpolation: multimodal performance tracks how frequently a concept appeared in pretraining, not genuine new ability, so two models with matching scores may simply have memorized different frequency landscapes [Does multimodal zero-shot performance actually generalize or interpolate?]. Internals carry the fingerprint of the data; the score doesn't.

Not every internal difference is a defect, though, and this is the part worth lingering on. Models develop genuine machinery you'd never see from outputs alone: sparse autoencoders reveal a self-knowledge mechanism that tracks whether the model actually knows a fact about an entity and causally steers it toward refusing or hallucinating [Do models know what they don't know?]. Hidden states also become sparser under hard, out-of-distribution tasks, which acts as an adaptive filter that stabilizes performance rather than breaking it [Do language models sparsify their activations under difficult tasks?]. So 'same output, different internals' cuts both ways: sometimes the divergence is fracture, sometimes it's a smarter internal strategy doing invisible work to hold performance steady.
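The dense-versus-sparse contrast described above can be quantified with a simple proxy. The sketch below is one possible measure, the fraction of near-zero entries in a hidden-state vector; the cited notes do not say which metric they use, so the function name, the threshold, and the example vectors are all assumptions for illustration.

```python
def sparsity(activations, eps=1e-6):
    """Fraction of entries with magnitude at most eps (a proxy, not the cited notes' metric)."""
    if not activations:
        raise ValueError("activations must be non-empty")
    near_zero = sum(1 for a in activations if abs(a) <= eps)
    return near_zero / len(activations)

# Made-up hidden-state vectors, for illustration only.
familiar_input_state = [0.8, -0.3, 0.5, 0.1, -0.9, 0.4, 0.2, -0.6]
unfamiliar_input_state = [0.0, 0.0, 1.7, 0.0, 0.0, 0.0, -2.1, 0.0]

dense_score = sparsity(familiar_input_state)     # no near-zero entries
sparse_score = sparsity(unfamiliar_input_state)  # most entries are zero
print(dense_score, sparse_score)
```

Two models with the same benchmark score could produce very different values here on the same input, which is the kind of signal an output-only evaluation never sees.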

The practical upshot is that the internal layer is the real lever, even when the external layer looks settled. You can intervene directly on frozen representations and outperform weight-based parameter-efficient finetuning such as LoRA while training far fewer parameters, a reported 10–50x gain in parameter efficiency [Can editing hidden representations beat weight updates for finetuning?]. Models can also be trained to internalize their own evaluation in unused sequence space at zero inference cost [Can models learn to evaluate their own work during training?]. If you only watch the scoreboard, all of this is invisible, which is exactly why two systems that look identical from the outside can behave so differently the moment you push them somewhere new.
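To show what "intervening on a frozen representation" can look like, here is a minimal sketch of a low-rank subspace edit of the form h + Rᵀ(Wh + b − Rh), which is the form I understand LoReFT to use; the source note only says it is a low-rank linear subspace intervention, so treat the exact formula as an assumption. The function name, the dimensions, and the hand-picked values of R, W, and b are hypothetical. In a real setup these parameters would be learned while the model's own weights stay frozen.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def loreft_intervene(h, R, W, b):
    """Return h + R^T (W h + b - R h): edit h only inside the subspace spanned by R's rows."""
    target = [wh + bi for wh, bi in zip(matvec(W, h), b)]   # where the subspace should land
    current = matvec(R, h)                                  # where it currently is
    delta = [t - c for t, c in zip(target, current)]
    # R^T delta maps the r-dimensional correction back to the hidden dimension.
    correction = [sum(R[k][j] * delta[k] for k in range(len(R))) for j in range(len(h))]
    return [hj + cj for hj, cj in zip(h, correction)]

d, r = 4, 2                      # hidden size and intervention rank (toy values)
h = [1.0, 2.0, 3.0, 4.0]         # a frozen hidden state
R = [[1.0, 0.0, 0.0, 0.0],       # orthonormal rows: the edited subspace
     [0.0, 1.0, 0.0, 0.0]]
W = [[0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
b = [0.5, -0.5]

edited = loreft_intervene(h, R, W, b)
trainable_params = 2 * r * d + r   # R and W are r x d, b has r entries
print(edited, trainable_params)
```

Only the coordinates inside the chosen subspace move; everything else in the hidden state, and every model weight, is left untouched, which is where the parameter savings come from.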


Sources (10 notes)

Can identical outputs hide broken internal representations?

Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis shows that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.

What really happens inside a language model?

Research into mechanistic interpretability, cognitive models, and training dynamics shows that identical benchmark performance conceals radically different internal structures. Improving one capability (helpfulness, accuracy) reliably degrades others (faithfulness, calibration, diversity).

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Does multimodal zero-shot performance actually generalize or interpolate?

Across 34 models and 5 datasets, multimodal models require exponentially more pretraining data for linear performance gains on downstream tasks. Performance correlates with how often test concepts appeared during pretraining, not genuine generalization ability.

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
