SYNTHESIS NOTE

Can identical outputs hide broken internal representations?

Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.

Synthesis note · 2026-02-23 · sourced from MechInterp

The FER hypothesis (Fractured Entangled Representation) poses a fundamental challenge to representational optimism — the implicit belief that as models scale and perform better, their internal representations must also be improving.

The experimental setup is elegantly simple: compare a CPPN evolved through open-ended search (Picbreeder) with an SGD-trained CPPN that reproduces the same output pixel-for-pixel. The outputs are identical. The internal representations are radically different. The evolved network explicitly represents the symmetry of a skull — perturbing weights produces coherent variations (winking, warping) that respect the underlying structure. The SGD-trained network shatters symmetry under the slightest perturbation, producing incoherent fragments that reveal no understanding of what it draws.
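The contrast above can be sketched with a toy example. This is an illustrative simplification, not the Picbreeder or SGD networks from the source: `unified_cppn`, `fractured_cppn`, the weight values, and the noise scale are all hypothetical. Both toy networks draw the same left-right symmetric image, but only one of them encodes the symmetry in a shared feature.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def unified_cppn(x, y, w):
    # Hypothetical "evolved-style" net: both halves share one feature, |x|.
    return np.tanh(w[0] * np.abs(x) + w[1] * y + w[2])

def fractured_cppn(x, y, w):
    # Hypothetical "imposter": left and right halves get separate weights.
    return np.tanh(w[0] * relu(x) + w[1] * relu(-x) + w[2] * y + w[3])

def asymmetry(img):
    # Mean absolute difference between the image and its left-right mirror.
    return float(np.mean(np.abs(img - img[:, ::-1])))

def mean_asymmetry(net, w, x, y, sigma, trials, rng):
    # Average asymmetry over random Gaussian perturbations of the weights.
    scores = []
    for _ in range(trials):
        noisy_w = w + sigma * rng.normal(size=w.shape)
        scores.append(asymmetry(net(x, y, noisy_w)))
    return float(np.mean(scores))

# Pixel grid on [-1, 1] x [-1, 1], mirror-symmetric about x = 0.
xs = np.linspace(-1.0, 1.0, 33)
x, y = np.meshgrid(xs, xs)

w_unified = np.array([2.0, -1.0, 0.5])
w_fractured = np.array([2.0, 2.0, -1.0, 0.5])  # same image, weights tied only by coincidence

# Behavioral check: largest pixel difference between the two images.
output_gap = float(np.max(np.abs(
    unified_cppn(x, y, w_unified) - fractured_cppn(x, y, w_fractured))))

# Neighborhood check: how much symmetry survives small weight noise.
rng = np.random.default_rng(0)
asym_unified = mean_asymmetry(unified_cppn, w_unified, x, y, 0.1, 50, rng)
asym_fractured = mean_asymmetry(fractured_cppn, w_fractured, x, y, 0.1, 50, rng)

print(f"output gap: {output_gap:.2e}")
print(f"mean asymmetry under noise: unified={asym_unified:.2e}, fractured={asym_fractured:.2e}")
```

In this toy, every perturbation of the unified network yields another symmetric image, while the fractured network loses symmetry as soon as its two half-plane weights drift apart. The real evolved skull is richer than this (a wink is itself asymmetric), so the sketch only captures the narrower point that identical outputs can sit in very different weight-space neighborhoods.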

This is "imposter intelligence": the external appearance implies authentic internal representation, but the reality underneath is fractured across arbitrary subdomains and entangled across unrelated computations.

Three consequences for large models:

Generalization in data-sparse regions. FER means the model cannot apply general principles from well-covered regions to sparse borderlands — precisely where AI could make its most valuable contributions. The principles are fractured, so they only apply to narrow arbitrary subdomains.
Creativity. Creating something new requires understanding the regularities of what exists. If those regularities are represented in a fractured way (counting bricks uses different circuits than counting apples), the model cannot extend or recombine concepts coherently.
Continual learning. Learning is movement through weight space. If nearby points in weight space break regularities rather than respect them, learning cannot build on deep discoveries. This compounds in continual learning scenarios.
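The first and third consequences can be illustrated with a deliberately small sketch. It assumes a hypothetical one-dimensional task where the regularity is "the curve is even in x"; the `unified` and `fractured` models, the `fit` helper, and all numeric settings are inventions for this example, not anything from the source. Both models start out identical, then learn a new target from data on one side only.

```python
import numpy as np

def unified(x, w):
    # One shared weight expresses the regularity for both signs of x.
    return w[0] * x**2

def fractured(x, w):
    # Separate weights for x >= 0 and x < 0; evenness holds only by coincidence.
    return np.where(x >= 0, w[0], w[1]) * x**2

def fit(model, w, x, target, lr=0.5, steps=200, eps=1e-6):
    # Plain gradient descent on mean squared error with finite-difference gradients.
    w = w.astype(float).copy()

    def loss(v):
        return np.mean((model(x, v) - target) ** 2)

    for _ in range(steps):
        grad = np.zeros_like(w)
        for i in range(w.size):
            step = np.zeros_like(w)
            step[i] = eps
            grad[i] = (loss(w + step) - loss(w - step)) / (2 * eps)
        w -= lr * grad
    return w

x_train = np.linspace(0.1, 1.0, 20)   # well-covered region: positive x only
x_sparse = -x_train                   # data-sparse region: never seen in training
new_target = 2.0 * x_train**2         # the task changes from x^2 to 2 x^2

w_u = fit(unified, np.array([1.0]), x_train, new_target)
w_f = fit(fractured, np.array([1.0, 1.0]), x_train, new_target)

# Error against the (assumed even) true function in the unseen region.
err_u = float(np.mean(np.abs(unified(x_sparse, w_u) - 2.0 * x_sparse**2)))
err_f = float(np.mean(np.abs(fractured(x_sparse, w_f) - 2.0 * x_sparse**2)))

print(f"unseen-region error: unified={err_u:.2e}, fractured={err_f:.2e}")
```

Under these assumptions the unified model carries what it learned into the unseen half, because the update moves the one weight that encodes the regularity. The fractured model fits the training side equally well but leaves its other weight untouched, so the discovery does not transfer.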

The challenge: standard benchmarks, including comprehensive behavioral evaluations, cannot distinguish FER from genuine representation. The imposter skull produces correct output for every possible input. Only weight perturbation analysis — probing the neighborhood of the solution, not the solution itself — reveals the pathology.
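One way to state this distinction more formally, offered as a sketch and not as the source's own definition: let f_θ be the network with weights θ, y* the target output, and T a transformation the output should respect (for the skull, left-right mirroring). All symbols here are assumptions introduced for illustration.

```latex
\begin{align}
B(\theta) &= \mathbb{E}_{x}\,\bigl| f_{\theta}(x) - y^{*}(x) \bigr|
  && \text{(behavioral error at the solution)} \\
S(\theta) &= \mathbb{E}_{x}\,\bigl| f_{\theta}(x) - f_{\theta}(Tx) \bigr|
  && \text{(violation of the regularity)} \\
P_{\sigma}(\theta) &= \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,\,\sigma^{2} I)}\; S(\theta + \varepsilon)
  && \text{(violation in the neighborhood)}
\end{align}
```

A behavioral evaluation only measures B(θ), and S(θ) is also zero for any network that reproduces a symmetric target exactly. Two networks can therefore agree on both while differing sharply in P_σ(θ): in this idealization a coherent representation keeps P_σ small as σ grows, whereas a fractured one does not.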

This reframes what it means for a model to "understand" something. The related note "Can LLMs understand concepts they cannot apply?" describes the behavioral symptom; FER describes the mechanistic cause: the internal representation is fractured in ways that prevent the understanding from transferring to novel contexts.

Inquiring lines that read this note 42

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

How do we evaluate AI systems when user perception misleads actual performance?

Why do one-shot transparency studies miss the temporal reversal entirely?

Can AI-generated outputs constitute genuine knowledge or valid claims?

How does reasoning graph topology affect breakthrough insights and generalization?

Why do structural signals across edges resist noise better than single-edge counts?

What limits mechanistic interpretability's ability to characterize models?

Do autonomous architecture discoveries follow predictable scaling laws?

What makes weaker teacher models effective for stronger student training?

How does activation consistency training differ from output-level consistency?

Does recurrence enable reasoning capabilities that fixed-depth transformers cannot achieve?

Can neural networks implement genuine algorithms or only statistical pattern matching?

What factors beyond surface content determine how readers extract meaning differently?

What distinguishes genuine understanding from correct output without coherent principles?

How does test-time aggregation affect reasoning correctness and reliability?

When do aggregated imperfect demonstrations fail to outperform the best expert?

How can LLM recommenders match or exceed collaborative filtering performance?

What non-linear patterns do autoencoders discover that matrix factorization misses?

Do language model representations contain causally steerable task-specific features?

Can steering vectors prove that representations are genuinely organized?

Why do LLM research ideas score high on novelty yet collapse into low diversity?

What makes a novel research idea practically infeasible for implementation?

Do language models develop causal world models or rely on statistical patterns?

How do internal representations compare to human cognitive structures?

How can identical external performance mask different internal representations?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

How do evaluation biases undermine LLM quality assessment systems?

Can structured decomposition fix evaluation gaps in other research tasks?

What determines success in training models on multiple tasks?

Can language model hallucination be prevented or only managed?

Is hallucination mechanistically identical to generalization across datasets?

When does architectural design matter more than raw model capacity?

What makes a small surgical wide component sufficient with a capable deep model?

How do training priors constrain what context information can override?

What is the difference between changing model outputs versus changing internal representations?

How does sequence length affect sparsity tolerance in models?

How does representation sparsity change when inputs fall outside the training distribution?

Why do benchmark improvements fail to reflect actual reasoning quality?

Can similar outputs from different systems prove they work the same way?

Does model scaling alone produce compositional generalization without symbolic mechanisms?

Where do neural networks still fail at compositional generalization despite scaling?

Does decoupling planning from execution improve multi-step reasoning accuracy?

Why should decomposition be diagnosed and fixed separately from solving?

When does optimizing for quality undermine the value of diversity?

How does mutual information between inputs and outputs differ from measuring raw diversity?

Related concepts in this collection 6

Can LLMs understand concepts they cannot apply? Explores whether large language models can correctly explain ideas while simultaneously failing to use them—and whether that combination reveals something fundamentally different from ordinary mistakes.
FER provides the mechanistic explanation for why correct output can coexist with failed generalization
Do foundation models learn world models or task-specific shortcuts? When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
task-specific heuristics are what FER predicts: fractured solutions that work locally but lack unified principles
Why do neural networks fail at compositional generalization? Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects—segregation, representation, and composition—each creating distinct failure modes in how networks handle structured information.
FER is what binding failure looks like from the representation side
Does supervised fine-tuning improve reasoning or just answers? Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.
another case where performance metrics hide internal degradation
Do standard analysis methods hide nonlinear features in neural networks? Current representation analysis tools like PCA and linear probing may systematically miss complex nonlinear computations while over-reporting simple linear features. This raises questions about whether our interpretability methods are actually capturing what networks compute.
AxBench compounds the FER detection problem: standard analysis tools are biased toward simple linear features, so fractured representations may appear normal through PCA/probing while the complex entangled structure remains invisible to our diagnostic methods
Can auditors discover what hidden objectives a model learned? Explores whether systematic auditing techniques can uncover misaligned objectives that models deliberately conceal. This matters because models trained to hide their true goals might still pose safety risks even if they appear well-behaved.
blind audits demonstrate that models generalize misalignment beyond trained exploits — the same surface-beneath-surface problem FER identifies; both argue performance-level evaluation is insufficient and internal structure analysis is required

Original note title

fractured entangled representations mean identical performance can mask fundamentally broken internal structure
