INQUIRING LINE


Models can nail every test while secretly broken inside — that hidden damage may explain why they fail at genuinely new combinations.

Can fractured representations explain why models fail at systematic generalization?

This explores whether 'fractured representations' — networks that produce correct outputs while their internal structure is broken or entangled — are the real reason models stumble when asked to recombine what they know into novel situations.

The corpus suggests that fractured representations (a model getting answers right while its internal wiring is tangled and brittle) are a strong candidate explanation for systematic generalization failures. They are, however, one of several overlapping explanations, and reading them together is more illuminating than any one alone.

The core claim comes from work showing that two networks with *identical* outputs can have radically different internals. Unlike evolved networks, networks trained with ordinary gradient descent develop fractured, entangled representations that look fine on the test set but shatter under small weight perturbations and fail to transfer to new contexts or recombine creatively (see: Can identical outputs hide broken internal representations?). That is a direct mechanism for generalization failure: if the parts aren't cleanly separable, you can't reassemble them into something new. The deeper theoretical version is the *binding problem*: networks struggle to segregate distinct entities from input, keep their representations separate, and reuse learned structure in novel combinations. This is offered explicitly as *the* explanation for why neural nets fail at compositional generalization (see: Why do neural networks fail at compositional generalization?).
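The perturbation argument above can be made concrete with a toy sketch. This is not the method of the cited work; it is a hand-built illustration, and every name in it (`net_clean`, `net_entangled`, `mean_abs_error`, the specific weights) is an assumption chosen for the example. Two tiny "networks" compute exactly the same function, but one stores it in a single weight while the other stores it as two large weights that cancel. The same relative noise on the weights barely moves the first and wrecks the second.

```python
import random

def net_clean(x, w):
    # One weight carries the whole function: y = w[0] * x
    return w[0] * x

def net_entangled(x, w):
    # Two large weights that nearly cancel: y = (w[0] + w[1]) * x
    return (w[0] + w[1]) * x

# Hypothetical weights: both parameterizations implement y = x exactly.
W_CLEAN = [1.0]
W_ENTANGLED = [1000.0, -999.0]

def perturb(weights, rel_noise, rng):
    # Multiply each weight by (1 + Gaussian noise) to model a small perturbation.
    return [w * (1.0 + rng.gauss(0.0, rel_noise)) for w in weights]

def mean_abs_error(net, weights, rel_noise, trials=2000, seed=0):
    # Average output drift, over random perturbations, on a fixed probe set.
    rng = random.Random(seed)
    xs = [-1.0, -0.5, 0.5, 1.0]
    total = 0.0
    for _ in range(trials):
        noisy = perturb(weights, rel_noise, rng)
        total += sum(abs(net(x, noisy) - net(x, weights)) for x in xs) / len(xs)
    return total / trials

if __name__ == "__main__":
    print("clean drift:    ", mean_abs_error(net_clean, W_CLEAN, 0.01))
    print("entangled drift:", mean_abs_error(net_entangled, W_ENTANGLED, 0.01))
```

Output agreement on a test set cannot distinguish the two parameterizations here; only probing the weights does, which is the sense in which identical outputs can hide brittle internals.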

But here's where the corpus complicates the story. A competing line of evidence says transformers don't fail because their representations are fractured; they fail because they were never doing systematic reasoning in the first place. They succeed in-distribution by memorizing and matching computation subgraphs from training, then collapse on novel compositions (see: Do transformers actually learn systematic compositional reasoning?). A related finding reframes the failure boundary itself: models don't break at some complexity threshold, they break at *instance novelty*. Any reasoning chain works if the model saw similar instances, because it is fitting instance-level patterns rather than learning a generalizable algorithm (see: Do language models fail at reasoning due to complexity or novelty?). So 'fractured representation' and 'pattern-matching instead of reasoning' may be two descriptions of the same underlying gap, seen from the inside (broken structure) versus the outside (novelty-bounded behavior).
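The contrast between matching seen instances and applying a rule can be sketched in a few lines. This is a deliberately extreme toy, not a model of a transformer: the primitives, the held-out pairs, and the `memorizer` lookup are all assumptions made for illustration. The memorizer is perfect on every composition it has seen and fails on every held-out one, regardless of how simple the held-out composition is.

```python
from itertools import product

# Hypothetical primitive operations that can be chained two at a time.
PRIMITIVES = {
    "inc": lambda v: v + 1,
    "dbl": lambda v: v * 2,
    "neg": lambda v: -v,
}

def compose(f_name, g_name, x):
    # Systematic rule: apply f, then g. Works for any pair of primitives.
    return PRIMITIVES[g_name](PRIMITIVES[f_name](x))

ALL_PAIRS = list(product(PRIMITIVES, repeat=2))
HELD_OUT = {("dbl", "neg"), ("neg", "inc")}   # never shown during "training"
SEEN = [p for p in ALL_PAIRS if p not in HELD_OUT]
INPUTS = range(5)

# "Training data": every seen (f, g, x) instance with its correct answer.
TRAIN = {(f, g, x): compose(f, g, x) for (f, g) in SEEN for x in INPUTS}

def memorizer(f_name, g_name, x):
    # Instance-level lookup: returns None for anything it has not seen.
    return TRAIN.get((f_name, g_name, x))

def accuracy(model, pairs):
    cases = [(f, g, x) for (f, g) in pairs for x in INPUTS]
    return sum(model(f, g, x) == compose(f, g, x) for f, g, x in cases) / len(cases)

if __name__ == "__main__":
    print("memorizer, seen pairs:    ", accuracy(memorizer, SEEN))
    print("memorizer, held-out pairs:", accuracy(memorizer, HELD_OUT))
```

The failure boundary in this toy tracks novelty, not difficulty: the held-out compositions are no longer or harder than the seen ones.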

The most interesting twist is that fracturing isn't inevitable, and might even be partly fixable. Pruning experiments show networks *do* sometimes decompose compositional tasks into clean, isolated subnetworks, and that pretraining substantially increases how reliable and modular that structure is (see: Do neural networks naturally learn modular compositional structure?). Scaling can partly overcome the binding problem by letting compositional representations emerge (see: Why do neural networks fail at compositional generalization?). And the way a model *encodes* unfamiliarity matters too: under out-of-distribution shift, models sparsify their activations in a localized way that acts as a stabilizing filter rather than a breakdown (see: Do language models sparsify their activations under difficult tasks?), and this density-vs-sparsity pattern is itself *learned* through how familiar the training data was (see: Is representational sparsity learned or intrinsic to neural networks?). That suggests fracturing is a property of the training regime, not a fixed law of architecture.
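The ablation logic behind the pruning result can be illustrated with a minimal sketch. It assumes a two-unit linear network and hand-picked weights (`MODULAR`, `ENTANGLED`, and the helper names are hypothetical, not taken from the cited experiments). Both networks map the input `(a, b)` to the outputs `(a, b)`, but only in the modular one does removing a hidden unit affect a single output.

```python
def run(x, w_in, w_out, ablate=None):
    # Two-layer linear network; optionally zero one hidden unit (an ablation).
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in w_in]
    if ablate is not None:
        hidden[ablate] = 0.0
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

# Hypothetical weights as (input-to-hidden, hidden-to-output) pairs.
# Modular: hidden unit 0 handles only a, hidden unit 1 handles only b.
MODULAR = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Entangled: hidden units hold a+b and a-b, and the readout unmixes them.
ENTANGLED = ([[1.0, 1.0], [1.0, -1.0]], [[0.5, 0.5], [0.5, -0.5]])

def outputs_changed(net, x, unit):
    # For each output, report whether ablating `unit` changed it.
    base = run(x, *net)
    cut = run(x, *net, ablate=unit)
    return [b != c for b, c in zip(base, cut)]

if __name__ == "__main__":
    x = [3.0, 1.0]
    print("modular:  ", outputs_changed(MODULAR, x, unit=0))
    print("entangled:", outputs_changed(ENTANGLED, x, unit=0))
```

An ablation that damages only its corresponding function is the signature of an isolated subnetwork; damage that spreads across outputs indicates the functions share entangled units.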

The thing you might not have known you wanted to know: whether representations end up fractured or modular seems to be decided largely by *exposure and pretraining* rather than by the architecture alone — which means systematic generalization failure may be less an unfixable flaw of transformers and more a symptom of how, and on what, we train them.

Sources (7 notes)

Can identical outputs hide broken internal representations?

Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.

Why do neural networks fail at compositional generalization?

Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.


Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Break It Down: Evidence for Structural Compositionality in Neural Networks (3.48 match)
Scaling can lead to compositional generalization (3.45 match)
Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis (2.53 match)
From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks (2.51 match)
How do Transformers Learn Implicit Reasoning? (2.46 match)
Open Problems in Mechanistic Interpretability (2.43 match)
Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs (1.79 match)
Faith and Fate: Limits of Transformers on Compositionality (1.75 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher re-testing whether 'fractured representations' remain a viable explanation for systematic generalization failure in LLMs, given what we know now (late 2024 onwards).

The question: Do entangled, non-modular internal representations *cause* or merely *correlate with* compositional generalization failure—and can newer training regimes, scaling, or architectural choices dissolve this constraint?

What a curated library found — and when (findings span 2020–2026; treat as dated claims, not current ground truth):
• Two networks with identical outputs can have radically different internal structure; ordinary gradient descent produces fractured, brittle representations that fail under perturbation and transfer (2025).
• The *binding problem*—failure to segregate entities and reuse structure in novel combinations—is offered as *the* explanation for compositional failure (2020, reinforced 2025).
• Competing evidence: transformers may not fail due to fractured representations, but because they memorize and match computation subgraphs from training rather than learn generalizable algorithms (2023–2025).
• Models break at *instance-level unfamiliarity*, not task-level complexity—suggesting they fit instance patterns, not algorithms (2026).
• Fracturing is not inevitable: pruning experiments reveal clean modular decomposition in some networks, and pretraining makes that structure more reliable; scaling partly overcomes binding problems (2023–2024).
• Under OOD shift, models sparsify activations locally in a learned, stabilizing way, suggesting fracturing is a property of training regime, not architecture (2026).

Anchor papers (verify; mind their dates):
• arXiv:2012.05208 (2020): On the Binding Problem in Artificial Neural Networks
• arXiv:2505.11581 (2025): Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis
• arXiv:2305.18654 (2023): Faith and Fate: Limits of Transformers on Compositionality
• arXiv:2603.03415 (2026): Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above—fractured structure, binding problem, subgraph-matching, instance-level novelty, and learnability of sparsity—determine whether post-2024 models (GPT-4o, Claude 3.5, Llama 3.x, or newer), scaling laws, constitutional AI / preference tuning, chain-of-thought scaffolding, or mechanistic intervention (e.g., sparse autoencoders, steering) have *relaxed* or *overturned* the constraint. Separate the durable question ("Do representations encode compositionality?") from perishable claims ("Gradient descent always produces fractured internals"). State plainly where each constraint still appears to hold.
(2) Surface the strongest *contradicting* or *superseding* work from the last ∼6 months. Does recent SAE research, scaling laws for compositionality, or post-training methods (RL, DPO) change the binding picture? Cite arXiv IDs.
(3) Propose 2 research questions that *assume* the regime may have moved—e.g., "If pretraining density is learnable, can we design corpora that induce compositional structure at scale?" or "Do emergent reasoning capabilities (like recursive search) restructure internal bindings to overcome entanglement?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
