Do transformers actually learn systematic compositional reasoning?
Explores whether transformers solve compositional tasks through genuine systematic reasoning or by pattern-matching against training data. This matters because it determines whether scaling alone can achieve robust generalization.
"Faith and Fate: Limits of Transformers on Compositionality" (Dziri et al., 2023) provides the clearest empirical decomposition of how transformers actually handle compositional tasks — and why they fail.
The test bed is three representative tasks: multi-digit multiplication, logic grid puzzles (Einstein's puzzle), and a classic dynamic programming problem. Each is formulated as a computation graph with measurable complexity. The results are devastating for systematic reasoning claims: training on task-specific data leads to near-perfect performance on in-distribution instances at low compositional complexity, but "fails drastically on instances outside of this region."
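To make "formulated as a computation graph with measurable complexity" concrete, here is a minimal sketch that unrolls schoolbook multiplication into single-step nodes and counts them. The `Node` class, the two operation names, and the use of node count as the complexity measure are assumptions made for illustration; they are not the paper's exact graph construction or metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    op: str        # "mul1": one-digit product; "shift_add": accumulate a partial product
    inputs: tuple  # operands of this single step
    value: int     # result of this single step


def multiplication_graph(a: int, b: int) -> list[Node]:
    """Unroll schoolbook a * b into single-step nodes (a simplified, hypothetical decomposition)."""
    nodes = []
    total = 0
    for i, db in enumerate(reversed(str(b))):      # digits of b, least significant first
        partial = 0
        for j, da in enumerate(reversed(str(a))):  # digits of a, least significant first
            prod = int(da) * int(db)
            nodes.append(Node("mul1", (int(da), int(db)), prod))
            partial += prod * 10 ** j
        total += partial * 10 ** i
        nodes.append(Node("shift_add", (partial, i), total))
    return nodes


graph = multiplication_graph(12, 34)
print(graph[-1].value, len(graph))  # final product and number of single-step nodes
```

In this decomposition the node count grows with the product of the digit lengths, so a 3-digit by 3-digit problem already needs twice as many steps as a 2-digit by 2-digit one.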
The mechanism: transformers solve compositional tasks by reducing multi-step reasoning into linearized path matching. When a test problem's computation subgraph was seen during training (or closely resembles one), the model succeeds. When the composition is novel, requiring the model to apply computational rules to unseen combinations, it fails. This is shortcut learning, which "may yield fast correct answers when similar compositional patterns are available during training but does not allow for robust generalization to uncommon or complex examples."
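The dependence on training exposure can be illustrated with a toy coverage measure: the fraction of a test problem's single-step operations that also appear somewhere in the training problems. The functions `one_digit_steps` and `coverage` are hypothetical names, and treating one-digit products as the only "subgraphs" is a deliberate simplification; this is a rough proxy for the idea, not the paper's analysis.

```python
def one_digit_steps(a: int, b: int) -> set[tuple[int, int]]:
    """Single-step subgraphs of a * b: every (digit of a, digit of b) product."""
    return {(int(x), int(y)) for x in str(a) for y in str(b)}


def coverage(problem: tuple[int, int], train_problems: list[tuple[int, int]]) -> float:
    """Fraction of the problem's single steps that occur in at least one training problem."""
    seen = set()
    for a, b in train_problems:
        seen |= one_digit_steps(a, b)
    needed = one_digit_steps(*problem)
    return len(needed & seen) / len(needed)


train = [(12, 34), (21, 43)]
for test_problem in [(12, 43), (15, 34), (56, 78)]:
    print(test_problem, coverage(test_problem, train))
```

A pure pattern-matcher would be expected to do well on the first test problem, whose steps are all covered, and poorly on the last, whose steps never occurred in training, even though the multiplication rule is identical in every case.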
The error analysis is particularly revealing. While models can memorize single-step operations, they fail to compose them into correct reasoning paths. The failure is not random — it is systematic, suggesting "predictions based on shallow, rote learning rather than a deep, holistic task understanding." Error propagation makes this worse: errors in early stages compound in subsequent steps, creating an inherent ceiling on complex compositional tasks.
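The compounding effect can be sketched numerically. Assuming, purely for illustration, that each reasoning step is correct with the same probability and that errors are independent, the chance of getting an entire chain right decays exponentially with its length. The function name `chain_accuracy` and the 0.99 per-step figure are hypothetical.

```python
def chain_accuracy(step_accuracy: float, n_steps: int) -> float:
    """Probability that all n_steps are correct, assuming independent per-step errors."""
    return step_accuracy ** n_steps


# Even a model that is 99% accurate per step degrades quickly on long chains.
for n in (1, 10, 50, 100):
    print(n, round(chain_accuracy(0.99, n), 3))
```

Under these assumptions, 99% per-step accuracy leaves only about a 37% chance of a fully correct 100-step chain, which is one way to read the "inherent ceiling" on complex compositional tasks.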
This provides the task-specific mechanism for what the note "Do foundation models learn world models or task-specific shortcuts?" describes at a higher level. The heuristic is linearized subgraph matching, and it works well enough within the training distribution to create the illusion of systematic reasoning. To the question raised in "Can neural networks learn compositional skills without symbolic mechanisms?", the Faith and Fate finding adds the critical qualifier: scaling helps only insofar as it increases training coverage of computation subgraphs. Novel compositions remain unsolved.
The implication for chain-of-thought, read alongside "Does logical validity actually drive chain-of-thought gains?": CoT may work not because it enables systematic reasoning but because it decomposes problems into subgraphs the model has already seen. On this reading, CoT is subgraph decomposition rather than logical inference.
Inquiring lines that use this note as a source (69)
This note is a source for these synthesized inquiries.
- What is selective resonance and why do transformers not perform it?
- How do transformers perform multi-hop reasoning across distant training documents?
- Do transformers learn generalizable algorithms or instance-based patterns?
- Can neural networks represent symbolic structures without explicit mechanisms?
- Does architectural discovery follow an empirical scaling law like neural networks?
- Why do text-to-image models fail at composing multiple concepts together?
- What makes linear decodability a reliable signal of compositionality?
- Does scaling model size solve compositional generalization problems?
- How does error propagation limit transformer performance on complex tasks?
- Can symbolic mechanisms improve transformer compositional abilities?
- Can explicit stack mechanisms extend what formal languages transformers can learn?
- How does circuit complexity limit which grammatical structures transformers can acquire?
- Does compositional generalization emerge suddenly or improve smoothly with scale?
- Why do energy-based models generalize better on out-of-distribution data than standard transformers?
- Can neural networks implement genuine algorithms or only statistical pattern matching?
- Why do task-specific heuristics fail at generalizing to sparse data regions?
- Can fractured representations explain why models fail at systematic generalization?
- Why does comparison reasoning generalize better than composition reasoning?
- Can long-context readers handle compositional tasks or just semantic search?
- Why do standard transformers fail on problems requiring serial algorithmic reasoning?
- Does scaling data automatically produce compositional reasoning or just better feature encoding?
- What test distinguishes genuine compositionality from fractured feature presence?
- Why do transformer attention patterns show positional and sequential bias across tasks?
- Could graph neural networks fundamentally outperform transformers on structured reasoning?
- What hidden computations happen inside transformer layers during reasoning?
- Can neural networks learn that A implies B in reverse?
- Can long-context models handle compositional reasoning requiring structured logic?
- Why does input embedding magnitude affect perturbation sensitivity in transformers?
- How much does training composition affect syntactic versus reasoning performance?
- What formal language complexity level matches transformer computational limits best?
- How does explicit stack tracking solve the composition sub-problem in binding?
- Why do standard transformers fail to encode recursive structure in their hidden states?
- What makes recursive structure different from other forms of compositional generalization?
- Do substitute networks converge differently than complement networks?
- Can transformers reason beyond fixed architectural depth limits?
- How do transformers generate harder solutions when mostly trained on easier problems?
- Can bounded-depth transformers solve inherently sequential problems?
- Can scaling alone create compositional generalization without explicit binding mechanisms?
- How do neural networks decompose complex tasks into modular subnetworks?
- Can granular function calling tasks learn composition from graph-sampled data?
- Does training on granular tasks beat training on the full function calling problem?
- What explains the contextual variability of knowledge in transformers?
- How do gradients flowing through both branches simultaneously reshape each component's role?
- What limits the effectiveness of formal language pretraining on transformer architectures?
- Why is a combinatorial framework better than family resemblance classification?
- What makes data augmentation an implicit form of contraction learning?
- What makes modernized N-gram embeddings composable with transformer architectures?
- Can learned verifiers over token similarity replace dense compositional training?
- Does grokking in modular arithmetic follow the same three-phase learning trajectory?
- What role does query-level exposure play in enabling compositional generalization?
- What data properties enable transformers to learn sequential decision-making in context?
- How do transformers stitch together learned behaviors when adapting to new tasks?
- Why does scaling data and model size improve compositional generalization?
- How do neural networks decompose tasks into modular subnetworks that transfer?
- Does sparsity enforce compositional structure or merely amplify existing modularity?
- Why does gradient descent discover compositional structure without explicit pressure?
- Why do long-context language models struggle with compositional reasoning tasks?
- Can energy-based transformers achieve deep reasoning without supervision?
- Can categorical correctness signals stop dense optimizers from finding loopholes?
- Do transformer architectures structurally bias models toward short-term optimization?
- What architectural alternatives can capture compositional structure beyond pooled cosine?
- Can representation analysis methods detect complex features models compute with?
- What makes recurrent depth enable compositional generalization across tasks?
- Why does looping computation outperform adding more transformer layers?
- Can recurrent transformers learn genuinely new computations beyond inference stages?
- How does scaling and training data enable compositional behavior without symbolic mechanisms?
- Where do neural networks still fail at compositional generalization despite scaling?
- Why does reapplying the same transformer block work better than computing new layers?
- Can looping enable reasoning capabilities that fixed-depth transformers fundamentally cannot achieve?
Related concepts in this collection (4)
- Do foundation models learn world models or task-specific shortcuts?
  When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
  subgraph matching is the specific heuristic for compositional tasks
- Can neural networks learn compositional skills without symbolic mechanisms?
  Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.
  scaling helps by covering more subgraphs, not by creating systematic reasoning
- Why do neural networks fail at compositional generalization?
  Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects (segregation, representation, and composition), each creating distinct failure modes in how networks handle structured information.
  theoretical explanation for why linearized matching fails on novel compositions
- Does logical validity actually drive chain-of-thought gains?
  What if invalid reasoning in CoT exemplars still improves performance? Testing whether logical correctness or structural format is the real driver of CoT's effectiveness.
  CoT may succeed via subgraph decomposition, not logical validity
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Faith and Fate: Limits of Transformers on Compositionality
- Scaling can lead to compositional generalization
- Compositional Reasoning with Transformers, RNNs, and Chain of Thought
- Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
- Break It Down: Evidence for Structural Compositionality in Neural Networks
- Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
- Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability
- How do Transformers Learn Implicit Reasoning?
Original note title
compositional reasoning in transformers reduces to linearized subgraph matching — success depends on training exposure to similar computation subgraphs not systematic problem-solving