SYNTHESIS NOTE

Do transformers actually learn systematic compositional reasoning?

Explores whether transformers solve compositional tasks through genuine systematic reasoning or by pattern-matching against training data. This matters because it determines whether scaling alone can achieve robust generalization.

Synthesis note · 2026-03-28 · sourced from Evaluations

"Faith and Fate: Limits of Transformers on Compositionality" (Dziri et al., 2023) provides the clearest empirical decomposition of how transformers handle compositional tasks, and of why they fail.

The test bed is three representative tasks: multi-digit multiplication, logic grid puzzles (Einstein's puzzle), and a classic dynamic programming problem. Each is formulated as a computation graph with measurable complexity. The results are devastating for systematic reasoning claims: training on task-specific data leads to near-perfect performance on in-distribution instances at low compositional complexity, but "fails drastically on instances outside of this region."
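To make "formulated as a computation graph" concrete, here is a minimal sketch that builds such a graph for schoolbook multiplication. The `Node` class, the function names, and the two complexity measures (graph depth and operation count) are illustrative assumptions, not the paper's exact formulation or metrics.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    op: str                      # "input", "mul", or "add"
    value: int
    parents: list = field(default_factory=list)


def multiplication_graph(a: int, b: int) -> Node:
    """Build a graph for a * b (non-negative ints) via long multiplication."""
    a_digits = [Node("input", int(d)) for d in reversed(str(a))]
    b_digits = [Node("input", int(d)) for d in reversed(str(b))]
    partials = []
    for i, x in enumerate(a_digits):
        for j, y in enumerate(b_digits):
            # One-digit product, shifted to its place value.
            shifted = x.value * y.value * 10 ** (i + j)
            partials.append(Node("mul", shifted, [x, y]))
    # Sum the partial products one at a time (a linear chain of additions).
    total = partials[0]
    for p in partials[1:]:
        total = Node("add", total.value + p.value, [total, p])
    return total


def depth(node: Node) -> int:
    """Longest path from any input digit to this node."""
    if not node.parents:
        return 0
    return 1 + max(depth(p) for p in node.parents)


def count_ops(node: Node, seen=None) -> int:
    """Number of distinct non-input nodes reachable from this node."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    own = 0 if node.op == "input" else 1
    return own + sum(count_ops(p, seen) for p in node.parents)


root = multiplication_graph(12, 34)
print(root.value, depth(root), count_ops(root))
```

In this toy construction both measures grow with the number of digits, which is the sense in which larger instances count as more compositionally complex.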

The mechanism: transformers solve compositional tasks by reducing multi-step reasoning to linearized path matching. When a test problem's computation subgraph was seen during training (or closely resembles one), the model succeeds. When the composition is novel, requiring the model to apply computational rules to unseen combinations, it fails. This is shortcut learning, which in the paper's words "may yield fast correct answers when similar compositional patterns are available during training but does not allow for robust generalization to uncommon or complex examples."
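The seen-versus-novel distinction can be sketched with a toy coverage measure. This is a hypothetical proxy, not the paper's analysis method: it flattens long multiplication into an ordered list of step strings and treats every contiguous window of `k` steps as a "subgraph". The names `linearize`, `windows`, and `path_coverage` are made up for illustration.

```python
def linearize(a: int, b: int) -> list:
    """Flatten long multiplication of a * b into an ordered list of steps."""
    steps = []
    for i, x in enumerate(reversed(str(a))):
        for j, y in enumerate(reversed(str(b))):
            # Digit product tagged with its place value, e.g. "2*4@0".
            steps.append(f"{x}*{y}@{i + j}")
    return steps


def windows(steps: list, k: int) -> set:
    """All contiguous runs of k steps: a crude stand-in for subgraphs."""
    return {tuple(steps[i:i + k]) for i in range(len(steps) - k + 1)}


def path_coverage(train: list, test: tuple, k: int = 2) -> float:
    """Fraction of the test instance's k-step windows seen in training."""
    seen = set()
    for a, b in train:
        seen |= windows(linearize(a, b), k)
    needed = windows(linearize(*test), k)
    return len(needed & seen) / len(needed) if needed else 1.0


train = [(12, 34)]
print(path_coverage(train, (12, 34), k=2))  # identical instance
print(path_coverage(train, (12, 35), k=1))  # single steps partly seen
print(path_coverage(train, (12, 35), k=2))  # compositions of steps unseen
```

With `k=1` the measure asks only whether single steps were seen; with larger `k` it asks whether their composition was seen. Under the linearized-matching account, it is the second kind of coverage that would track model success.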

The error analysis is particularly revealing. While models can memorize single-step operations, they fail to compose them into correct reasoning paths. The failure is not random — it is systematic, suggesting "predictions based on shallow, rote learning rather than a deep, holistic task understanding." Error propagation makes this worse: errors in early stages compound in subsequent steps, creating an inherent ceiling on complex compositional tasks.
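The compounding effect can be illustrated with a back-of-the-envelope sketch. It assumes, as a simplification not taken from the paper, that every step is correct independently with the same probability, so a chain of `n_steps` steps is fully correct with probability `step_accuracy ** n_steps`. The function name `chain_accuracy` is hypothetical.

```python
def chain_accuracy(step_accuracy: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent steps."""
    return step_accuracy ** n_steps


# Even high per-step accuracy decays quickly as the chain grows.
for n in (1, 10, 50, 100):
    print(n, round(chain_accuracy(0.99, n), 3))
```

Under this assumption, 99% per-step accuracy leaves well under a 40% chance of a fully correct 100-step chain, which gives a sense of why longer compositions hit a ceiling.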

This provides the task-specific mechanism for what the note "Do foundation models learn world models or task-specific shortcuts?" describes at a higher level. The heuristic is linearized subgraph matching, and it works well enough within the training distribution to create the illusion of systematic reasoning. For the question posed in "Can neural networks learn compositional skills without symbolic mechanisms?", the Faith and Fate finding adds the critical qualifier: scaling helps only insofar as it increases training coverage of computation subgraphs. Novel compositions remain unsolved.

The implication for chain-of-thought: in light of "Does logical validity actually drive chain-of-thought gains?", CoT may work not because it enables systematic reasoning but because it decomposes problems into subgraphs the model has already seen. On this reading, CoT is subgraph decomposition rather than logical inference.

Inquiring lines that read this note 81

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

What structural biases does transformer attention create in language model outputs?

Does recurrence enable reasoning capabilities that fixed-depth transformers cannot achieve?

What limits mechanistic interpretability's ability to characterize models?

Do autonomous architecture discoveries follow predictable scaling laws?

Does architectural discovery follow an empirical scaling law like neural networks?

Does model scaling alone produce compositional generalization without symbolic mechanisms?

How does memorization interact with learning and generalization?

Why do energy-based models generalize better on out-of-distribution data than standard transformers?

How does example difficulty affect learning efficiency in language models?

How should retrieval systems optimize for multi-step reasoning during inference?

Can long-context readers handle compositional tasks or just semantic search?

How do training priors constrain what context information can override?

Why do reasoning models fail at systematic problem-solving and search?

How can identical external performance mask different internal representations?

Why does input embedding magnitude affect perturbation sensitivity in transformers?

How do training data properties shape reasoning capability development?

How much does training composition affect syntactic versus reasoning performance?

How does reasoning graph topology affect breakthrough insights and generalization?

Do substitute networks converge differently than complement networks?

What determines success in training models on multiple tasks?

Why do semantic similarity and task relevance diverge in vector embeddings?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Can language model RL training avoid reward hacking and misalignment?

Can categorical correctness signals stop dense optimizers from finding loopholes?

Does decoupling planning from execution improve multi-step reasoning accuracy?

When does backward decomposition fail on open-ended or unstructured tasks?

Related concepts in this collection 4


Do foundation models learn world models or task-specific shortcuts? When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
subgraph matching is the specific heuristic for compositional tasks
Can neural networks learn compositional skills without symbolic mechanisms? Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.
scaling helps by covering more subgraphs, not by creating systematic reasoning
Why do neural networks fail at compositional generalization? Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects—segregation, representation, and composition—each creating distinct failure modes in how networks handle structured information.
theoretical explanation for why linearized matching fails on novel compositions
Does logical validity actually drive chain-of-thought gains? What if invalid reasoning in CoT exemplars still improves performance? Testing whether logical correctness or structural format is the real driver of CoT's effectiveness.
CoT may succeed via subgraph decomposition, not logical validity
