SYNTHESIS NOTE

Do transformers actually learn systematic compositional reasoning?

Explores whether transformers solve compositional tasks through genuine systematic reasoning or by pattern-matching against training data. This matters because it determines whether scaling alone can achieve robust generalization.

Synthesis note · 2026-03-28 · sourced from Evaluations

"Faith and Fate: Limits of Transformers on Compositionality" (Dziri et al., 2023) provides the clearest empirical decomposition of how transformers actually handle compositional tasks — and why they fail.

The test bed is three representative tasks: multi-digit multiplication, logic grid puzzles (Einstein's puzzle), and a classic dynamic programming problem. Each is formulated as a computation graph with measurable complexity. The results are devastating for systematic reasoning claims: training on task-specific data leads to near-perfect performance on in-distribution instances at low compositional complexity, but "fails drastically on instances outside of this region."
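
To make "formulated as a computation graph with measurable complexity" concrete, here is a minimal sketch for the multiplication task. It assumes a simple long-multiplication layout: one node per input digit, one per digit-pair product, and a chain of additions. The function names (`build_multiplication_graph`, `depth`) and the use of node count and depth as complexity proxies are illustrative assumptions; the paper's exact graph construction and metrics may differ.

```python
def build_multiplication_graph(x, y):
    """Build a toy computation graph for long multiplication of x * y.

    Returns (nodes, parents, root): nodes maps id -> (op, value),
    parents maps id -> list of input node ids, root is the output node id.
    """
    nodes = {}
    parents = {}

    def add_node(op, value, deps=()):
        node_id = len(nodes)
        nodes[node_id] = (op, value)
        parents[node_id] = list(deps)
        return node_id

    # Leaf nodes: one per input digit, least significant digit first.
    x_digits = [add_node("digit", int(d)) for d in reversed(str(x))]
    y_digits = [add_node("digit", int(d)) for d in reversed(str(y))]

    running = None
    for i, xi in enumerate(x_digits):
        for j, yj in enumerate(y_digits):
            # Single-step operation: digit-pair product, shifted by place value.
            value = nodes[xi][1] * nodes[yj][1] * 10 ** (i + j)
            partial = add_node("mul", value, (xi, yj))
            if running is None:
                running = partial
            else:
                # Composition step: fold the partial product into the running sum.
                total = nodes[running][1] + nodes[partial][1]
                running = add_node("add", total, (running, partial))
    return nodes, parents, running


def depth(parents, node_id):
    """Length of the longest path from any leaf to node_id."""
    if not parents[node_id]:
        return 0
    return 1 + max(depth(parents, p) for p in parents[node_id])


nodes, parents, root = build_multiplication_graph(12, 34)
print(nodes[root][1], len(nodes), depth(parents, root))
```

In this layout both node count and depth grow with the number of digits, which is the sense in which larger instances sit at higher compositional complexity.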

The mechanism: transformers solve compositional tasks by reducing multi-step reasoning to linearized path matching. When a test problem's computation subgraph was seen during training (or closely resembles one that was), the model succeeds. When the composition is novel, so that the model must apply computational rules to unseen combinations, it fails. This is shortcut learning, which "may yield fast correct answers when similar compositional patterns are available during training but does not allow for robust generalization to uncommon or complex examples."
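
The contrast between subgraph matching and systematic rule application can be illustrated with a deliberately crude toy. This is not how a transformer is implemented. `SubgraphMatcher` is a hypothetical lookup table that memorizes every sub-expression seen in training, while `evaluate` applies the rules recursively. Expressions are assumed to be nested tuples of the form `(op, left, right)` with `op` either `"add"` or `"mul"`.

```python
def subexpressions(expr):
    """Yield every sub-expression (subgraph) of a nested-tuple expression."""
    if isinstance(expr, tuple):
        yield expr
        for arg in expr[1:]:
            yield from subexpressions(arg)


def evaluate(expr):
    """Systematic evaluator: applies the operation rules recursively."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "add" else a * b


class SubgraphMatcher:
    """Toy 'model' that only answers when it has seen the exact subgraph."""

    def __init__(self):
        self.memory = {}

    def train(self, expressions):
        # Memorize the result of every subgraph in the training data.
        for expr in expressions:
            for sub in subexpressions(expr):
                self.memory[sub] = evaluate(sub)

    def predict(self, expr):
        # Returns None when the subgraph was never seen.
        return self.memory.get(expr)


matcher = SubgraphMatcher()
matcher.train([("add", ("mul", 2, 3), 4)])

seen = ("mul", 2, 3)                 # subgraph present in training
novel = ("add", ("mul", 2, 3), 5)    # known operations, unseen combination
print(matcher.predict(seen), matcher.predict(novel), evaluate(novel))
```

The matcher knows both operations and has seen the inner product, yet it has no answer for the novel combination, whereas the recursive evaluator handles it without any extra data.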

The error analysis is particularly revealing. While models can memorize single-step operations, they fail to compose them into correct reasoning paths. The failure is not random — it is systematic, suggesting "predictions based on shallow, rote learning rather than a deep, holistic task understanding." Error propagation makes this worse: errors in early stages compound in subsequent steps, creating an inherent ceiling on complex compositional tasks.
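
A back-of-the-envelope sketch shows why compounding errors impose a ceiling. It assumes, purely for illustration, that every step is correct independently with the same probability, which is a simplification and not a claim from the paper. The function name `chain_accuracy` is hypothetical.

```python
def chain_accuracy(per_step_accuracy, num_steps):
    """Probability that all steps in a chain are correct.

    Assumes independent steps with identical accuracy (a simplification).
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


for steps in (1, 10, 50):
    print(steps, round(chain_accuracy(0.95, steps), 3))
```

Under these assumptions a model that is 95% accurate per step gets a 10-step chain fully right about 60% of the time and a 50-step chain about 8% of the time, so accuracy falls off geometrically as the chain lengthens.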

This provides the task-specific mechanism for what the note "Do foundation models learn world models or task-specific shortcuts?" describes at a higher level. The heuristic IS linearized subgraph matching, and it works well enough within the training distribution to create the illusion of systematic reasoning. For the question raised in "Can neural networks learn compositional skills without symbolic mechanisms?", the Faith and Fate finding adds the critical qualifier: scaling helps only insofar as it increases training coverage of computation subgraphs. Novel compositions remain unsolved.

The implication for chain-of-thought: in light of "Does logical validity actually drive chain-of-thought gains?", CoT may work not because it enables systematic reasoning but because it decomposes problems into subgraphs the model has already seen. On this reading, CoT is subgraph decomposition rather than logical inference.
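
The subgraph-decomposition reading of CoT can be sketched with a toy lookup solver. This illustrates the hypothesis and is not evidence for it. The `memory` table and the functions `solve_direct` and `solve_stepwise` are hypothetical, and expressions are assumed to be nested `(op, left, right)` tuples.

```python
def solve_direct(expr, memory):
    """Answer in one shot: succeeds only if the whole expression was seen."""
    return memory.get(expr)


def solve_stepwise(expr, memory):
    """Resolve inner steps first, then look up each flattened step.

    Returns None as soon as any single step is missing from memory.
    """
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    a = solve_stepwise(left, memory)
    b = solve_stepwise(right, memory)
    if a is None or b is None:
        return None
    return memory.get((op, a, b))


# Hypothetical memory: only flat, single-step subgraphs were seen in training.
memory = {
    ("mul", 2, 3): 6,
    ("add", 6, 4): 10,
    ("add", 6, 5): 11,
}

novel = ("add", ("mul", 2, 3), 5)
print(solve_direct(novel, memory), solve_stepwise(novel, memory))

uncovered = ("add", ("mul", 2, 3), 7)
print(solve_stepwise(uncovered, memory))
```

Stepwise resolution answers the nested problem only because each intermediate step happens to be in memory; when one step is uncovered it fails just as the direct lookup does, which is the sense in which decomposition helps without adding any systematic reasoning.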

Original note title

compositional reasoning in transformers reduces to linearized subgraph matching — success depends on training exposure to similar computation subgraphs not systematic problem-solving