How do transformers learn to reason across multiple steps?
Does multi-hop reasoning in transformers emerge through distinct learning phases, and what geometric patterns in hidden representations explain when reasoning succeeds or fails?
Training transformers from scratch in a controlled symbolic environment reveals that implicit multi-hop reasoning — answering compositional queries without verbalizing intermediate steps — emerges through three distinct developmental stages:
Phase I: Memorization. The model fits training data (atomic facts and 2-hop compositions) quickly. Generalization to unseen queries remains minimal.
Phase II: In-Distribution Generalization. After memorization saturates, the model begins generalizing to unseen ID-ID compositions, i.e. 2-hop queries whose two hops both use in-distribution (ID) triples. This marks a shift from memorization to compositional reasoning within the training distribution. It resembles grokking: generalization emerges well after memorization converges.
Phase III: Cross-Distribution Reasoning. The model learns to compose out-of-distribution (OOD) triples in the first hop with ID triples in the second. This transition is slower than Phase II. Crucially, generalization fails consistently when the second hop comes from OOD triples, revealing a stronger bottleneck in the second relational step.
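The query types named in the three phases can be made concrete with a small data sketch. This is an illustrative construction only, not the setup used in the underlying paper: the function name, the entity and relation counts, and the rule that an OOD triple is an atomic fact withheld from every training composition are all assumptions made for the example.

```python
import random

def build_two_hop_splits(num_entities=50, num_relations=5, ood_fraction=0.2, seed=0):
    """Build atomic facts and label every 2-hop query by hop type.

    All names and sizes are hypothetical. A triple is treated as OOD if it is
    meant to be seen only as an atomic fact, never inside a training composition.
    """
    rng = random.Random(seed)
    # Atomic facts: (head, relation) -> tail, with one random tail per pair.
    facts = {(h, r): rng.randrange(num_entities)
             for h in range(num_entities) for r in range(num_relations)}
    keys = sorted(facts)
    rng.shuffle(keys)
    n_ood = int(len(keys) * ood_fraction)
    ood = set(keys[:n_ood])

    splits = {"ID-ID": [], "OOD-ID": [], "ID-OOD": [], "OOD-OOD": []}
    for (h, r1) in keys:
        bridge = facts[(h, r1)]  # intermediate entity, never verbalized
        for r2 in range(num_relations):
            first = "OOD" if (h, r1) in ood else "ID"
            second = "OOD" if (bridge, r2) in ood else "ID"
            answer = facts[(bridge, r2)]
            splits[f"{first}-{second}"].append(((h, r1, r2), bridge, answer))
    return facts, ood, splits

facts, ood, splits = build_two_hop_splits()
for name, queries in splits.items():
    print(name, len(queries))
```

Under these assumptions, training would use the atomic facts plus a subset of the ID-ID queries. The held-out ID-ID queries probe Phase II, the OOD-ID queries probe Phase III, and the ID-OOD and OOD-OOD queries correspond to the second-hop cases where generalization is reported to fail.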
Two mechanistic findings deepen the picture:
Cosine clustering as signature. Successful reasoning correlates with consistent clustering of intermediate entity representations within cosine similarity space. Models that reason well show intermediate representations that cluster by entity identity across diverse queries. This clustering provides a geometric explanation for when reasoning works and when it fails.
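One way to quantify this kind of clustering is to compare the average cosine similarity between hidden states that share an intermediate entity against the average for those that do not. The sketch below is an assumption about how such a score could be computed, not the paper's metric. The function name is hypothetical, and the synthetic vectors stand in for hidden states that would in practice be extracted from a chosen layer and token position.

```python
import numpy as np

def cosine_clustering_gap(hidden, bridge_ids):
    """Return (within, between, gap) mean cosine similarities.

    hidden: (n_queries, d) array of hidden states (hypothetical extraction point).
    bridge_ids: (n_queries,) intermediate-entity label of each query.
    """
    hidden = np.asarray(hidden, dtype=float)
    bridge_ids = np.asarray(bridge_ids)
    unit = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    sim = unit @ unit.T                                  # pairwise cosine similarity
    same = bridge_ids[:, None] == bridge_ids[None, :]    # pairs sharing an entity
    off_diag = ~np.eye(len(bridge_ids), dtype=bool)      # exclude self-pairs
    within = sim[same & off_diag].mean()
    between = sim[~same].mean()
    return within, between, within - between

# Synthetic stand-in: 4 intermediate entities, 10 queries each, 32-dim states.
rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 32))
ids = np.repeat(np.arange(4), 10)
states = centers[ids] + 0.1 * rng.normal(size=(40, 32))
within, between, gap = cosine_clustering_gap(states, ids)
print(f"within={within:.3f} between={between:.3f} gap={gap:.3f}")
```

A large positive gap would indicate that representations cluster by intermediate entity across different queries. A gap near zero would indicate no such clustering.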
Query-level exposure is required. Second-hop generalization fails unless the model encounters the exact compositional structure during training. Single-hop knowledge does not automatically compose into multi-hop capability. This bears on the related note "Do language models actually use their encoded knowledge?": encoding facts individually doesn't guarantee they compose.
Grokking provides parallel three-phase evidence. The "Progress Measures for Grokking via Mechanistic Interpretability" paper reverse-engineers the grokking phenomenon in transformers trained on modular addition, revealing three continuous phases that closely parallel the three developmental stages above: (1) memorization: the model fits training data quickly; (2) circuit formation: structured mechanisms gradually amplify in the weights as the generalizing circuit emerges; and (3) cleanup: memorizing components are removed. The parallel between memorization → ID generalization → cross-distribution reasoning and memorization → circuit formation → cleanup suggests a shared underlying dynamic: generalization requires extended training well beyond the point of memorization, and proceeds through the gradual formation of structured internal mechanisms. The grokking paper supports this with a mechanistic explanation: the generalizing circuit uses discrete Fourier transforms and trigonometric identities. See the related note "What happens inside models when they suddenly generalize?"
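To see why Fourier features and trigonometric identities are enough for modular addition, the following sketch computes (a + b) mod p by hand using only sines and cosines. It is a hand-written illustration of the idea, not the trained network's weights: the function name, the modulus, and the chosen frequencies are arbitrary assumptions.

```python
import numpy as np

def fourier_mod_add_logits(a, b, p, freqs):
    """Score every candidate c in [0, p) by sum_k cos(2*pi*k*(a + b - c) / p)."""
    c = np.arange(p)
    w = 2 * np.pi * np.asarray(freqs, dtype=float)[:, None] / p   # shape (k, 1)
    # Angle-addition identities give cos(w(a+b)) and sin(w(a+b))
    # from per-input features cos(wa), sin(wa), cos(wb), sin(wb).
    cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
    sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
    # cos(w(a+b-c)) = cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc), summed over frequencies.
    return (cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)).sum(axis=0)

p = 31                 # illustrative prime modulus (assumption)
freqs = [1, 5, 12]     # arbitrary nonzero frequencies (assumption)
logits = fourier_mod_add_logits(20, 17, p, freqs)
prediction = int(np.argmax(logits))
print(prediction, (20 + 17) % p)
```

Each cosine term reaches its maximum of 1 exactly when a + b - c is a multiple of p, so the summed score peaks at c = (a + b) mod p. Using several frequencies widens the margin between the correct answer and the rest.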
The three-stage trajectory has implications for understanding RL-trained reasoning models. If base models already contain latent reasoning ability (see "Do base models already contain hidden reasoning ability?"), the question becomes: which stage does RL training target? If RL primarily accelerates Phase II (ID generalization), that would help explain why different RL algorithms perform similarly (see "Does the choice of RL algorithm actually matter for reasoning?"): different algorithms may trigger the same phase transition.
Inquiring lines that use this note as a source (24)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- What is selective resonance and why do transformers not perform it?
- How do transformers perform multi-hop reasoning across distant training documents?
- Do transformers learn generalizable algorithms or instance-based patterns?
- What graph structures better support multi-hop reasoning than pairwise edges?
- Do grokking phases correspond to transitions between nesting levels?
- How does error propagation limit transformer performance on complex tasks?
- Can symbolic mechanisms improve transformer compositional abilities?
- How do humans and LMs differ on multi-hop reasoning?
- Can we detect and measure circuit formation before generalization emerges?
- What hidden computations happen inside transformer layers during reasoning?
- Can we decode what individual circuits inside transformers are doing?
- Why are pairwise relations insufficient for representing higher-order multi-hop reasoning?
- How does layer removal affect transformers compared to ResNets?
- How do transformers generate harder solutions when mostly trained on easier problems?
- What explains the contextual variability of knowledge in transformers?
- Why does second-hop reasoning fail when composed with out-of-distribution triples?
- Does grokking in modular arithmetic follow the same three-phase learning trajectory?
- Can single-hop knowledge automatically compose into multi-hop capability?
- What data properties enable transformers to learn sequential decision-making in context?
- How do transformers stitch together learned behaviors when adapting to new tasks?
- Can energy-based transformers achieve deep reasoning without supervision?
- Can sparse attention methods be designed specifically for multi-hop reasoning tasks?
- Does Gemma's transformer explicitly exploit the inherited hierarchical geometry?
- What computational stages does a looped block re-enact across multiple iterations?
Related concepts in this collection (5)
- Do language models actually use their encoded knowledge?
  Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
  encoding ≠ composition; this paper shows the mechanism for when composition emerges
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  three-stage emergence framework for understanding what "unlocking" means
- Does the choice of RL algorithm actually matter for reasoning?
  Expert Iteration, PPO, and RC-RL show similar performance on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper, like the pretrained model itself, sets the real limits.
  RL may target specific phase transitions in the emergence trajectory
- Do reasoning cycles in hidden states reveal aha moments?
  What if the internal loops in model reasoning, visible in hidden-state topology, correspond to the reconsidering moments that happen during reasoning? This note explores whether graph cyclicity captures a mechanistic signature of insight.
  cosine clustering is a representational-level analogue to the topological "aha moment"
- Can neural networks learn compositional skills without symbolic mechanisms?
  Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.
  shared condition: both findings show compositional reasoning requires training exposure to the compositional structure, not just individual components; query-level exposure (this note) and task-space coverage (that note) are the same constraint at different scales
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- How do Transformers Learn Implicit Reasoning?
- Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
- Implicit Chain of Thought Reasoning via Knowledge Distillation
- Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
- A Mechanistic Analysis of Looped Reasoning Language Models
- Do Large Language Models Latently Perform Multi-Hop Reasoning?
- Faith and Fate: Limits of Transformers on Compositionality
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?
Original note title
implicit multi-hop reasoning in transformers emerges through three developmental stages with cosine clustering as the mechanistic signature