INQUIRING LINE

How do transformers perform multi-hop reasoning across distant training documents?

This explores whether transformers genuinely chain facts learned from separate documents into new conclusions — and what mechanism (or trick) is actually doing the work when they appear to.


This reads the question as being about real multi-hop reasoning, meaning stitching together facts that were never seen together, rather than recall of a single passage. The corpus suggests transformers can do this, but only after a specific developmental sequence, and even then the mechanism is shakier than it looks. Controlled training studies find that the ability doesn't appear all at once: models pass through memorization, then in-distribution generalization, then genuine cross-distribution reasoning. Success correlates with a measurable 'cosine clustering' of related entity representations, and crucially, the second hop only generalizes if the model was explicitly exposed to compositional examples during training (How do transformers learn to reason across multiple steps?). So 'reasoning across distant documents' is something transformers learn to do, not something they do by default.
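One way to make the 'cosine clustering' marker concrete is to compare how similar representations are within a group of related entities versus across groups. The sketch below is a minimal illustration with hand-made three-dimensional vectors; the function names, the grouping labels, and the gap statistic itself are assumptions for illustration, not the measurement used in the studies summarized above.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clustering_gap(groups):
    """Mean within-group cosine similarity minus mean between-group similarity.

    `groups` maps a (hypothetical) group label to a list of vectors.
    A larger positive gap means related entities sit closer together.
    """
    within, between = [], []
    labelled = [(name, vec) for name, vecs in groups.items() for vec in vecs]
    for (g1, v1), (g2, v2) in combinations(labelled, 2):
        (within if g1 == g2 else between).append(cosine(v1, v2))
    return sum(within) / len(within) - sum(between) / len(between)


# Toy vectors standing in for entity representations (not real model data).
clustered = {
    "group_A": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]],
    "group_B": [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]],
}
# Same vectors, but with group membership shuffled so related items are far apart.
scattered = {
    "group_A": [[1.0, 0.1, 0.0], [0.0, 0.1, 1.0]],
    "group_B": [[0.9, 0.2, 0.0], [0.1, 0.0, 0.9]],
}

gap_clustered = clustering_gap(clustered)
gap_scattered = clustering_gap(scattered)
print(f"clustered gap: {gap_clustered:.3f}, scattered gap: {gap_scattered:.3f}")
```

In a real analysis the vectors would come from a chosen layer of the model at the entity's token position, and the gap would be tracked across training checkpoints; both choices are left open here.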

The uncomfortable counterpoint is what that learned ability actually is. One line of work argues that what looks like systematic composition is really linearized subgraph matching: the model memorizes computation paths from training and replays them, succeeding in-distribution but collapsing on novel combinations, with errors compounding at each hop (Do transformers actually learn systematic compositional reasoning?). This rhymes with findings that LLMs reason semantically, not symbolically: decouple the meaning from the logical structure and accuracy falls apart even when the correct rules are sitting right there in context (Do large language models reason symbolically or semantically?). Chain-of-thought, the usual scaffold for multi-step work, turns out to be distribution-bounded in the same way. It stays fluent but becomes logically inconsistent once you shift task, length, or format, because it imitates the form of reasoning rather than performing it (Does chain-of-thought reasoning actually generalize beyond training data?; What makes chain-of-thought reasoning actually work?).
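The phrase 'errors compounding at each hop' can be illustrated with a deliberately simple model. The sketch below assumes every hop succeeds independently with the same probability; that independence assumption and the function name are ours, and the numbers are illustrative rather than taken from any cited study.

```python
def chain_accuracy(per_hop_accuracy, hops):
    """Probability that every hop in a chain is correct.

    Assumes hops fail independently with identical accuracy,
    which is a simplification made only for illustration.
    """
    if not 0.0 <= per_hop_accuracy <= 1.0:
        raise ValueError("per_hop_accuracy must be between 0 and 1")
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return per_hop_accuracy ** hops


# Even a fairly reliable single hop decays quickly as the chain grows.
for hops in (1, 2, 3, 5):
    print(hops, round(chain_accuracy(0.9, hops), 3))
```

Under this toy model a 90%-reliable hop yields roughly 73% end-to-end accuracy at three hops, which is one way to see why long chains are fragile even when each step looks solid.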

There's a deeper architectural reason the 'distant documents' framing is tricky. Transformers don't file knowledge away as retrievable records. They transmit it as activations flowing through the residual stream, closer to an oral performance than a library lookup, which is why model knowledge is contextual, hard to edit, and inseparable from the act of generating (Do transformer models store knowledge or generate it continuously?). If knowledge is flow rather than storage, 'combining two facts' isn't retrieving two files and joining them; it's getting two activation patterns to interact in a single forward pass. That also helps explain a surprising fragility: reasoning accuracy drops sharply just from longer inputs, from 92% to 68% with about 3,000 tokens of irrelevant padding, far below the context limit. The more text the relevant facts are spread across, the worse the chaining gets (Does reasoning ability actually degrade with longer inputs?).
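The length-fragility finding rests on holding the reasoning task fixed while varying only the amount of irrelevant text around it. The sketch below shows one way such inputs could be built; `build_padded_prompt`, the filler sentence, and the word-count budget are hypothetical choices for illustration and do not reproduce the benchmark's actual construction.

```python
def build_padded_prompt(facts, question, filler, target_words):
    """Spread the key facts through filler until the context has at least
    `target_words` words, then append the question.

    The facts and question never change; only the irrelevant padding grows.
    """
    base_words = sum(len(fact.split()) for fact in facts)
    filler_words = len(filler.split())
    # Ceiling division: number of filler sentences needed to reach the budget.
    n_fillers = max(0, -(-(target_words - base_words) // filler_words))
    # Distribute filler sentences as evenly as possible before each fact.
    per_gap, extra = divmod(n_fillers, len(facts))
    parts = []
    for i, fact in enumerate(facts):
        parts.extend([filler] * (per_gap + (1 if i < extra else 0)))
        parts.append(fact)
    return " ".join(parts) + "\n\nQuestion: " + question


facts = ["Alice is older than Bob.", "Bob is older than Carol."]
question = "Is Alice older than Carol?"
filler = "The weather report mentioned light rain."

prompt_short = build_padded_prompt(facts, question, filler, target_words=0)
prompt_long = build_padded_prompt(facts, question, filler, target_words=300)
print(len(prompt_short.split()), len(prompt_long.split()))
```

Scoring the same model on `prompt_short` and `prompt_long` variants of many questions would isolate the effect of length from the difficulty of the reasoning itself.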

Where does the actual computation happen? Logit-lens probing of models trained with hidden chain-of-thought tokens shows they can compute the correct multi-step answer in their earliest layers (layers 1-3), then suppress it in the final layers to produce format-compliant filler. The reasoning is real and recoverable from lower-ranked token predictions, but the visible output sometimes hides it (Do transformers hide reasoning before producing filler tokens?). This is the surprising part for a curious reader: the chain-of-thought you see on the page is not necessarily where the reasoning lives.
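The logit-lens idea is to read out intermediate hidden states through the model's output projection, as if each layer were the last. The sketch below is a toy version in plain Python: the vocabulary, the identity-like unembedding rows, and the hidden states are invented to mimic the reported pattern, and the normalization omits the learned gain and bias a real model's final layer norm would have.

```python
import math


def layer_norm(h, eps=1e-5):
    """Normalize a hidden state to zero mean and unit variance (no learned parameters)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]


def logit_lens(hidden_by_layer, unembed, vocab, top_k=2):
    """For each layer, project the hidden state to vocabulary logits
    and return the top_k tokens, highest logit first.

    `unembed` holds one weight vector per vocabulary token.
    """
    readout = []
    for h in hidden_by_layer:
        normed = layer_norm(h)
        logits = [sum(x * w for x, w in zip(normed, row)) for row in unembed]
        ranked = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
        readout.append([vocab[i] for i in ranked[:top_k]])
    return readout


# Toy setup: "7" plays the correct answer, "." plays the filler token.
vocab = ["7", ".", "the"]
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hidden_by_layer = [
    [2.0, 0.1, 0.0],  # early layer: answer direction dominates
    [1.0, 1.5, 0.0],  # final layer: filler direction overtakes the answer
]

readout = logit_lens(hidden_by_layer, unembed, vocab)
for layer, tokens in enumerate(readout, start=1):
    print(f"layer {layer}: {tokens}")
```

In this toy readout the answer token ranks first at the early layer and second at the final layer, which mirrors the claim that the answer stays recoverable from lower-ranked predictions even when the top prediction is filler.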

The most interesting thread concerns the workarounds for the limits above. Quiet-STaR pushes reasoning into pretraining itself, teaching the model to generate rationales at every token on arbitrary internet text so multi-hop competence emerges as a side effect of better language modeling rather than from task-specific data (Can models learn reasoning from predicting any text?). On the retrieval side, hypergraph memory binds three or more facts into a single relation so multi-step evidence accumulates without being decomposed into lossy pairwise links, explicitly engineering the cross-document binding the base architecture struggles with (Can hypergraphs capture multi-hop reasoning better than graphs?). And a separate camp questions the transformer itself: a hierarchical recurrent model couples slow planning with fast computation to escape the fixed-depth complexity ceiling that caps how many genuine reasoning steps a standard transformer can chain at all (Can recurrent hierarchies achieve reasoning that transformers cannot?). Read together, the corpus says transformers fake multi-hop reasoning well within their training distribution and break outside it. The frontier is split among three responses: teaching the chaining earlier, storing facts in a more combinable structure, and changing the architecture so the depth limit stops biting.
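The hypergraph idea can be sketched as a data structure in which one edge holds any number of entities, so a three-way fact is stored whole rather than split into pairs. The classes and method names below are hypothetical and only illustrate the binding; they are not the HGMem system's interface.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Hyperedge:
    """A single relation binding two or more entities at once."""
    relation: str
    entities: frozenset


@dataclass
class HypergraphMemory:
    """Accumulates hyperedges across retrieval steps (illustrative only)."""
    edges: list = field(default_factory=list)

    def add(self, relation, *entities):
        """Store a relation over all given entities as one edge, skipping duplicates."""
        edge = Hyperedge(relation, frozenset(entities))
        if edge not in self.edges:
            self.edges.append(edge)
        return edge

    def touching(self, entity):
        """All edges that mention the entity."""
        return [edge for edge in self.edges if entity in edge.entities]

    def neighbours(self, entity):
        """Every entity co-bound with the given one in any edge."""
        found = set()
        for edge in self.touching(entity):
            found |= edge.entities
        found.discard(entity)
        return found


memory = HypergraphMemory()
# A three-way fact stays intact instead of becoming three pairwise links.
memory.add("treated_with", "patient_X", "drug_Y", "dose_10mg")
memory.add("interacts_with", "drug_Y", "drug_Z")
print(sorted(memory.neighbours("drug_Y")))
```

Splitting the first fact into pairs would lose the information that the dose belongs to that particular patient-drug combination, which is the kind of loss the single-relation binding is meant to avoid.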


Sources (11 notes)

How do transformers learn to reason across multiple steps?

Controlled training reveals transformers learn multi-hop reasoning in three phases: memorization, in-distribution generalization, and cross-distribution reasoning. Successful reasoning correlates with cosine clustering of entity representations, and second-hop generalization requires explicit compositional exposure during training.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Can models learn reasoning from predicting any text?

Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.

Can hypergraphs capture multi-hop reasoning better than graphs?

HGMem organizes retrieved evidence as hyperedges rather than flat lists or binary graphs, allowing three or more entities to bind into single relations without decomposition. This structure accumulates coherent knowledge across retrieval steps, trading representational complexity for constraint expressiveness.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst revisiting the question: Do transformers genuinely perform multi-hop reasoning across distant training documents, or do they simulate it within learned distributions?

What a curated library found — and when (findings span 2023–2025; treat as dated claims):
• Transformers pass through three developmental stages: memorization → in-distribution generalization → cross-distribution reasoning, with entity-representation 'cosine clustering' as a measurable marker (2025).
• What appears to be compositional reasoning often reduces to linearized subgraph matching — models memorize computation paths and replay them, failing on novel combinations with compounding errors per hop (2024–2025).
• Chain-of-thought reasoning is distribution-bounded; accuracy degrades predictably when task, length, or format shifts, indicating imitation of reasoning form rather than true execution (2024–2025).
• Reasoning performance drops from 92% to 68% accuracy with just a few thousand tokens of irrelevant padding — far below context window limits — showing acute fragility to input length (2024).
• Logit-lens probing reveals correct multi-step answers computed in early layers, then overwritten by later layers to produce format-compliant outputs; the reasoning is real but hidden (2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.14825 (2023) — LLMs as in-context semantic (not symbolic) reasoners.
• arXiv:2402.14848 (2024) — Input length degrades reasoning even far below context window.
• arXiv:2403.09629 (2024) — Quiet-STaR: rationale generation at token level during pretraining.
• arXiv:2508.01191 (2025) — Chain-of-thought as distribution-bounded mirage; data lens analysis.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, determine whether newer training paradigms (instruction-tuning scale, constitutional AI, reasoning-specific pretraining), inference methods (beam search, best-of-N rollouts, test-time compute scaling), or architecture shifts (mixture-of-experts routing, explicit memory modules, sparse attention) have since relaxed or overturned these limits. Plainly distinguish the durable question — *can transformers learn genuine multi-step composition?* — from perishable limitations — *current models fail at distribution shift, input length, or novel chains* — and cite what broke each constraint.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months: papers claiming transformers do exhibit robust cross-distribution reasoning, or that scale/training alone resolves the subgraph-matching problem.
(3) Propose 2 research questions that assume the regime may have moved: one probing whether reasoning robustness has improved with post-training techniques; one asking whether the residual-stream-as-flow model still holds under modern optimization.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
