SYNTHESIS NOTE

Why does reasoning training help math but hurt medical tasks?

Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.

Synthesis note · 2026-02-22 · sourced from Reasoning by Reflection

The Decoupling Knowledge and Reasoning paper proposes a testable two-phase model of LLM inference by contrasting fast thinking (no chain-of-thought) with slow thinking (CoT-enabled). Fast thinking engages Phase 1 only: knowledge retrieval from lower network layers. Slow thinking adds Phase 2: reasoning adjustment in higher layers. Comparing the two isolates each phase's contribution.
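The fast/slow comparison described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual metric: it assumes each question is scored 0 or 1 under both modes, takes fast-thinking accuracy as the Phase 1 score, and takes the accuracy difference as the Phase 2 contribution. The function name `phase_contributions` and the toy outcomes are hypothetical and invented only to show the sign convention.

```python
from statistics import mean

def phase_contributions(fast_correct, slow_correct):
    """Return (knowledge_score, reasoning_gain) for one domain.

    fast_correct / slow_correct: per-question 0/1 outcomes for the same
    questions, answered without and with chain-of-thought respectively.
    Assumption: a plain accuracy difference stands in for the paper's metric.
    """
    if len(fast_correct) != len(slow_correct):
        raise ValueError("both modes must be scored on the same questions")
    knowledge_score = mean(fast_correct)                   # Phase 1 only
    reasoning_gain = mean(slow_correct) - knowledge_score  # what Phase 2 adds
    return knowledge_score, reasoning_gain

# Made-up toy outcomes, NOT results from the paper.
toy_outcomes = {
    "math": ([0, 0, 1, 0], [1, 1, 1, 0]),
    "medicine": ([1, 1, 1, 0], [1, 0, 1, 0]),
}

for domain, (fast, slow) in toy_outcomes.items():
    knowledge, gain = phase_contributions(fast, slow)
    print(f"{domain}: knowledge={knowledge:.2f}, reasoning gain={gain:+.2f}")
```

A positive gain means reasoning adjustment corrected errors left by retrieval; a negative gain means it overrode answers that retrieval had already gotten right.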

Across 15 LLMs and 3 datasets, the paper reports three findings:

Domain-specificity of reasoning benefit: Phase 2 (reasoning adjustment) helps math, physics, and chemistry but can impair performance on knowledge-intensive domains. In medical tasks, the Phase 1 knowledge retrieved may be more reliable than the Phase 2 reasoning applied on top of it — reasoning adjustment introduces error rather than correcting it.

Scaling asymmetry: parameter scaling improves both phases, but knowledge improvement (Phase 1) dominates. Larger models know more, and this knowledge advantage outpaces the reasoning advantage. Scaling makes models more "prudent" (better at not making errors) across all domains, but only "more intelligent" (better at novel inference) in reasoning-intensive ones.

Layer localization: knowledge retrieval is primarily a lower-layer phenomenon; reasoning adjustment operates in higher layers. This is a functional architectural separation — not just a behavioral one.

Layer localization offers a mechanistic explanation for the SFT knowledge gap. On this account, CoT fine-tuning and RLVR modify higher-layer behavior and leave the lower-layer knowledge encoding that knowledge-intensive tasks depend on largely unchanged. Adding reasoning training to a model that lacks medical knowledge won't close the knowledge gap, because it modifies layers that are not the bottleneck.

Architectural evidence for layer redundancy: "The Unreasonable Ineffectiveness of the Deeper Layers" (2403.17887) provides striking corroboration. Up to half of an LLM's layers can be pruned with minimal degradation on question-answering benchmarks using a simple strategy: identify the optimal block of layers to prune by cross-layer similarity, then heal the pruned model with QLoRA fine-tuning on a single A100 GPU. This implies either that current pretraining methods are not properly leveraging the parameters in deeper layers, or that shallow layers play a disproportionately critical role in storing knowledge. Both interpretations are consistent with the functional separation: if knowledge resides in lower layers, the deeper layers' contribution may be primarily redundant refinement rather than essential computation.
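The block-selection step of that pruning strategy can be sketched as follows. This is a simplified illustration under stated assumptions: it uses angular distance between the representation entering a block and the representation leaving it as the cross-layer similarity measure, works on a single vector per layer rather than averaging over a dataset, and omits the QLoRA healing step entirely. The names `angular_distance`, `best_block_to_prune`, and the toy hidden states are hypothetical.

```python
import math

def angular_distance(u, v):
    """Angle between two nonzero vectors, scaled to [0, 1] (0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cosine) / math.pi

def best_block_to_prune(hidden_states, n):
    """Return the start index of the n-layer block that changes the representation least.

    hidden_states[l] is assumed to be the representation entering layer l;
    the final entry is the output of the last layer. Removing layers
    l .. l+n-1 is cheapest when hidden_states[l] and hidden_states[l+n]
    already point in nearly the same direction.
    """
    if not 0 < n < len(hidden_states):
        raise ValueError("n must be between 1 and the number of layers")
    starts = range(len(hidden_states) - n)
    return min(
        starts,
        key=lambda l: angular_distance(hidden_states[l], hidden_states[l + n]),
    )

# Toy 2-D "hidden states" for a 4-layer model, made up for illustration.
toy_states = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0], [0.2, 1.0], [1.0, 1.0]]
print(best_block_to_prune(toy_states, 2))
```

On the toy states, the two layers starting at index 1 barely rotate the representation, so that block is the one selected for removal.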

Retrieval heads as mechanistic evidence: The "Retrieval Head" paper provides direct causal evidence for functional specialization inside the network. A sparse set of attention heads (<5%) is responsible for retrieving relevant information from long context. These retrieval heads are: (1) universal across model families; (2) intrinsic, in that they exist in short-context models and persist through context-length extension; (3) dynamically activated, with some always attending to the required information while others activate contextually; and (4) causal, in that pruning them causes hallucination while pruning non-retrieval heads has no effect. Retrieval heads strongly influence CoT reasoning (which requires referring back to prior context) but minimally affect tasks where the model generates from intrinsic knowledge. This is a specific mechanistic instantiation of the lower-layer knowledge retrieval function described above. See "What mechanism enables models to retrieve from long context?"
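One plausible way to score a head for this copy-from-context behavior is sketched below. The definition is an assumption made for illustration and is not taken from this note: a head gets credit for a needle token when, at the step that token is generated, the head's strongest attention lands on that same token inside the needle. The function `retrieval_score` and the toy inputs are hypothetical.

```python
def retrieval_score(needle, generated, top_attended):
    """Fraction of needle tokens that one head copies from context.

    needle: dict mapping context position -> token for the planted span.
    generated: tokens produced at each decoding step.
    top_attended: context position this head attends to most at each step.
    Assumed definition: credit a needle position when the head's top
    attention falls on it and the generated token matches the token there.
    """
    if not needle:
        raise ValueError("needle must contain at least one token")
    copied = set()
    for token, position in zip(generated, top_attended):
        if needle.get(position) == token:
            copied.add(position)
    return len(copied) / len(needle)

# Made-up example: a three-token needle at context positions 5 to 7.
needle = {5: "Paris", 6: "is", 7: "lovely"}
generated = ["Paris", "is", "nice"]

head_a = retrieval_score(needle, generated, top_attended=[5, 6, 7])
head_b = retrieval_score(needle, generated, top_attended=[0, 1, 2])
print(head_a, head_b)
```

Ranking heads by a score of this kind and keeping only the top few percent would give a candidate set for the ablation comparison the paragraph describes.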

Latent concept hierarchy: The "Discovering Latent Concepts Learned in BERT" paper (2205.07237) supports the layer hierarchy from a representation perspective. Lower layers dominate in learning shallow lexical concepts, while higher layers learn semantic relations. Critically, BERT learns novel concepts (e.g., animal categories, demographic groups) that do not adhere to predefined categorizations: the model discovers its own organizational structure. Several latent concepts are based on multiple properties spanning semantics, syntax, and morphology simultaneously, suggesting that the layer separation is not clean but follows a general gradient.

The "Procedural Knowledge in Pretraining Drives Reasoning" paper provides a data-level explanation that complements this architectural finding. By ranking 5 million pretraining documents by their influence on model completions, the authors show that reasoning draws on a diffuse set of documents containing procedural knowledge (descriptions of how to solve a problem), while factual recall draws on narrow document sets containing the target fact. This maps directly onto the layer separation: lower layers store memorized facts (requiring document-specific exposure), while higher layers encode procedural strategies (learnable from general demonstrations of method). See "Does procedural knowledge drive reasoning more than factual retrieval?"

Related concepts in this collection 4

Does medical AI need knowledge or reasoning more?
Medical and mathematical domains may require fundamentally different AI training priorities. If medical accuracy depends primarily on factual knowledge while math depends on reasoning quality, should we build and evaluate these systems differently?
Layer localization is the mechanistic explanation for the behavioral pattern this note documents.

Why doesn't mathematical reasoning transfer to medicine?
Can models trained to reason well about math apply those skills to medical domains through fine-tuning? This explores whether reasoning ability is truly domain-agnostic or constrained by domain-specific knowledge requirements.
Transfer fails because SFT modifies higher-layer reasoning while the bottleneck is lower-layer knowledge; this paper makes that precise.

Do language models actually use their encoded knowledge?
Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
Layer localization explains the encoding-generation gap: knowledge in lower layers may be overridden by higher-layer reasoning adjustments that introduce error, producing the failure mode where the model "knows" the answer but generates an incorrect one.

Can text-trained models compress images better than specialized tools?
Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.
The compression framing maps onto the layer separation: lower layers compress facts (document-specific memorization), higher layers compress procedures (generalizable reasoning); the scaling caveat on adjusted compression may reflect redundancy in deeper layers.

Original note title

knowledge resides in lower network layers and reasoning in higher layers — this functional separation explains why reasoning training helps math but can impair knowledge-intensive domains