INQUIRING LINE

Why does general reasoning not transfer to knowledge-intensive medical domains?

This explores why a model that reasons well on math or logic still stumbles in medicine — and the corpus's answer is that medicine is bottlenecked by what the model *knows*, not by how well it *thinks*.


The short version from the corpus: medicine and math fail for opposite reasons. Math is *reasoning-dominant*: performance tracks the quality of the reasoning chain. Medicine is *knowledge-dominant*: accuracy correlates more strongly with whether the model holds the correct domain facts than with how elegantly it reasons over them (see "Does medical AI need knowledge or reasoning more?"). So when you take a model fine-tuned to reason on math problems and point it at clinical questions, the reasoning skill arrives intact but lands on top of a shaky factual base, and the base is what's actually graded (see "Why doesn't mathematical reasoning transfer to medicine?").
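To make "knowledge-dominant" versus "reasoning-dominant" concrete, here is a minimal sketch that asks which of two per-question scores tracks correctness more closely. It is not the KI/InfoGain metric from the source note; it uses a plain Pearson correlation as a stand-in. The function names and the toy scores are assumptions made up for illustration, not measurements.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / sqrt(var_x * var_y)


def dominant_factor(correct: Sequence[float],
                    knowledge: Sequence[float],
                    reasoning: Sequence[float]) -> str:
    """Report which score correlates more strongly with correctness."""
    r_knowledge = abs(pearson(knowledge, correct))
    r_reasoning = abs(pearson(reasoning, correct))
    if r_knowledge == r_reasoning:
        return "tie"
    return "knowledge" if r_knowledge > r_reasoning else "reasoning"


# Toy, made-up scores for four questions (1 = good, 0 = bad).
# Here correctness follows the knowledge score exactly and ignores reasoning.
toy_correct = [1, 1, 0, 0]
toy_knowledge = [1, 1, 0, 0]
toy_reasoning = [1, 0, 1, 0]
print(dominant_factor(toy_correct, toy_knowledge, toy_reasoning))
```

In a real audit the two score lists would come from graded model outputs; the toy lists only show the shape of the comparison.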

There's a striking mechanistic reason this happens, and it's about *where* in the network these two things live. In the two-phase inference account the corpus cites, knowledge retrieval operates in the lower layers of the network and reasoning adjustment happens in the higher layers. Reasoning training reshapes the upper layers without refilling the lower ones, which is why the same training that improves math can quietly *degrade* a knowledge-heavy domain like medicine (see "Why does reasoning training help math but hurt medical tasks?"). You're tuning the part of the machine that wasn't the bottleneck.

Why doesn't reasoning skill itself carry knowledge along with it? Because the two are acquired differently in the first place. Reasoning generalizes because it draws on broad, transferable *procedural* knowledge scattered across many pretraining documents: the 'how to work through this' patterns. Factual recall depends on narrow, document-specific memorization of the exact target fact (see "Does procedural knowledge drive reasoning more than factual retrieval?"). A model can learn a portable procedure from a million sources but cannot infer a drug interaction it never memorized. Worse, models trained mostly on general text are not just wrong in specialized domains, they're *confidently* wrong, and the prompting tricks that help in general settings fail to dent that overconfidence (see "Why do language models fail confidently in specialized domains?").

Here's the part you might not expect: the most effective domain interventions don't add knowledge at all, they *remove* it. Reinforcement learning improves medical reasoning largely by pruning reasoning paths that invoke wrong facts, suppressing bad domain knowledge rather than expanding what the model knows (one study reports a +12.4 point knowledge improvement from removing low-reward reasoning trajectories) (see "Does RL improve domain reasoning by adding knowledge or removing it?"). That reframes the whole problem: if the win comes from filtering out bad recall, the constraint was never reasoning capacity.
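The pruning idea can be sketched as a simple filter over sampled reasoning trajectories. This is an analogy under stated assumptions, not an RL algorithm and not the method of the cited study: the `Trajectory` type, the scalar `reward`, and the placeholder step names are all hypothetical. The point it illustrates is that filtering can only remove steps from the pool; it never introduces a fact that was not already being sampled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    steps: List[str]   # the facts or moves the reasoning chain invoked
    reward: float      # hypothetical score, e.g. 1.0 if the final answer was right


def prune_low_reward(trajectories: List[Trajectory],
                     min_reward: float) -> List[Trajectory]:
    """Keep only trajectories whose reward meets the threshold."""
    return [t for t in trajectories if t.reward >= min_reward]


def step_usage(trajectories: List[Trajectory]) -> Dict[str, int]:
    """Count how often each step appears across a set of trajectories."""
    counts: Dict[str, int] = {}
    for trajectory in trajectories:
        for step in trajectory.steps:
            counts[step] = counts.get(step, 0) + 1
    return counts


# Placeholder trajectories; "fact_x (wrong)" stands in for a bad recalled fact.
sampled = [
    Trajectory(["fact_x (wrong)", "conclude"], reward=0.0),
    Trajectory(["fact_y", "conclude"], reward=1.0),
    Trajectory(["fact_y", "check", "conclude"], reward=1.0),
]
kept = prune_low_reward(sampled, min_reward=0.5)
print(step_usage(kept))
```

After pruning, the wrong fact drops out of the pool while every surviving step was already present before, which is the "remove, don't add" pattern the paragraph above describes.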

The corpus also hints at two escape routes. One is to stop relying on the model's internal store entirely: interleave reasoning with live external lookups (a Wikipedia query, a tool call) so each step is grounded in real facts instead of memorized guesses. The source note reports this approach outperforming pure chain-of-thought and reinforcement learning by 10–34% absolute accuracy on knowledge-intensive and interactive tasks (see "Can interleaving reasoning with real-world feedback prevent hallucination?"). The other is to fix the pretraining itself: expert texts are the surface residue of hidden expert thinking, and reconstructing that buried reasoning-plus-recall during training produces skills that *do* transfer across domains (see "Can reconstructing expert thinking improve reasoning transfer?"). The throughline: in medicine, smarter thinking can't substitute for knowing the right thing.
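The first escape route, alternating a reasoning step with an external lookup, can be sketched as a small loop. This is a minimal illustration in the spirit of the interleaving described above, not the ReAct implementation: `toy_policy` stands in for the model, `lookup` stands in for a real tool such as a search API, and the knowledge-base entry is a placeholder rather than a medical fact. All of these names are assumptions.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an external source. The entry is a placeholder.
TOY_KNOWLEDGE_BASE: Dict[str, str] = {
    "drug_a interactions": "placeholder: drug_a interacts with drug_b",
}

Step = Tuple[str, str]  # (query sent to the tool, observation returned)


def lookup(query: str) -> str:
    """Return the stored observation, or an explicit not-found marker."""
    return TOY_KNOWLEDGE_BASE.get(query.lower(), "NOT_FOUND")


def toy_policy(question: str, history: List[Step]) -> Tuple[str, str]:
    """Stand-in for the model: choose the next action from what was observed.

    Returns (action, argument), where action is 'lookup' or 'answer'.
    """
    if not history:
        return ("lookup", question)
    last_observation = history[-1][1]
    if last_observation == "NOT_FOUND":
        return ("answer", "insufficient evidence")
    return ("answer", last_observation)


def run_interleaved(question: str,
                    policy: Callable[[str, List[Step]], Tuple[str, str]],
                    tool: Callable[[str], str],
                    max_steps: int = 4) -> Tuple[str, List[Step]]:
    """Alternate policy decisions with tool calls until the policy answers."""
    history: List[Step] = []
    for _ in range(max_steps):
        action, argument = policy(question, history)
        if action == "answer":
            return argument, history
        history.append((argument, tool(argument)))
    return "no answer within step budget", history


answer, trace = run_interleaved("drug_a interactions", toy_policy, lookup)
print(answer)
```

The design choice worth noticing is that the answer is built from the tool's observation rather than from anything the policy "remembers", and a failed lookup yields an explicit "insufficient evidence" instead of a guess.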


Sources (8 notes)

Does medical AI need knowledge or reasoning more?

The KI/InfoGain framework reveals that medical domain accuracy correlates more strongly with knowledge correctness than reasoning quality, while mathematical domains show the inverse pattern. This distinction has direct implications for which training strategies to prioritize in each domain.

Why doesn't mathematical reasoning transfer to medicine?

R1-distilled reasoning models fail to outperform base models on medical tasks because knowledge accuracy matters more than reasoning quality in medicine—the opposite of math. Fine-tuning cannot close this gap without domain-specific training data.

Why does reasoning training help math but hurt medical tasks?

Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Why do language models fail confidently in specialized domains?

LLMs trained on general text lack sufficient exposure to domain-specific examples, leading to low accuracy paired with high confidence in clinical NLI tasks. Prompting techniques that improved general performance fail to reduce overconfidence in specialized domains.

Does RL improve domain reasoning by adding knowledge or removing it?

RL enhances medical reasoning by suppressing incorrect domain knowledge during reasoning—not by expanding what models know. Evidence shows RL achieves +12.4 point knowledge improvement by removing low-reward reasoning trajectories that invoke wrong facts.

Can interleaving reasoning with real-world feedback prevent hallucination?

ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.

Can reconstructing expert thinking improve reasoning transfer?

Training on expert texts augmented with reconstructed thought processes (self-talk, knowledge recall, verification) produces reasoning skills that transfer across domains and adapt depth to problem difficulty, outperforming standard continual pretraining by up to 8 points on hard problems.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher auditing claims about why general reasoning fails in knowledge-intensive medical domains. The question remains open: what is the actual bottleneck—reasoning capacity, factual recall, or something else entirely?

What a curated library found — and when (dated claims, not current truth):

Findings span 2023–2025. A curated library identified these recurring constraints:

• Medicine is knowledge-dominant, not reasoning-dominant; accuracy correlates far more with correct domain facts than reasoning quality, unlike math (2025, arXiv:2506.02126).

• Knowledge retrieval operates in lower network layers; reasoning adjustment in higher layers. Reasoning fine-tuning reshapes upper layers without refilling lower ones, sometimes *degrading* knowledge-heavy domains (2025, arXiv:2507.18178).

• Procedural knowledge (how-to patterns) generalizes across pretraining documents; factual recall depends on narrow, document-specific memorization. Models cannot infer facts they never explicitly memorized (2024, arXiv:2411.12580).

• RL improves medical reasoning by *pruning* bad reasoning paths (+12.4 point gain), not expanding internal knowledge; the constraint was filtering, not capacity (2024–2025 range).

• Interleaved reasoning + live external tool use beats pure chain-of-thought by 10–34% on knowledge tasks; grounding each step in real facts bypasses hallucination (2025, arXiv:2506.12115).

Anchor papers (verify; mind their dates):
- arXiv:2506.02126 (2025): Knowledge or Reasoning? A Close Look
- arXiv:2507.18178 (2025): Decoupling Knowledge and Reasoning
- arXiv:2411.12580 (2024): Procedural Knowledge in Pretraining
- arXiv:2505.20296 (2025): Reasoning LLMs are Wandering Solution Explorers

Your task:

(1) RE-TEST EACH CONSTRAINT. For the claim that knowledge resides in lower layers and reasoning in higher ones, does that mapping still hold under o1-style reasoning models, post-training on synthetic expert reasoning, or recent tool-use architectures? Separately: have newer medical LLMs (HuatuoGPT-o1, domain-specific variants) overcome the knowledge bottleneck via continual pretraining, retrieval augmentation, or knowledge graphs, and if so, which papers ground that escape? Flag where constraints remain rock-solid.

(2) Surface the strongest CONTRADICTING work from the last ~6 months—any paper showing reasoning *does* transfer to medicine, or that the knowledge/reasoning split is a false dichotomy, or that the layer hypothesis breaks down under certain conditions.

(3) Propose 2 research questions that ASSUME the regime may have shifted: (a) If reasoning LLMs now wander solution space rather than execute fixed chains, does that wandering naturally surface missing knowledge, and can it be leveraged? (b) Do synthetic expert texts reconstructed from pretraining actually encode domain-specific reasoning *and* facts jointly, collapsing the transfer problem?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines