Why does general reasoning not transfer to knowledge-intensive medical domains?
This explores why a model that reasons well on math or logic still stumbles in medicine — and the corpus's answer is that medicine is bottlenecked by what the model *knows*, not by how well it *thinks*.
The short version from the corpus: medicine and math fail for opposite reasons. Math is *reasoning-dominant*: performance tracks the quality of the reasoning chain. Medicine is *knowledge-dominant*: accuracy correlates far more with whether the model holds the correct domain facts than with how elegantly it reasons over them (see: Does medical AI need knowledge or reasoning more?). So when you take a model fine-tuned to reason on math problems and point it at clinical questions, the reasoning skill arrives intact but lands on top of a shaky factual base, and the base is what's actually graded (see: Why doesn't mathematical reasoning transfer to medicine?).
There's a striking mechanistic reason this happens, and it's about *where* in the network these two things live. Knowledge retrieval operates in the lower layers of the network; reasoning adjustment happens in the higher layers. Reasoning training reshapes the upper layers without refilling the lower ones, which is exactly why the same training that improves math can quietly *degrade* a knowledge-heavy domain like medicine (see: Why does reasoning training help math but hurt medical tasks?). You're tuning the part of the machine that wasn't the bottleneck.
Why doesn't reasoning skill itself carry knowledge along with it? Because the two are acquired differently in the first place. Reasoning generalizes because it draws on broad, transferable *procedural* knowledge scattered across many pretraining documents: the 'how to work through this' patterns. Factual recall depends on narrow, document-specific memorization of the exact target fact (see: Does procedural knowledge drive reasoning more than factual retrieval?). A model can learn a portable procedure from a million sources but cannot infer a drug interaction it never memorized. Worse, models trained mostly on general text are not just wrong in specialized domains; they're *confidently* wrong, and the prompting tricks that help in general settings fail to dent that overconfidence (see: Why do language models fail confidently in specialized domains?).
Here's the part you might not expect: an effective domain intervention can work not by adding knowledge but by *removing* it. Reinforcement learning improves medical reasoning largely by pruning reasoning paths that invoke wrong facts, suppressing bad domain knowledge rather than expanding what the model knows (one study reports a +12.4 point knowledge improvement from removing low-reward trajectories) (see: Does RL improve domain reasoning by adding knowledge or removing it?). That reframes the whole problem: if the win comes from filtering out bad recall, the constraint was never reasoning capacity.
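The pruning idea can be illustrated with a minimal sketch. This is not an RL algorithm and not the cited study's method: it is a hard reward-threshold filter over sampled reasoning paths, used here only to show that removing low-reward paths lowers the share of reasoning that invokes a wrong fact without adding any new fact. The `Trajectory` type, the `reward` field, the threshold, and the toy samples are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    steps: List[str]  # reasoning steps as plain strings
    reward: float     # hypothetical scalar reward (1.0 = correct final answer)


def prune_low_reward(trajectories: List[Trajectory], threshold: float) -> List[Trajectory]:
    """Keep only trajectories whose reward meets the threshold."""
    return [t for t in trajectories if t.reward >= threshold]


def fact_usage(trajectories: List[Trajectory], fact: str) -> float:
    """Fraction of trajectories with at least one step mentioning `fact`."""
    if not trajectories:
        return 0.0
    hits = sum(any(fact in step for step in t.steps) for t in trajectories)
    return hits / len(trajectories)


# Toy, made-up samples: two paths recall a wrong fact, two recall the right one.
samples = [
    Trajectory(["recall wrong_fact", "conclude X"], reward=0.0),
    Trajectory(["recall right_fact", "conclude Y"], reward=1.0),
    Trajectory(["recall wrong_fact", "conclude X"], reward=0.0),
    Trajectory(["recall right_fact", "conclude Y"], reward=1.0),
]

kept = prune_low_reward(samples, threshold=0.5)
print(fact_usage(samples, "wrong_fact"))  # share of paths using the wrong fact before pruning
print(fact_usage(kept, "wrong_fact"))     # share of paths using the wrong fact after pruning
```

Every fact that appears in `kept` was already present in `samples`; the filter only changes which recalled facts survive. Actual RL training shifts probability mass toward high-reward paths rather than deleting samples outright, so treat the hard filter as a simplification.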
The corpus also hints at two escape routes. One is to stop relying on the model's internal store entirely: interleave reasoning with live external lookups (a Wikipedia query, a tool call) so each step is grounded in real facts instead of memorized guesses. The corpus reports that this outperforms pure chain-of-thought and reinforcement learning baselines by 10 to 34 points of absolute accuracy on knowledge-intensive and interactive tasks (see: Can interleaving reasoning with real-world feedback prevent hallucination?). The other is to fix the pretraining itself: expert texts are the surface residue of hidden expert thinking, and reconstructing that buried reasoning and recall during training produces skills that *do* transfer across domains (see: Can reconstructing expert thinking improve reasoning transfer?). The throughline: in medicine, smarter thinking can't substitute for knowing the right thing.
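The first escape route, interleaving reasoning with external lookups, can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the ReAct implementation: `TOY_KB` is a hypothetical in-memory stand-in for a real search API, `scripted_policy` is a hard-coded stand-in for the model, and the "facts" about `drug_a` and `drug_b` are made-up placeholders.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical external source; a real system would call a search API or tool here.
TOY_KB: Dict[str, str] = {
    "drug_a interactions": "drug_a interacts with drug_b",
    "drug_b class": "drug_b is an anticoagulant",
}

# One history entry: (thought, lookup query, observation returned by the lookup).
Step = Tuple[str, str, str]


def lookup(query: str) -> str:
    """Return the stored fact for a query, or an explicit miss."""
    return TOY_KB.get(query.lower(), "no result")


def scripted_policy(question: str, history: List[Step]) -> Tuple[str, str, str]:
    """Stand-in for the model: returns (thought, action, argument)."""
    if len(history) == 0:
        return ("I need the interactions of drug_a.", "lookup", "drug_a interactions")
    if len(history) == 1:
        return ("Now I need the class of drug_b.", "lookup", "drug_b class")
    # The answer is assembled only from observations gathered so far.
    facts = "; ".join(obs for _, _, obs in history)
    return ("I have enough grounded facts.", "finish", facts)


def react_loop(
    question: str,
    policy: Callable[[str, List[Step]], Tuple[str, str, str]],
    max_steps: int = 5,
) -> Tuple[str, List[Step]]:
    """Alternate thought -> lookup -> observation until the policy finishes."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, history)
        if action == "finish":
            return arg, history
        observation = lookup(arg)  # external feedback injected at this step
        history.append((thought, arg, observation))
    return "no answer within step budget", history


answer, trace = react_loop("Is drug_a safe with an anticoagulant?", scripted_policy)
print(answer)
```

The design point is that each observation comes from `lookup`, not from the policy's own memory, so a missing fact surfaces as an explicit "no result" instead of a confident guess.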
Sources (8 notes)
The KI/InfoGain framework reveals that medical domain accuracy correlates more strongly with knowledge correctness than reasoning quality, while mathematical domains show the inverse pattern. This distinction has direct implications for which training strategies to prioritize in each domain.
R1-distilled reasoning models fail to outperform base models on medical tasks because knowledge accuracy matters more than reasoning quality in medicine—the opposite of math. Fine-tuning cannot close this gap without domain-specific training data.
A two-phase inference model shows that knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
LLMs trained on general text lack sufficient exposure to domain-specific examples, leading to low accuracy paired with high confidence in clinical NLI tasks. Prompting techniques that improved general performance fail to reduce overconfidence in specialized domains.
RL enhances medical reasoning by suppressing incorrect domain knowledge during reasoning—not by expanding what models know. Evidence shows RL achieves +12.4 point knowledge improvement by removing low-reward reasoning trajectories that invoke wrong facts.
ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.
Training on expert texts augmented with reconstructed thought processes (self-talk, knowledge recall, verification) produces reasoning skills that transfer across domains and adapt depth to problem difficulty, outperforming standard continual pretraining by up to 8 points on hard problems.