SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation · Training, RL, and Test-Time Scaling · Model Architecture and Internals

Does RL improve domain reasoning by adding knowledge or removing it?

When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.

Synthesis note · 2026-02-21 · sourced from Domain Specialization
How do you build domain expertise into general AI models? How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

The Knowledge or Reasoning paper's KI/InfoGain framework scores the correctness of the knowledge a reasoning trace uses (KI) separately from the quality of the reasoning itself (InfoGain). That separation allows a more precise account of what RL training does to domain reasoning than "RL makes models better at reasoning." The specific finding: RL enhances medical reasoning by pruning inaccurate or irrelevant knowledge from reasoning paths, improving both reasoning accuracy and knowledge correctness (average KI gain of +12.4 points). RL does not appear to add new domain facts the model didn't know; it makes the model less likely to invoke incorrect domain knowledge when reasoning.
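To make the knowledge/reasoning split concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. It assumes KI can be approximated as the share of reasoning steps whose cited knowledge is verified correct, and InfoGain as the mean per-step drop in surprisal of the gold answer. The `Step` structure, the probabilities, and the two example traces are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Step:
    """One reasoning step (hypothetical structure, for illustration only)."""
    claim_is_correct: bool  # was the knowledge cited in this step verified as correct?
    p_answer_after: float   # assumed model probability of the gold answer after this step


def knowledge_index(steps):
    """Toy KI: fraction of steps whose cited knowledge is correct."""
    if not steps:
        return 0.0
    return sum(s.claim_is_correct for s in steps) / len(steps)


def info_gain(steps, p_answer_before):
    """Toy InfoGain: mean per-step reduction in surprisal (-log p) of the gold answer."""
    if not steps:
        return 0.0
    prev = -math.log(p_answer_before)
    gains = []
    for s in steps:
        cur = -math.log(s.p_answer_after)
        gains.append(prev - cur)  # positive when the step made the answer more likely
        prev = cur
    return sum(gains) / len(gains)


# Hypothetical traces: the second is the first with the inaccurate step pruned.
p0 = 0.25
trace_with_wrong_claim = [Step(True, 0.30), Step(False, 0.20), Step(True, 0.60)]
trace_pruned = [Step(True, 0.30), Step(True, 0.60)]

ki_before = knowledge_index(trace_with_wrong_claim)
ki_after = knowledge_index(trace_pruned)
ig_before = info_gain(trace_with_wrong_claim, p0)
ig_after = info_gain(trace_pruned, p0)
```

In this toy setup, dropping the one inaccurate step raises both scores without introducing any fact that was absent from the original trace, which is the shape of the pruning claim.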

This is a structurally different claim from most framings of RL's role. The standard story is that SFT gives the model capability and RL aligns that capability to desired behavior. But in domain-specific contexts, the alignment function is narrower: RL suppresses the wrong pattern activations during reasoning rather than teaching the model new things.

The medical AI context makes this clear. Medical reasoning is knowledge-dominant: across benchmarks, knowledge accuracy correlates more strongly with final accuracy than reasoning quality does. SFT raises knowledge levels (KI +6.2% on medical tasks) but also introduces verbose or suboptimal reasoning, reducing InfoGain. RL corrects this: it rewards factual correctness and penalizes paths that introduce inaccurate medical claims, effectively performing a kind of knowledge path surgery on the training distribution.

This connects to, but is distinct from, Does policy entropy collapse limit reasoning performance in RL?, which describes RL's effect on exploration diversity during training. That note is about the training dynamics that limit scaling; this one is about the mechanism through which RL improves domain-specific output quality. The two claims operate at different levels: collapse is a training-time constraint, while pruning is a mechanism-of-action description.

The practical implication: for knowledge-intensive domains, RL is not an optional enhancement on top of SFT; it is the correction mechanism that compensates for SFT's tendency to memorize answer patterns rather than reason correctly. Does supervised fine-tuning actually improve reasoning quality? documents the problem RL is solving.

The "RL Squeezes, SFT Expands" paper provides graph-topology evidence for this pruning mechanism. RL training compresses the diversity of reasoning paths while SFT expands them, and this compression is the pruning. RL doesn't add new paths; it removes low-reward ones, concentrating probability mass on the subset of reasoning trajectories that lead to correct outcomes. Read alongside Does negative reinforcement alone outperform full reinforcement learning?, this suggests the pruning mechanism may be RL's primary contribution: suppressing wrong paths may matter more than reinforcing right ones.
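The pruning picture can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not the method of either paper: the policy is a plain probability table over four hypothetical reasoning paths, and the "update" is a multiplicative down-weighting of negatively rewarded paths followed by renormalization, standing in for a negative-reinforcement-only gradient step. The path names, rewards, and `lr` value are invented for illustration.

```python
import math


def entropy(policy):
    """Shannon entropy (in nats) of a distribution over reasoning paths."""
    return -sum(p * math.log(p) for p in policy.values() if p > 0)


def prune_step(policy, rewards, lr=0.5):
    """Down-weight negatively rewarded paths, leave the rest untouched, renormalize.

    Assumption: a stand-in for negative-only reinforcement, not a real RL update.
    """
    scaled = {
        path: prob * (math.exp(-lr) if rewards[path] < 0 else 1.0)
        for path, prob in policy.items()
    }
    total = sum(scaled.values())
    return {path: value / total for path, value in scaled.items()}


# Hypothetical starting policy (e.g. after SFT) and hypothetical outcome rewards.
initial_policy = {
    "correct_a": 0.25,
    "correct_b": 0.25,
    "wrong_fact": 0.30,
    "irrelevant": 0.20,
}
rewards = {"correct_a": 1, "correct_b": 1, "wrong_fact": -1, "irrelevant": -1}

policy = dict(initial_policy)
for _ in range(10):
    policy = prune_step(policy, rewards)

correct_mass = policy["correct_a"] + policy["correct_b"]
entropy_before = entropy(initial_policy)
entropy_after = entropy(policy)
```

No path is added and no correct path is directly boosted, yet probability mass concentrates on the correct paths and entropy falls purely through renormalization. That is the sense in which compression and pruning can be the same operation in this simplified setting.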

Original note title

rl improves domain reasoning by pruning inaccurate knowledge from reasoning paths not by adding capability