SYNTHESIS NOTE

Does supervised fine-tuning improve reasoning or just answers?

Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.

Synthesis note · 2026-02-21 · sourced from Domain Specialization
How do you build domain expertise into general AI models? How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

Post angle for Medium / LinkedIn

Hook: "Every AI benchmark measures accuracy. What if accuracy is exactly the wrong thing to measure when deploying AI in high-stakes domains?"

The finding: The Knowledge or Reasoning paper introduces two new metrics: Knowledge Index (KI), the factual correctness of each reasoning step, and Information Gain (InfoGain), how much each reasoning step reduces uncertainty about the final answer. Applying these metrics to SFT-trained models on medical and mathematical tasks, the authors find that SFT raises final-answer accuracy while cutting InfoGain by 38.9%. Models get more answers right, but each reasoning step contributes less toward reaching them.
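To make the two metrics concrete, here is a minimal Python sketch. It is not the paper's implementation: this note does not give the exact formulas, so the sketch assumes KI is the fraction of steps judged factually correct and InfoGain is the mean per-step drop in surprisal (in bits) of the correct answer. The function names and the toy inputs are hypothetical.

```python
import math


def knowledge_index(step_is_correct):
    """Assumed KI: fraction of reasoning steps judged factually correct."""
    if not step_is_correct:
        return 0.0
    return sum(1 for ok in step_is_correct if ok) / len(step_is_correct)


def info_gain(p_correct_by_step):
    """Assumed InfoGain: mean per-step drop in surprisal of the correct answer.

    p_correct_by_step[0] is the probability the model gives the correct
    answer before any reasoning; entry i is the probability after step i.
    All probabilities must be greater than zero.
    """
    surprisal = [-math.log2(p) for p in p_correct_by_step]
    gains = [surprisal[i] - surprisal[i + 1] for i in range(len(surprisal) - 1)]
    return sum(gains) / len(gains) if gains else 0.0


# Toy traces (hypothetical numbers, not results from the paper).
# In the first trace each step moves the model closer to the answer.
informative_trace = [0.25, 0.5, 1.0]
# In the second the model is already confident; the steps add little.
shortcut_trace = [0.9, 0.9, 0.9]

print(info_gain(informative_trace))
print(info_gain(shortcut_trace))
print(knowledge_index([True, True, False, True]))
```

Under these assumptions, a model that already "knows" the answer before reasoning scores near-zero InfoGain even when every step is factually correct, which is why the two metrics are reported separately.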

The mechanism: SFT rewards answers, not reasoning paths. The training data consists of question-answer pairs, and the loss function anchors on the correct final output. Models therefore learn the most efficient path to the right answer within the training distribution. Often that path is a domain-specific shortcut, a pattern match, or a frequency-weighted heuristic that produces the correct answer without the inferential chain that would justify it. The reasoning in the output becomes post-hoc rationalization.
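One way to picture "the loss anchors on the final output" is a token-level loss with a mask that covers only the answer tokens. The sketch below is an illustration under that assumption, not a description of any specific training setup; `masked_nll`, the masks, and the probabilities are all hypothetical.

```python
import math


def masked_nll(token_probs, loss_mask):
    """Mean negative log-likelihood over tokens where loss_mask is truthy.

    token_probs: probability the model assigns to each target token (> 0).
    loss_mask:   1 for tokens that count toward the loss, 0 otherwise.
    """
    if len(token_probs) != len(loss_mask):
        raise ValueError("token_probs and loss_mask must have the same length")
    terms = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    return sum(terms) / len(terms) if terms else 0.0


# Hypothetical per-token probabilities: weak on reasoning, strong on the answer.
reasoning_probs = [0.2, 0.3, 0.25]
answer_probs = [0.9, 0.95]
probs = reasoning_probs + answer_probs

answer_only_mask = [0, 0, 0, 1, 1]  # assumption: loss sees only the answer
full_mask = [1, 1, 1, 1, 1]         # for contrast: loss sees every token

print(masked_nll(probs, answer_only_mask))
print(masked_nll(probs, full_mask))
```

With the answer-only mask, the loss is already low and is unchanged by anything that happens in the reasoning tokens, so nothing in the objective pushes the model to make those steps carry the inference.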

Why this matters for deployment: High-stakes domains need auditable reasoning, not just correct answers. Medical decision support must show clinical logic. Legal AI must demonstrate how conclusions follow from statute and precedent. Financial AI must show how recommendations connect to market data and regulatory context. SFT improves the answer but may make the reasoning path less meaningful: verbose decoration around the correct output rather than the pathway that produced it.

The measurement problem: Standard benchmarks measure what is easy to measure: whether the final answer matches the ground truth. InfoGain and KI require decomposing reasoning chains and evaluating each step against external ground truth, which is expensive and difficult to automate at scale. So the measurement gap persists, and every organization that deploys based on benchmark accuracy alone is systematically blind to the regression in reasoning quality.

The connection: This extends the existing cluster of overthinking findings into the training dimension. "Does extended thinking actually improve reasoning or just increase variance?" covers the inference-time cost. "Does reasoning fine-tuning make models worse at declining to answer?" covers a different training-time cost, to calibration. The SFT accuracy trap is the third entry: a training-time cost to reasoning quality.

Platform: Medium (1000–1400 words). Could lead with the FALM / medical AI deployment angle, then introduce the measurement framework.

Original note title

the sft accuracy trap — training raises benchmark scores while degrading reasoning quality