SYNTHESIS NOTE

Does supervised fine-tuning improve reasoning or just answers?

Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.

Synthesis note · 2026-02-21 · sourced from Domain Specialization

Post angle for Medium / LinkedIn

Hook: "Every AI benchmark measures accuracy. What if accuracy is exactly the wrong thing to measure when deploying AI in high-stakes domains?"

The finding: The Knowledge or Reasoning paper introduces two metrics: Knowledge Index (KI), the factual correctness of each reasoning step, and Information Gain (InfoGain), how much each reasoning step reduces uncertainty about the final answer. Applied to SFT-trained models on medical and mathematical tasks, these metrics show that SFT raises final-answer accuracy while cutting InfoGain by 38.9%. Models get more answers right while each reasoning step contributes less information toward them.
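One way to make the InfoGain idea concrete is sketched below. It assumes uncertainty is measured as the surprisal (negative log2 probability) a model assigns to the correct final answer after each reasoning step, so a step's gain is the drop in surprisal it produces. This formulation, the function name `step_info_gain`, and the example probabilities are illustrative assumptions, not the paper's exact definition or data.

```python
import math

def step_info_gain(p_correct):
    """Per-step information gain, in bits.

    p_correct[0] is the probability of the correct answer before any reasoning;
    p_correct[i] is that probability after reasoning step i.
    Assumption: uncertainty is the surprisal -log2(p) of the correct answer.
    """
    if any(not (0.0 < p <= 1.0) for p in p_correct):
        raise ValueError("probabilities must lie in (0, 1]")
    surprisal = [-math.log2(p) for p in p_correct]
    # Gain of step i = uncertainty before the step minus uncertainty after it.
    return [surprisal[i] - surprisal[i + 1] for i in range(len(surprisal) - 1)]

# Hypothetical three-step trace (made-up numbers, for illustration only).
gains = step_info_gain([0.25, 0.5, 0.5, 1.0])
mean_gain = sum(gains) / len(gains)
```

In this toy trace the second step adds nothing: the correct answer is no more likely after it than before, which is the kind of step that pulls an average InfoGain down.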

The mechanism: SFT rewards answers, not reasoning paths. Training data consists of question-answer pairs, and the loss function anchors on the correct final output. Models learn the most efficient path to the right answer within the training distribution, often through domain-specific shortcuts, pattern matches, and frequency-weighted heuristics that produce the correct answer without the inferential chain that would justify it. The reasoning in the output becomes post-hoc rationalization.
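The sketch below illustrates why answer-anchored supervision leaves the reasoning path unconstrained. It assumes a per-token loss mask that covers only the answer span; the function `masked_nll`, the mask layout, and the token probabilities are hypothetical and chosen only to show the effect, not taken from any specific training setup.

```python
import math

def masked_nll(target_probs, loss_mask):
    """Mean negative log-likelihood over the positions where loss_mask is 1.

    target_probs[i] is the probability the model assigns to the i-th target token.
    """
    if len(target_probs) != len(loss_mask):
        raise ValueError("target_probs and loss_mask must have the same length")
    losses = [-math.log(p) for p, m in zip(target_probs, loss_mask) if m]
    if not losses:
        raise ValueError("loss_mask selects no tokens")
    return sum(losses) / len(losses)

# Hypothetical sequence: two reasoning tokens followed by one answer token.
target_probs = [0.5, 0.5, 1.0]

# Assumption: answer-only supervision masks out the reasoning positions.
answer_only = masked_nll(target_probs, [0, 0, 1])
# For contrast: every position contributes to the loss.
all_tokens = masked_nll(target_probs, [1, 1, 1])
```

With the answer-only mask the loss is already zero even though the reasoning tokens are assigned only 50% probability, so nothing in the objective pushes the reasoning to improve.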

Why this matters for deployment: High-stakes domains need more than correct answers; they need auditable reasoning. Medical decision support must show clinical logic. Legal AI must demonstrate how conclusions follow from statute and precedent. Financial AI must show how recommendations connect to market data and regulatory context. SFT improves the answer but may make the reasoning path less meaningful: verbose decoration around the correct output rather than the pathway that produced it.

The measurement problem: Standard benchmarks measure what's easy to measure: whether the final answer matches the ground truth. InfoGain and KI require decomposing reasoning chains and evaluating each step against external ground truth — expensive and difficult to automate at scale. So the measurement gap persists, and every organization that deploys based on benchmark accuracy is systematically blind to the reasoning quality regression.
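The gap between the two kinds of measurement can be sketched as follows. The example assumes each reasoning step has already been checked against external ground truth and labelled correct or incorrect, which is the expensive part in practice. The `Trace` type, both scoring functions, and all of the data are hypothetical illustrations rather than the paper's implementation or results.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trace:
    steps_correct: List[bool]  # hypothetical per-step fact-check labels
    answer: str                # the model's final answer
    gold: str                  # the ground-truth answer

def accuracy(traces):
    """What a standard benchmark reports: final-answer match rate."""
    return sum(t.answer == t.gold for t in traces) / len(traces)

def knowledge_index(traces):
    """Assumed KI-style score: mean fraction of factually correct steps per trace."""
    per_trace = [sum(t.steps_correct) / len(t.steps_correct) for t in traces]
    return sum(per_trace) / len(per_trace)

# Made-up traces for illustration only.
before_sft = [Trace([True, True, True, True], "A", "A"),
              Trace([True, True], "B", "C")]
after_sft = [Trace([True, False], "A", "A"),
             Trace([False, False], "C", "C")]

report = {
    "before": (accuracy(before_sft), knowledge_index(before_sft)),
    "after": (accuracy(after_sft), knowledge_index(after_sft)),
}
```

On this toy data accuracy rises from 0.5 to 1.0 while the step-level score falls from 1.0 to 0.25, a regression that an accuracy-only benchmark would never surface.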

The connection: This extends the existing cluster of overthinking findings into the training dimension. "Does extended thinking actually improve reasoning or just increase variance?" covers the inference-time cost. "Does reasoning fine-tuning make models worse at declining to answer?" covers a training-time cost of a different kind (calibration). The SFT accuracy trap is the third entry: a training-time cost to reasoning quality.

Platform: Medium (1000–1400 words). Could lead with the FALM / medical AI deployment angle, then introduce the measurement framework.

Inquiring lines that read this note 98

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Can AI-generated outputs constitute genuine knowledge or valid claims?

Does AI fluency substitute for verifiable accuracy in human judgment?

What makes AI persuasion effective and how can we counter it?

Why do persuasive AI techniques also reduce factual accuracy?

Why do benchmark improvements fail to reflect actual reasoning quality?

Does RLHF training sacrifice accuracy and grounding for user agreement?

How does AI assistance affect human cognitive development and reasoning autonomy?

How does AI assistance differ from search engines in cognitive impact?

How do training data properties shape reasoning capability development?

How can models identify insufficient information and respond appropriately without guessing?

What capability tradeoffs emerge when scaling model reasoning abilities?

How can humans calibrate appropriate trust in AI systems?

Can organized response format trick users into overestimating AI reliability?

When do additional thinking tokens stop improving reasoning performance?

Do tokens beyond a critical threshold actually improve reasoning quality?

How do training priors constrain what context information can override?

Does training on critiques of noisy responses produce deeper understanding than imitating correct ones?

How should inference compute be adaptively allocated based on prompt difficulty?

Can budget-tightening curricula improve reasoning efficiency more than fixed budgets?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

Why does verification consistently lag behind AI generation?

Do base models contain latent reasoning that training can unlock?

How much reasoning catalyst data is actually needed for improvement?

Can ensemble evaluation methods reduce bias more than single judges?

What properties determine whether reward signals teach genuine reasoning?

Can critic model trios evaluate reasoning quality more reliably than outcome rewards alone?

What makes specific clarifying questions more effective than generic ones?

How does reasoning effort affect AI theory of mind performance?

How do we evaluate AI systems when user perception misleads actual performance?

Can AI evaluation match human judgment quality in structured domain tasks?

Does reinforcement learning teach reasoning or just when to reason?

How do knowledge injection methods compare across cost and effectiveness?

What constrains reinforcement learning's ability to expand model reasoning?

What alternatives to RLHF better preserve truth-seeking in AI outputs?

How should models express uncertainty rather than forced confident answers?

What makes a first answer so often the best answer a model produces?

How do evaluation biases undermine LLM quality assessment systems?

Why does automated evaluation consistently overestimate research quality?

How should iterative research systems allocate reasoning per search step?

Does unrestricted reasoning per search step degrade iterative quality over time?

How does latent reasoning compare to verbalized chain-of-thought?

Can we improve reasoning by amplifying information at mutual information peaks?

Why does training format shape reasoning strategy more than domain content?

Why does training data format shape reasoning strategy more than content?

How do adversarial and manipulative prompts attack reasoning models?

Can adversarial critics force genuine reasoning the same way critique fine-tuning does?

How does example difficulty affect learning efficiency in language models?

Why do explicit quality criteria outperform learning quality from examples alone?

How should personalization be implemented to improve AI assistant effectiveness?

Can model confidence signals reliably improve reasoning quality and calibration?

Why do reasoning models fail at systematic problem-solving and search?

What quality filters distinguish useful reasoning enrichment from shallow repetition?

Does decoupling planning from execution improve multi-step reasoning accuracy?

Does decoupling reasoning from tool use actually improve accuracy?

Can prompting inject entirely new knowledge into language models?

Can structured questioning prompts improve reasoning beyond standard conversational training?

What pretraining choices and baseline capability constrain reinforcement learning gains?

Can single-problem fine-tuning match full RL pipeline reasoning gains?

Why do agents confidently report success despite actually failing tasks?

What other agent behaviors besides citations reveal reasoning quality?

How do neural networks separate factual knowledge from reasoning abilities?

How do procedural versus factual knowledge differ in pretraining versus fine-tuning?

Related concepts in this collection 10

Does supervised fine-tuning actually improve reasoning quality? While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
the underlying insight this post dramatizes
Does reasoning fine-tuning make models worse at declining to answer? When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
parallel SFT cost: calibration vs. reasoning quality
Does extended thinking actually improve reasoning or just increase variance? When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.
inference-time version of the same accuracy vs. quality trade-off
Does critiquing errors teach deeper understanding than imitating correct answers? Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.
counter-strategy: CFT addresses the SFT accuracy trap by replacing correct-answer imitation with structured failure analysis as the training objective
Why do better reasoning models ignore instructions? As models develop stronger reasoning abilities through training, they appear to become worse at following specified constraints. Is this an unavoidable trade-off, and what causes it?
a third dimension of SFT/RL training cost: SFT degrades reasoning quality (this note), reasoning training degrades instruction adherence (instruction-following deficit), and both reflect the same pattern — optimizing one capability structurally degrades another
Can language models solve ToM benchmarks without real reasoning? Do current theory-of-mind benchmarks actually measure mental state reasoning, or can models exploit surface patterns and distribution biases to achieve high scores? This matters because it determines whether benchmark performance indicates genuine understanding.
ToM benchmarks are a concrete case of the SFT accuracy trap: SFT achieves competitive ToM scores without reasoning training, suggesting benchmarks reward structural pattern exploitation rather than genuine mental state reasoning
Why does SFT-then-RL training follow a predictable three-phase pattern? When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.
temporal dynamics: CHORD reveals the SFT accuracy trap as the first phase of a three-phase progression; RL can recover from SFT's reasoning degradation but only if SFT and RL are integrated as a continuous spectrum rather than hard-sequenced stages
Does preference optimization harm conversational understanding? Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue.
parallel training-induced degradation: SFT degrades reasoning quality (this note) while RLHF degrades conversational grounding; both demonstrate that optimizing for what benchmarks and raters measure structurally erodes capabilities that require different evaluation frameworks
Can identical outputs hide broken internal representations? Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
FER provides the representational diagnosis: SFT may produce fractured internal representations that yield correct answers through pattern-matching shortcuts while the underlying structure is broken in ways standard benchmarks cannot detect
Does fine-tuning disconnect reasoning steps from final answers? When models are fine-tuned on specific domains, do their chain-of-thought steps become less causally connected to their outputs? Three experiments test whether reasoning chains remain functionally faithful after training.
a second dimension of SFT damage beyond InfoGain: fine-tuning reduces how much reasoning steps causally influence the final answer, making the chain performative rather than functional; together with InfoGain degradation, SFT damages both reasoning quality and reasoning faithfulness
