Does supervised fine-tuning actually improve reasoning on optimization problems?

When SFT boosts benchmark scores on constraint-optimization tasks, does it genuinely improve the model's ability to find feasible solutions, or just its ability to format answers convincingly?

Synthesis note · 2026-05-18 · sourced from Reasoning Architectures

The constraint-optimization study runs a controlled comparison between SFT and RL (with constraint-satisfaction rewards) on the same problem class. The SFT result is the diagnostic of interest: SFT clearly improves the form of the answer (JSON structure, decimal places, valid identifiers, expected sections) without improving the answer's feasibility against the actual physical constraints. The model learns to look like it is solving the problem.

This is the formatting-vs-feasibility gap, and it is a specific instance of a more general SFT failure mode. SFT trains the model to reproduce the surface features of correct demonstrations, and the surface features of a feasible solution are nearly identical to those of a confidently wrong one. SFT minimizes a loss over the visible tokens, not over whether those tokens encode a valid physical state. The result is fluently presented infeasibility.
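To make the gap concrete, here is a minimal sketch that scores the same answer twice: once on surface form, once on feasibility. The toy dispatch problem, the `CAPACITY` and `DEMAND` values, and the `format_ok` and `feasible` helpers are all hypothetical, invented for illustration; they are not taken from the study.

```python
import json

# Hypothetical toy problem (an assumption for illustration, not from the study):
# two generators must jointly meet demand without exceeding their capacities.
CAPACITY = {"g1": 50.0, "g2": 30.0}
DEMAND = 70.0


def format_ok(raw: str) -> bool:
    """Surface check only: valid JSON, expected keys, numeric values."""
    try:
        answer = json.loads(raw)
    except json.JSONDecodeError:
        return False
    dispatch = answer.get("dispatch") if isinstance(answer, dict) else None
    if not isinstance(dispatch, dict) or set(dispatch) != set(CAPACITY):
        return False
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in dispatch.values()
    )


def feasible(raw: str, tol: float = 1e-6) -> bool:
    """Substance check: the values must satisfy the physical constraints."""
    if not format_ok(raw):
        return False
    dispatch = json.loads(raw)["dispatch"]
    within_limits = all(0.0 <= dispatch[g] <= CAPACITY[g] + tol for g in CAPACITY)
    meets_demand = abs(sum(dispatch.values()) - DEMAND) <= tol
    return within_limits and meets_demand


# Two answers with the same surface features.
good = '{"dispatch": {"g1": 45.00, "g2": 25.00}}'
bad = '{"dispatch": {"g1": 60.00, "g2": 10.00}}'  # g1 exceeds its capacity of 50

for name, raw in [("good", good), ("bad", bad)]:
    print(name, "format_ok =", format_ok(raw), "feasible =", feasible(raw))
```

Both answers pass the format check, and only one passes the feasibility check. A token-level loss sees roughly what `format_ok` sees, which is why the two answers are hard for SFT to tell apart.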

RL with feasibility-targeted rewards yields a modest improvement in actual feasibility, because the reward signal directly penalizes the constraint violations that SFT could not see. This is a real but limited gain: it does not break the 55-60% plateau, but it does isolate the kind of failure SFT was leaving uncorrected.
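One way such a reward could be shaped is sketched below. This is an assumption for illustration, not the study's actual reward: the `violation` and `feasibility_reward` functions, the toy capacity and demand values, and the choice to penalize in proportion to total violation are all hypothetical.

```python
def violation(dispatch: dict, capacity: dict, demand: float) -> float:
    """Total constraint violation for a toy dispatch problem (0.0 means feasible)."""
    over = sum(max(0.0, dispatch[g] - capacity[g]) for g in capacity)
    under = sum(max(0.0, -dispatch[g]) for g in capacity)
    imbalance = abs(sum(dispatch[g] for g in capacity) - demand)
    return over + under + imbalance


def feasibility_reward(dispatch: dict, capacity: dict, demand: float,
                       tol: float = 1e-6) -> float:
    """Hypothetical reward: +1 for a feasible answer, otherwise minus the violation."""
    v = violation(dispatch, capacity, demand)
    return 1.0 if v <= tol else -v


capacity = {"g1": 50.0, "g2": 30.0}
demand = 70.0

r_feasible = feasibility_reward({"g1": 45.0, "g2": 25.0}, capacity, demand)
r_infeasible = feasibility_reward({"g1": 60.0, "g2": 10.0}, capacity, demand)
print(r_feasible, r_infeasible)
```

The point of the sketch is only that the signal depends on the constraint check itself, so two equally well-formatted answers receive different rewards.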

The methodological implication for fine-tuning practice: when the desired behavior involves correctness in a dimension the loss does not measure, SFT improvements should be treated with suspicion. A clean rise in benchmark score where the benchmark scores presentation rather than substance can simply mean the model has gotten better at looking right.
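In practice this suggests reporting presentation and substance as separate numbers. The sketch below assumes each evaluated answer has already been reduced to a pair of booleans, `(format_ok, feasible)`; the `split_scores` helper and the `before` and `after` lists are made-up placeholders, not measured results.

```python
def split_scores(results: list[tuple[bool, bool]]) -> dict[str, float]:
    """Report format rate and feasibility rate separately.

    Each item is (format_ok, feasible). An answer only counts as feasible
    if it is also well formed.
    """
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    format_rate = sum(1 for fmt, _ in results if fmt) / n
    feasible_rate = sum(1 for fmt, ok in results if fmt and ok) / n
    return {
        "format_rate": format_rate,
        "feasible_rate": feasible_rate,
        "gap": format_rate - feasible_rate,
    }


# Synthetic placeholder data (NOT results from the study): a run where
# formatting improves after fine-tuning while feasibility stays flat.
before = [(True, True), (False, False), (False, False), (True, False)]
after = [(True, True), (True, False), (True, False), (True, False)]

print("before:", split_scores(before))
print("after: ", split_scores(after))
```

A widening `gap` alongside a flat `feasible_rate` is the pattern the note warns about: the score that tracks presentation rises while the score that tracks substance does not.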

Inquiring lines that read this note 25

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Can alternative training methods improve on supervised fine-tuning for language models?

Why does DPO outperform SFT specifically for function calling tasks?

Which computational strategies best support reasoning in language models?

Can closed-form solutions compete with gradient descent optimization?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

How do training data properties shape reasoning capability development?

Why does self-revision increase model confidence while degrading accuracy?

Why does most refinement in iterative models maintain answers rather than improve them?

Do autonomous architecture discoveries follow predictable scaling laws?

How does Goodhart's Law apply when safety measures become optimization targets?

How do evaluation mechanisms prevent error accumulation in autonomous research systems?

Why do production teams choose expensive frontier models over fine-tuning?

Can prompting inject entirely new knowledge into language models?

What knowledge can prompt optimization actually activate in trained models?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

What capability tradeoffs emerge when scaling model reasoning abilities?

Why do reasoning models fail to improve constrained optimization performance?

What constrains reinforcement learning's ability to expand model reasoning?

How does Supervised RL bridge the gap between SFT and RLVR?

Can model confidence signals reliably improve reasoning quality and calibration?

Why does convergence stability sometimes mislead about reasoning correctness?

Does decoupling planning from execution improve multi-step reasoning accuracy?

Why should decomposition be diagnosed and fixed separately from solving?

