Does supervised fine-tuning actually improve reasoning on optimization problems?
When SFT boosts benchmark scores on constraint-optimization tasks, does it genuinely improve the model's ability to find feasible solutions, or just its ability to format answers convincingly?
The constraint-optimization study runs a controlled comparison between SFT and RL (with constraint-satisfaction rewards) on the same problem class. The SFT result is the diagnostic of interest: SFT clearly improves the form of the answer (JSON structure, decimal places, valid identifiers, expected sections) without improving its feasibility against the actual physical constraints. The model learns to look like it is solving the problem.
This is the formatting-vs-feasibility gap, and it is a specific instance of a more general SFT failure mode. SFT trains the model to reproduce the surface features of correct demonstrations, but the surface features of a feasible solution and those of a confidently wrong solution are nearly identical. Because SFT optimizes the loss on the visible tokens, not on whether those tokens encode a valid physical state, the result is fluently presented infeasibility.
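To make the gap concrete, here is a minimal sketch of two separate checks on a model answer: one for surface form and one for feasibility. The toy dispatch problem, the constants `DEMAND_MW` and `CAPACITY_MW`, and the function names are all hypothetical and are not taken from the study; they only illustrate how two answers can be indistinguishable in form while differing in feasibility.

```python
import json

# Hypothetical toy problem (not from the study): two generators must
# jointly meet demand without exceeding their capacity limits.
DEMAND_MW = 100.0
CAPACITY_MW = {"gen_a": 60.0, "gen_b": 50.0}


def is_well_formed(answer_text: str) -> bool:
    """Surface check only: valid JSON, expected keys, numeric values."""
    try:
        answer = json.loads(answer_text)
    except json.JSONDecodeError:
        return False
    dispatch = answer.get("dispatch") if isinstance(answer, dict) else None
    if not isinstance(dispatch, dict) or set(dispatch) != set(CAPACITY_MW):
        return False
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in dispatch.values()
    )


def is_feasible(answer_text: str, tol: float = 1e-6) -> bool:
    """Substance check: the dispatch must satisfy the physical constraints."""
    if not is_well_formed(answer_text):
        return False
    dispatch = json.loads(answer_text)["dispatch"]
    within_limits = all(
        0.0 <= dispatch[g] <= CAPACITY_MW[g] + tol for g in CAPACITY_MW
    )
    meets_demand = abs(sum(dispatch.values()) - DEMAND_MW) <= tol
    return within_limits and meets_demand


# Both answers share the same structure, keys, and decimal formatting.
feasible_answer = '{"dispatch": {"gen_a": 55.00, "gen_b": 45.00}}'
infeasible_answer = '{"dispatch": {"gen_a": 75.00, "gen_b": 25.00}}'  # gen_a over capacity

for text in (feasible_answer, infeasible_answer):
    print(is_well_formed(text), is_feasible(text))
```

Both answers pass the form check, and only the first passes the feasibility check. A token-level loss over demonstrations sees the first kind of property far more directly than the second.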
RL with feasibility-targeted rewards improves actual feasibility modestly, because the reward signal directly penalizes the constraint violations that SFT could not see. The gain is real but limited: it does not break the 55-60% plateau, but it does identify which kind of failure SFT was leaving uncorrected.
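As an illustration of what a feasibility-targeted reward could look like, the sketch below scores an answer by its total constraint violation. The note does not describe the study's actual reward, so the reward shape, the `penalty_per_unit` parameter, and the toy problem constants are assumptions made for this example only.

```python
from typing import Mapping

# Hypothetical toy problem (not from the study).
CAPACITY = {"gen_a": 60.0, "gen_b": 50.0}
DEMAND = 100.0


def total_violation(
    dispatch: Mapping[str, float],
    capacity: Mapping[str, float],
    demand: float,
) -> float:
    """Sum of capacity overshoot, negative output, and demand imbalance."""
    over_capacity = sum(max(0.0, dispatch[g] - capacity[g]) for g in capacity)
    below_zero = sum(max(0.0, -dispatch[g]) for g in capacity)
    imbalance = abs(sum(dispatch[g] for g in capacity) - demand)
    return over_capacity + below_zero + imbalance


def feasibility_reward(
    dispatch: Mapping[str, float],
    capacity: Mapping[str, float],
    demand: float,
    penalty_per_unit: float = 0.01,  # assumed scale, chosen for illustration
    tol: float = 1e-6,
) -> float:
    """Return 1.0 for a feasible answer, else a penalty clipped at -1.0."""
    violation = total_violation(dispatch, capacity, demand)
    if violation <= tol:
        return 1.0
    return max(-1.0, -penalty_per_unit * violation)


print(feasibility_reward({"gen_a": 55.0, "gen_b": 45.0}, CAPACITY, DEMAND))
print(feasibility_reward({"gen_a": 75.0, "gen_b": 25.0}, CAPACITY, DEMAND))
```

Unlike a token-level imitation loss, this signal changes when a constraint is violated even if the answer's formatting is unchanged, which is the property the paragraph above attributes to the RL setup.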
The methodological implication for fine-tuning practice: when the desired behavior involves correctness in a dimension the loss does not measure, SFT improvements should be treated with suspicion. If the benchmark scores presentation rather than substance, a clean rise in benchmark score can simply mean the model has gotten better at looking right.
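One practical way to act on this is to report form and substance as separate rates instead of a single score. The sketch below assumes each answer has already been graded on both dimensions; `GradedAnswer` and `summarize` are hypothetical names, and the sample inputs are made up for illustration and are not results from the study.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class GradedAnswer:
    well_formed: bool  # passes the presentation check
    feasible: bool     # satisfies the actual constraints


def summarize(graded: Sequence[GradedAnswer]) -> Dict[str, float]:
    """Report format rate, feasibility rate, and the gap between them."""
    if not graded:
        raise ValueError("need at least one graded answer")
    n = len(graded)
    format_rate = sum(g.well_formed for g in graded) / n
    # Count an answer as feasible only if it could be parsed at all.
    feasibility_rate = sum(g.well_formed and g.feasible for g in graded) / n
    return {
        "format_rate": format_rate,
        "feasibility_rate": feasibility_rate,
        "gap": format_rate - feasibility_rate,
    }


# Made-up inputs for illustration only; these are not study data.
before_sft = [
    GradedAnswer(True, True),
    GradedAnswer(True, False),
    GradedAnswer(False, False),
    GradedAnswer(False, False),
]
after_sft = [
    GradedAnswer(True, True),
    GradedAnswer(True, False),
    GradedAnswer(True, False),
    GradedAnswer(True, False),
]

print(summarize(before_sft))
print(summarize(after_sft))
```

In this made-up case the format rate doubles while the feasibility rate does not move, so the widening gap is the number to watch. A benchmark that only reports the first rate would record the change as a clean improvement.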
Inquiring lines that use this note as a source (22)
This note is a source for these synthesized inquiries.
- Why does DPO outperform SFT specifically for function calling tasks?
- Can closed-form solutions compete with gradient descent optimization?
- How does non-reasoning SFT prevent overfitting before RL training begins?
- How does critique fine-tuning on one problem unlock broader reasoning?
- Why does most refinement in iterative models maintain answers rather than improve them?
- Does supervised fine-tuning improve accuracy while damaging the quality of reasoning?
- Why does fine-tuning improve some capabilities while degrading others?
- How does Goodhart's Law apply when safety measures become optimization targets?
- Why do production teams choose expensive frontier models over fine-tuning?
- What knowledge can prompt optimization actually activate in trained models?
- Why does KTO skip supervised fine-tuning while DPO cannot?
- Does fine-tuning actually change model capabilities or only output distribution?
- Why does SFT reduce reasoning quality even when improving domain accuracy?
- How does task-oriented fine-tuning compare to preference tuning methods?
- Does SFT degrade reasoning quality while improving domain accuracy?
- How does data quality mismatch create reasoning degradation in supervised fine-tuning?
- Why do reasoning models fail to improve constrained optimization performance?
- How does Supervised RL bridge the gap between SFT and RLVR?
- Does preference tuning help or hurt the exploration of solution spaces in code?
- Does fine-tuning a small model match fine-tuning a large one?
- Why does SFT fail when expert demonstrations are too long for small models?
- Why does parameter-efficient tuning scaling fail to improve finetuning performance?
Related concepts in this collection (4)
- Do large language models actually perform iterative optimization?
  Explores whether LLMs execute genuine numerical procedures like Newton-Raphson or instead pattern-match to memorized solution templates when solving constrained optimization problems.
  (same paper, the underlying shortcut SFT reinforces)
- Do fine-tuned language models actually learn optimization procedures?
  Can RL fine-tuning teach LLMs to solve constraint-optimization problems through genuine reasoning, or does it merely sharpen pattern-matching? Testing on out-of-distribution variants reveals the mechanism.
  (same paper, the OOD diagnostic)
- Do larger language models solve constrained optimization better?
  Explores whether scaling LLMs (through more parameters, better training, or reasoning extensions) improves their ability to satisfy constraints in real optimization problems like power grids and portfolios.
  (same paper, the parent ceiling)
- What do models actually learn from chain-of-thought training?
  When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.
  (adjacent: form-over-content in CoT training)
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
- Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
- Learning to Reason for Factuality
- Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance
- Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
- Foundations of Large Language Models
- OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
Original note title
SFT improves response formatting but not physical feasibility — formatting wins mask reasoning shortcuts