INQUIRING LINE


Training an AI on correct answers can quietly degrade the quality of its reasoning, even as its test scores rise.

How does preference learning differ from supervised finetuning for reasoning?

This explores why training a model to imitate correct answers (supervised finetuning) and training it against ranked or rewarded alternatives (preference learning) produce different reasoning behavior — not just different scores.

Supervised finetuning (SFT) and preference learning diverge specifically on *reasoning* quality, rather than on whether the final answer is right. The short version the corpus keeps circling: SFT teaches a model what answer to produce, while preference learning teaches it which way of getting there is better, and those turn out to be very different lessons.
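The difference in training signal can be made concrete with two loss functions. This is a minimal sketch, not the method of any paper cited here: it assumes plain negative log-likelihood for SFT and a DPO-style pairwise loss as one representative preference objective. The function names and the `beta` value are illustrative assumptions.

```python
import math

def sft_loss(gold_token_logprobs):
    """Negative log-likelihood of a single gold trace.

    The only signal is "make this exact trace more likely";
    no alternative trace enters the loss.
    """
    return -sum(gold_token_logprobs)

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style pairwise loss on summed sequence log-probabilities.

    The signal is comparative: the loss falls when the policy raises the
    chosen trace relative to the rejected one, measured against a frozen
    reference model. beta=0.1 is an assumed, illustrative value.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# SFT sees one trace; the preference loss always sees a pair.
imitation = sft_loss([-0.1, -0.2])
comparison = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

Note that `sft_loss` has no argument for a rejected trace at all, which is the point of the contrast: imitation cannot express "this path is better than that one".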

The sharpest evidence is what SFT quietly breaks. One study finds that supervised finetuning raises benchmark accuracy while *cutting* the quality of the reasoning steps, measured as Information Gain, by 38.9%: the model learns to produce correct-looking answers through post-hoc rationalization rather than genuine inference, and standard metrics never catch it because they only check final-answer correctness (Does supervised fine-tuning improve reasoning or just answers?). A related finding shows that fine-tuning on labeled examples teaches surface patterns rather than principled criteria: models fed labeled 'good arguments' learn what good arguments look like, not what makes them good, and fail to generalize to new argument types (Can models learn argument quality from labeled examples alone?). Imitation copies the form of reasoning without the function.

Preference and reward-based learning attack this from the other side: instead of one gold trace to imitate, they compare traces against each other and reward the better one. RLAG (RL from Augmented Generation) rewards both answer accuracy *and* explanation rationality, internalizing coherent knowledge structures in a way that beats SFT precisely because it prioritizes reasoning quality over token-level correctness (Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?). You don't even need humans to do the ranking: in RLSF, model confidence in its own answer span ranks reasoning traces into synthetic preferences that strengthen step-by-step reasoning (Can model confidence work as a reward signal for reasoning?), and in SAMI a model is aligned to written principles by increasing the mutual information between principle and response, with no preference labels at all (Can models learn behavioral principles without preference labels?).
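The confidence-ranking idea can be sketched in a few lines. This is an illustration under stated assumptions, not the published RLSF implementation: `Trace`, `answer_confidence`, `synthetic_preferences`, and the `min_gap` threshold are hypothetical names, and mean token probability over the answer span is an assumed confidence measure.

```python
import math
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Trace:
    text: str                     # full reasoning trace plus final answer
    answer_logprobs: list[float]  # log-probs of the answer-span tokens only

def answer_confidence(trace):
    """Mean token probability over the answer span (assumed measure)."""
    probs = [math.exp(lp) for lp in trace.answer_logprobs]
    return sum(probs) / len(probs)

def synthetic_preferences(traces, min_gap=0.05):
    """Turn sampled traces into (chosen, rejected) text pairs.

    Pairs whose confidence gap is below min_gap are skipped, since a
    near-tie carries little ranking information.
    """
    pairs = []
    for a, b in combinations(traces, 2):
        conf_a, conf_b = answer_confidence(a), answer_confidence(b)
        if abs(conf_a - conf_b) < min_gap:
            continue
        chosen, rejected = (a, b) if conf_a > conf_b else (b, a)
        pairs.append((chosen.text, rejected.text))
    return pairs

samples = [
    Trace("trace A", [math.log(0.90)]),
    Trace("trace B", [math.log(0.50)]),
    Trace("trace C", [math.log(0.88)]),
]
pairs = synthetic_preferences(samples)
```

In this toy input, A and C are too close to rank against each other, so only the pairs involving the low-confidence trace B survive. The resulting pairs could then feed any pairwise preference objective.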

But the cleaner framing may be that this isn't really an either/or. A state-of-the-art open-model reasoning result (Eurus) came from preference *trees*, a data structure that holds diverse solution chains, critique-and-revision trajectories, and pairwise comparisons all at once, feeding both SFT and preference learning from the same source (What alignment data structure best trains reasoning generalists?). SFT gives the model a competent starting distribution; preference learning then sculpts *which* of its reasoning modes to favor. That maps onto a deeper claim: base models already contain latent reasoning ability, and post-training mostly *selects* rather than *creates* it (Do base models already contain hidden reasoning ability?). If reasoning is being elicited rather than installed, then preference learning's comparative signal is a more precise selection tool than imitation.
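One way to picture a preference tree is as a per-instruction tree of attempts, where the same structure yields both imitation targets and pairwise comparisons. The sketch below is an assumed, simplified layout, not the actual ULTRAINTERACT schema: `Node`, `PreferenceTree`, and both extraction functions are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    action: str          # one attempt or revision
    correct: bool        # whether this attempt was judged correct
    critique: str = ""   # feedback that prompted the child revisions
    children: list["Node"] = field(default_factory=list)

@dataclass
class PreferenceTree:
    instruction: str
    roots: list[Node] = field(default_factory=list)

def _walk(nodes, prefix):
    """Yield (path_so_far, sibling_group) for every level of the tree."""
    yield prefix, nodes
    for node in nodes:
        yield from _walk(node.children, prefix + [node.action])

def sft_examples(tree):
    """Each path that ends at a correct node is an imitation target."""
    return [
        (tree.instruction, prefix + [node.action])
        for prefix, siblings in _walk(tree.roots, [])
        for node in siblings
        if node.correct
    ]

def preference_pairs(tree):
    """Siblings share a prefix, so correct vs. incorrect siblings form pairs."""
    pairs = []
    for prefix, siblings in _walk(tree.roots, []):
        good = [n.action for n in siblings if n.correct]
        bad = [n.action for n in siblings if not n.correct]
        pairs.extend((tree.instruction, prefix, g, b) for g in good for b in bad)
    return pairs

tree = PreferenceTree("q", [
    Node("W1", False, critique="off by one", children=[
        Node("R1", True),
        Node("R2", False),
    ]),
    Node("G1", True),
])
```

Here one tree produces two imitation targets (the direct solution and the critique-then-revision path) and two preference pairs (one at each level), which is the sense in which a single source feeds both training stages.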

The cautionary note for both: neither method reliably installs a *procedure*. RL-tuned models, including those trained with GRPO, still drop sharply on out-of-distribution variants, suggesting they sharpen template-matching and memorization rather than genuine problem-solving (Do fine-tuned language models actually learn optimization procedures?). And preference learning is only as good as its rankings: annotation data mixes genuine preferences, non-attitudes, and constructed-on-the-spot judgments, and treating them as one signal contaminates the reward model (Do all annotation responses measure the same underlying thing?). So the real difference isn't 'preference learning reasons and SFT memorizes'. It is that preference learning gives you a knob for *what to prefer*, and the quality of your reasoning is now bottlenecked on whether you actually know what good reasoning looks like.
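The annotation point suggests a simple hygiene step before reward-model training: keep only labels that stay stable when the same comparison is measured under different conditions. The sketch below is an assumption-laden illustration of that idea, not a method from the cited note; `consistent_preferences`, the tuple layout, and the `min_agreement` threshold are hypothetical.

```python
from collections import defaultdict

def consistent_preferences(annotations, min_agreement=1.0):
    """Keep items whose label agrees across measurement conditions.

    annotations: iterable of (item_id, condition, label) tuples, where
    condition might be e.g. presentation order or question wording.
    Returns {item_id: majority_label} for items that pass the threshold.
    """
    labels_by_item = defaultdict(list)
    for item_id, _condition, label in annotations:
        labels_by_item[item_id].append(label)

    kept = {}
    for item_id, labels in labels_by_item.items():
        if len(labels) < 2:
            continue  # a single measurement cannot show consistency
        top = max(sorted(set(labels)), key=labels.count)
        if labels.count(top) / len(labels) >= min_agreement:
            kept[item_id] = top
    return kept

raw = [
    ("p1", "original", "A"), ("p1", "swapped", "A"),  # stable judgment
    ("p2", "original", "A"), ("p2", "swapped", "B"),  # flips with framing
    ("p3", "original", "B"),                          # measured only once
]
stable = consistent_preferences(raw)
```

Only `p1` survives in this toy input. A stricter or looser filter is a design choice: dropping unstable labels shrinks the dataset, while keeping them feeds the reward model a mix of signals.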

Sources (9 notes)

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can models learn behavioral principles without preference labels?

SAMI finetunes language models to increase mutual information between constitutions and responses without preference labels or demonstrations. A Mistral-7B model trained this way outperformed base and instruction-tuned baselines, and surprisingly, a weaker model could write principles to align a stronger one.


What alignment data structure best trains reasoning generalists?

Eurus achieved state-of-the-art open-model reasoning by training on ULTRAINTERACT, an alignment dataset structured as preference trees per instruction. The tree format unified diverse planning strategies, interaction-and-critique trajectories, and pairwise data for both SFT and preference learning.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Eliciting Reasoning in Language Models with Cognitive Tools · 2.61 match · arXiv
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining · 2.58 match · arXiv
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! · 1.74 match · arXiv
Train Long, Think Short: Curriculum Learning for Efficient Reasoning · 1.72 match · arXiv
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models · 1.72 match · arXiv
A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well? · 1.72 match · arXiv
Post-Completion Learning for Language Models · 1.65 match · arXiv
Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment · 1.63 match · arXiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-post-training researcher. The question remains open: does preference learning elicit or install reasoning capability differently than supervised finetuning, and can either method reliably teach procedural problem-solving rather than template-matching?

What a curated library found — and when (dated claims, not current truth): Findings span 2024–2026 and include:
• SFT raises benchmark accuracy while cutting reasoning-step quality by ~39%, teaching post-hoc rationalization instead of genuine inference (2024).
• Preference learning and RL-tuning reward both answer accuracy and explanation coherence, internalizing knowledge structures better than SFT (2024–2025).
• Model confidence in answer spans can rank reasoning traces into synthetic preferences without human labels (2024).
• Preference trees (holding diverse chains, critique trajectories, and pairwise comparisons) outperform single-signal approaches by feeding both SFT and preference learning from one source (2024).
• RL-fine-tuned models (including GRPO) still drop sharply on out-of-distribution variants, suggesting they sharpen memorization rather than genuine problem-solving (2025).

Anchor papers (verify; mind their dates):
• arXiv:2404.02078 (2024) — Preference Trees
• arXiv:2509.20162 (2025) — RL from Augmented Generation
• arXiv:2504.07912 (2025) — Echo Chamber (RL amplifies pretraining behaviors)
• arXiv:2604.03238 (2026) — Measuring Human Preferences as Social Science

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, determine whether advances in model scale, training methods (DPO variants, process reward models, multi-turn ranking), inference-time search (chain-of-thought sampling, look-ahead), or evaluation harnesses (OOD suites, mechanistic probes) have since relaxed the 39% reasoning-quality gap or the memorization ceiling. Distinguish: does preference learning still select latent reasoning, or has it begun to *install* new reasoning procedures? Where do both methods still fail?
(2) Surface the strongest contradicting or superseding work from the last ~6 months — especially any claiming preference learning *does* teach procedure, or that SFT+preference is genuinely equivalent to procedure learning, or that reasoning quality and benchmark accuracy are no longer decoupled.
(3) Propose 2 research questions that assume the regime may have moved: (a) If preference learning now reliably teaches procedure (not just selection), what signature would we see in internal representations or out-of-distribution generalization? (b) If annotation heterogeneity (genuine preference vs. constructed judgment) still contaminates reward models, can meta-learning the annotation source improve downstream reasoning?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
