INQUIRING LINE

Can fine-tuning ever teach semantic inference instead of amplifying training shortcuts?

This explores whether fine-tuning can actually teach a model to reason about meaning — or whether it mostly sharpens the surface shortcuts a model already leans on (word frequency, output format, answer-matching), and what alternative training setups break that pattern.


The corpus is sobering on the default case and more hopeful on the alternatives. The clearest indictment is that fine-tuning on natural language inference makes models lean *harder* on a frequency trick, preferring whichever word appears more often in the corpus, rather than learning what actually entails what; the giveaway is that they get worse on adversarial cases where frequency and truth disagree (Does fine-tuning on NLI teach inference or amplify shortcuts?). The same pattern recurs across very different setups. Standard supervised fine-tuning raises benchmark accuracy while cutting the actual inferential content of the reasoning, measured as Information Gain, by nearly 39%, so the model arrives at right answers through post-hoc rationalization (Does supervised fine-tuning improve reasoning or just answers?). Even RL fine-tuning often just sharpens template-matching: performance drops sharply on slightly out-of-distribution variants of the same problem (Do fine-tuned language models actually learn optimization procedures?).

The most unsettling result in the collection suggests the shortcut runs deeper than we'd guess: instruction tuning works almost as well when you train on *semantically empty or deliberately wrong* instructions as on correct ones (43% vs. a 42.6% random baseline). What transfers isn't task understanding; it's familiarity with the shape of the output (Does instruction tuning teach task understanding or output format?). A related line shows fine-tuning actively loosens the causal link between a model's reasoning steps and its answer: cut the chain short, paraphrase it, or stuff it with filler, and the answer barely changes. The reasoning has become performance, not function (Does fine-tuning disconnect reasoning steps from final answers?).

So what flips it? The corpus keeps pointing to the same lever: train on the *quality of the inference*, not just the correctness of the token. Reinforcement learning from augmented generation (RLAG) rewards explanation rationality alongside the answer, cycling between seeing and not seeing the source until coherent knowledge structures internalize, and it beats SFT precisely because it prioritizes reasoning quality over token-level correctness (Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?). DPO does something similar by feeding explicit *wrong* examples, which directly targets the failure modes plain SFT papers over (Can small models match large models on function calling?). And RLVR appears to work by adjusting only the ~20% of high-entropy 'forking' tokens where a real decision happens, which is evidence that genuine reasoning improvement lives in a specific, identifiable signal rather than in blanket imitation (Do high-entropy tokens drive reasoning model improvements?).

There's also a structural escape hatch worth knowing about. Some methods avoid corrupting the model's knowledge at all: proxy-tuning steers behavior at decoding time and preserves pretrained knowledge far better, because direct fine-tuning damages the lower layers where facts are stored (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). And Quiet-STaR teaches a model to generate rationales at every token during pretraining on ordinary internet text, letting reasoning competence emerge as a *byproduct* of better language modeling rather than from a labeled inference dataset (Can models learn reasoning from predicting any text?).

The deeper lesson the collection leaves you with: 'shortcut vs. inference' isn't really about fine-tuning yes-or-no; it's about what your reward measures. When training grades only the final answer, models learn the cheapest route to that answer, which is almost always a surface correlate. When training grades the *reasoning* (through verifiable explanation, negative examples, or entropy-targeted signals), semantic inference becomes the thing being selected for. Two adjacent findings sharpen the boundary: argument-quality judgment won't transfer from labeled examples alone but does improve when you hand the model an explicit theoretical framework (Can models learn argument quality from labeled examples alone?), and prompting can only reorganize knowledge already present, never inject what's missing (Can prompt optimization teach models knowledge they lack?). That is a reminder that some of what looks like 'failure to learn inference' is really a ceiling set long before fine-tuning began.


Sources (12 notes)

Does fine-tuning on NLI teach inference or amplify shortcuts?

NLI fine-tuning increases LLM reliance on corpus-level frequency patterns (hypernyms more common than hyponyms) rather than semantic relationships. Models perform worse on adversarial cases where frequency patterns contradict actual entailment labels, showing the shortcut was learned more deeply.
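
One way to picture the adversarial test described in this note is to split an evaluation set by whether a pure frequency rule would get the label right. The sketch below is illustrative only: the word pairs, counts, field names, and the rule "guess entailment when the hypothesis word is more frequent" are assumptions made for the example, not the evaluation protocol of the underlying study.

```python
from collections import Counter

# Hypothetical corpus counts; real counts would come from the pretraining corpus.
freq = Counter({"animal": 100, "dog": 40, "poodle": 5, "canine": 2})

# Toy lexical-entailment pairs: does "X is a <premise_word>" entail "X is a <hypothesis_word>"?
examples = [
    {"premise_word": "poodle", "hypothesis_word": "dog", "label": "entailment"},
    {"premise_word": "animal", "hypothesis_word": "dog", "label": "non-entailment"},
    {"premise_word": "dog", "hypothesis_word": "canine", "label": "entailment"},
    {"premise_word": "canine", "hypothesis_word": "dog", "label": "non-entailment"},
]

def frequency_heuristic(premise_word, hypothesis_word, freq):
    """Assumed shortcut: the more frequent word is treated as the hypernym."""
    if freq[hypothesis_word] > freq[premise_word]:
        return "entailment"
    return "non-entailment"

def split_by_frequency_agreement(examples, freq):
    """Separate cases where the shortcut matches the gold label from cases where it does not."""
    consistent, adversarial = [], []
    for ex in examples:
        guess = frequency_heuristic(ex["premise_word"], ex["hypothesis_word"], freq)
        (consistent if guess == ex["label"] else adversarial).append(ex)
    return consistent, adversarial

def accuracy(subset, predict):
    """Accuracy of a prediction function on a subset of examples."""
    return sum(predict(ex) == ex["label"] for ex in subset) / len(subset)

consistent, adversarial = split_by_frequency_agreement(examples, freq)
```

Scoring a model before and after fine-tuning on each subset separately would show the pattern the note reports: a widening gap between the consistent and adversarial subsets suggests the shortcut has been reinforced.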

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% vs. a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
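
The perturbation tests in this note can be sketched as an invariance measurement: perturb the reasoning chain, re-derive the answer, and count how often nothing changes. The code below is a minimal sketch with hypothetical names; `answer_fn` stands in for a model call that maps a question plus a (possibly perturbed) chain to a final answer, and the paraphrase test is omitted because it needs a paraphrasing model.

```python
def truncate_chain(steps, keep_fraction=0.5):
    """Early termination: keep only the first part of the reasoning chain."""
    keep = max(1, int(len(steps) * keep_fraction))
    return steps[:keep]

def filler_chain(steps, filler="..."):
    """Filler substitution: replace every step with content-free text of the same count."""
    return [filler for _ in steps]

def invariance_rate(examples, answer_fn, perturb):
    """Fraction of examples whose final answer is unchanged after perturbing the chain."""
    unchanged = 0
    for question, steps in examples:
        original = answer_fn(question, steps)
        perturbed = answer_fn(question, perturb(steps))
        unchanged += int(original == perturbed)
    return unchanged / len(examples)

# Toy stand-ins for two kinds of model behavior (assumptions for illustration).
examples = [
    ("2+3*4", ["3*4=12", "2+12=14", "14"]),
    ("10-4", ["10-4=6", "6"]),
]
memorized = {"2+3*4": "14", "10-4": "6"}

def chain_dependent(question, steps):
    """Answer is read off the last reasoning step, so the chain matters."""
    return steps[-1]

def chain_ignoring(question, steps):
    """Answer comes from the question alone, so the chain is decorative."""
    return memorized[question]
```

A higher invariance rate after fine-tuning than before, on the same examples, is the signal the note describes: the chain is influencing the answer less.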

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
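
To make the role of the negative examples concrete, here is a sketch of the standard per-pair DPO objective, which compares how much the policy and a frozen reference model prefer the correct output over the incorrect one. The function name, argument names, and the `beta` default are choices made for this illustration; the inputs are assumed to be summed sequence log-probabilities, and nothing here reflects the specific training setup of the function-calling study.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The rejected term is what plain SFT lacks: the loss falls not only when the correct call becomes more likely but also when the incorrect call becomes less likely relative to the reference.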

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
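
The idea of training only on forking tokens can be sketched as a mask over token positions ranked by predictive entropy. This is an illustrative sketch, not the procedure from the cited work: the function names are hypothetical, the 20% cut is taken from the note, and `distributions` is assumed to hold one next-token probability distribution per position.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_token_mask(distributions, top_fraction=0.2):
    """Mark the top_fraction of positions with the highest entropy."""
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(len(entropies) * top_fraction))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    selected = set(ranked[:k])
    return [i in selected for i in range(len(entropies))]

def masked_mean(per_token_losses, mask):
    """Average the loss over selected positions only; other positions contribute nothing."""
    kept = [loss for loss, keep in zip(per_token_losses, mask) if keep]
    return sum(kept) / len(kept)
```

In a real training loop the mask would gate the policy-gradient term per token, so that low-entropy positions, where the model is already nearly certain, receive no update.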

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
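
The decoding-time shift can be sketched as simple logit arithmetic: the large base model's logits are offset by the difference between a small tuned model and its untuned counterpart, and the base weights are never touched. The sketch below assumes all three models share a vocabulary and uses hypothetical names; it shows a single decoding step on plain Python lists rather than a full generation loop.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_distribution(base_logits, expert_logits, antiexpert_logits):
    """Shift base logits by (small tuned model - small untuned model), then normalize."""
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)
```

When the small tuned and untuned models agree on a token, the offset is zero and the base model's prediction passes through unchanged, which is one intuition for why stored knowledge survives better than under direct weight updates.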

Can models learn reasoning from predicting any text?

Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability analyst. The question remains open: **Can fine-tuning teach semantic inference instead of amplifying training shortcuts?** This matters because if standard fine-tuning just locks in surface correlations, we need to know which methods (if any) actually internalize reasoning.

**What a curated library found — and when (dated claims, not current truth):**
These findings span 2023–2025. They describe a sobering baseline and some emerging alternatives:

- Standard supervised fine-tuning raises benchmark accuracy while cutting the inferential content of the reasoning (Information Gain) by ~39%; models arrive at correct answers via post-hoc rationalization rather than genuine inferential steps (~2024–2025).
- Fine-tuning on NLI amplifies frequency bias: models prefer the word that appears more often in the corpus over what actually entails what; performance drops on adversarial cases where frequency contradicts truth (~2025).
- Instruction tuning transfers nearly equally well (43% vs. a 42.6% random baseline) from *semantically empty or deliberately wrong* instructions, suggesting the model learns the output-format distribution, not task understanding (~2023).
- RL from augmented generation (reward on explanation rationality + correctness) and DPO (explicit negative examples) outperform SFT; high-entropy minority tokens (~20%) appear to be where real reasoning decisions fork (~2024–2025).
- Proxy-tuning at decoding time and Quiet-STaR (rationale generation during pretraining on raw text) both preserve or build reasoning without corrupting lower-layer knowledge (~2024).

**Anchor papers (verify; mind their dates):**
- arXiv:2305.11383 (2023) — Do Models Really Learn to Follow Instructions?
- arXiv:2403.09629 (2024) — Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
- arXiv:2509.20162 (2025) — Embedding Domain Knowledge via RL from Augmented Generation
- arXiv:2506.01939 (2025) — High-Entropy Minority Tokens Drive Effective RL

**Your task:**

(1) **RE-TEST EACH CONSTRAINT.** For each finding above, assess whether newer models (GPT-4o, Claude 3.5, o1-series reasoning models), training methods (test-time scaling, process reward models, tool use), or evaluation harnesses have since relaxed or overturned it. Separate the durable question ("do we still see shortcut amplification in standard SFT?") from the perishable limitation ("can RL + auxiliary rewards fix it?"). Cite what resolved each, and flag where constraints still hold.

(2) **SURFACE STRONGEST CONTRADICTING WORK.** Hunt the last ~6 months for papers showing that standard fine-tuning *does* teach genuine inference, that shortcut amplification is an artifact of weak evaluation, or that newer scaling/orchestration (multi-agent, memory, tool orchestration) dissolves the boundary between shortcuts and reasoning.

(3) **PROPOSE 2 FRONTIER QUESTIONS** that assume the regime may have moved: (a) If high-entropy token steering works, can we isolate a minimal fine-tuning signal (e.g., gradient masking on low-entropy tokens) that preserves reasoning without RL overhead? (b) Do process reward models plus chain-of-thought verification solve the post-hoc rationalization problem, or do they themselves learn shallow metrics?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
