INQUIRING LINE

Teaching an AI to reason better can secretly destroy the signal that tells it when to look something up.

Why does reasoning fine-tuning suppress the confidence signals that adaptive retrieval needs?

This explores a hidden conflict between two training goals: fine-tuning a model to reason or answer well tends to wreck the calibrated uncertainty signal that 'retrieve only when unsure' systems depend on to decide when to look things up.

This question sits at the collision point of two ideas the corpus treats separately. On one side, the most efficient adaptive-retrieval systems don't use elaborate heuristics at all. They read the model's own calibrated token-probability uncertainty and retrieve when confidence dips, which beats multi-call methods on single-hop tasks and matches them on multi-hop tasks with a fraction of the LM and retriever calls (see: Can simple uncertainty estimates beat complex adaptive retrieval?). The whole approach rests on one assumption: the model's confidence is an honest readout of whether it actually knows. On the other side, fine-tuning quietly violates that assumption.
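As a rough illustration of the "retrieve only when unsure" gate described above, the sketch below scores a draft answer by the mean entropy of its per-token probability distributions and triggers retrieval when that score crosses a threshold. The function names, the entropy measure, the 0.5 threshold, and the toy distributions are all assumptions made for illustration; the note itself does not specify which uncertainty estimator or cutoff the cited work uses.

```python
import math
from typing import Sequence


def mean_token_entropy(token_distributions: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) across per-token probability distributions."""
    if not token_distributions:
        raise ValueError("need at least one token distribution")
    total = 0.0
    for dist in token_distributions:
        # Skip zero-probability entries: p * log(p) tends to 0 as p -> 0.
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_distributions)


def should_retrieve(token_distributions: Sequence[Sequence[float]],
                    threshold: float = 0.5) -> bool:
    """Hypothetical gate: look something up only when the draft answer is uncertain."""
    return mean_token_entropy(token_distributions) > threshold


# Toy per-token distributions over a three-word vocabulary (made-up numbers).
confident_draft = [[0.97, 0.02, 0.01], [0.95, 0.03, 0.02]]
unsure_draft = [[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]]

print(should_retrieve(confident_draft))  # False
print(should_retrieve(unsure_draft))     # True
```

A gate like this is only as good as the probabilities it reads: if training sharpens every distribution regardless of what the model knows, the entropy stays under the threshold and retrieval is never triggered.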

The damage shows up in what fine-tuning optimizes for. Supervised fine-tuning raises final-answer accuracy on benchmarks while cutting the genuine inferential content of reasoning, measured as Information Gain, by 38.9%: the model learns to produce correct answers through post-hoc rationalization rather than by working them out (see: Does supervised fine-tuning improve reasoning or just answers?). Faithfulness tests sharpen the point: after fine-tuning, terminating the reasoning chain early, paraphrasing it, or substituting filler more often leaves the final answer unchanged, so the chain becomes performative rather than functional (see: Does fine-tuning disconnect reasoning steps from final answers?). If the reasoning no longer drives the answer, then the confidence attached to that answer is no longer tracking a real inferential process. It is tracking how well the model has learned to look right. That is precisely the signal adaptive retrieval is trying to read, now corrupted.
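One of the faithfulness tests mentioned above, early termination, can be sketched as follows: truncate the reasoning chain at each step and count how often the final answer is unchanged. This is a minimal sketch under stated assumptions, not the cited paper's harness. The `AnswerFn` interface and both toy "models" are hypothetical stand-ins for a real model call.

```python
from typing import Callable, List

# Hypothetical interface: given a question and the reasoning steps seen so far,
# return the final answer the model would commit to.
AnswerFn = Callable[[str, List[str]], str]


def early_termination_invariance(question: str, steps: List[str],
                                 answer_fn: AnswerFn) -> float:
    """Fraction of truncated reasoning chains that still yield the full-chain answer."""
    if not steps:
        raise ValueError("need at least one reasoning step")
    full_answer = answer_fn(question, steps)
    unchanged = 0
    for cut in range(len(steps)):  # keep the first 0, 1, ..., len(steps) - 1 steps
        if answer_fn(question, steps[:cut]) == full_answer:
            unchanged += 1
    return unchanged / len(steps)


def reasoning_driven(question: str, steps: List[str]) -> str:
    # Toy model whose answer is computed from the steps it has actually seen.
    return str(sum(int(s) for s in steps))


def reasoning_ignoring(question: str, steps: List[str]) -> str:
    # Toy model that emits a memorized answer and never reads its reasoning.
    return "6"


question = "What is 1 + 2 + 3?"
steps = ["1", "2", "3"]

print(early_termination_invariance(question, steps, reasoning_driven))    # 0.0
print(early_termination_invariance(question, steps, reasoning_ignoring))  # 1.0
```

A high invariance score means the answer did not depend on the reasoning that was cut, which is the "performative rather than functional" pattern the note describes.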

Reward-based training names the mechanism more directly. RLHF has been shown to degrade calibration: the model's stated confidence drifts away from its actual accuracy (see: Can model confidence work as a reward signal for reasoning?). When you train a model to maximize a correctness or preference signal, you push its probability mass toward confident-looking outputs regardless of whether it should be uncertain. The retrieval gate that depended on seeing a low-confidence dip stops firing, because the model has been trained out of expressing doubt.
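The drift between stated confidence and actual accuracy can be quantified with Expected Calibration Error (ECE), a standard metric that bins predictions by confidence and averages the gap between mean confidence and accuracy in each bin. The sketch below is illustrative only: the two toy "models" and their numbers are invented, and the note does not say which calibration metric the cited work reports.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((conf, 1.0 if ok else 0.0))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(a for _, a in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Made-up example: a model that says 0.9 and is right 9 times out of 10 ...
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
# ... versus one that says 0.99 but is right only 6 times out of 10.
overconfident = expected_calibration_error([0.99] * 10, [True] * 6 + [False] * 4)

print(round(calibrated, 2))     # 0.0
print(round(overconfident, 2))  # 0.39
```

In the second case every answer sits far above any plausible low-confidence threshold even though four in ten are wrong, which is the situation in which a confidence-based retrieval gate stops firing.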

The corpus also points at the fix, which doubles as confirmation of the cause. RLSF reverses the calibration damage by making confidence itself the training target: it uses answer-span confidence to rank reasoning traces, which restores calibration while still improving reasoning (see: Can model confidence work as a reward signal for reasoning?). The fact that you can repair confidence by optimizing for it suggests the standard objectives were silently optimizing against it. There's a deeper reason confidence is worth protecting: it isn't noise. Model confidence predicts robustness: highly confident models resist prompt rephrasing, while low-confidence ones swing wildly (see: Does model confidence predict robustness to prompt changes?). Confidence is a genuine internal signal of stability, which is exactly why flattening it through fine-tuning is so costly.
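The ranking step described above, using answer-span confidence to order reasoning traces and derive synthetic preferences, might look roughly like the sketch below. It is a guess at the general shape, not RLSF's actual procedure. The `Trace` type, the mean-probability confidence measure, the `min_gap` filter, and the sample traces are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Sequence, Tuple


@dataclass
class Trace:
    reasoning: str
    answer: str
    # Hypothetical field: probability the model assigned to each answer-span token.
    answer_token_probs: Sequence[float]


def answer_span_confidence(trace: Trace) -> float:
    """Mean probability over the answer-span tokens (an assumed confidence measure)."""
    return sum(trace.answer_token_probs) / len(trace.answer_token_probs)


def build_preference_pairs(traces: Sequence[Trace],
                           min_gap: float = 0.1) -> List[Tuple[Trace, Trace]]:
    """Return (preferred, rejected) pairs ordered by answer-span confidence."""
    ranked = sorted(traces, key=answer_span_confidence, reverse=True)
    pairs = []
    # combinations() preserves order, so the first element always ranks higher.
    for higher, lower in combinations(ranked, 2):
        gap = answer_span_confidence(higher) - answer_span_confidence(lower)
        if gap >= min_gap:  # skip near-ties, which carry little preference signal
            pairs.append((higher, lower))
    return pairs


traces = [
    Trace("guess from memory", "41", [0.60, 0.70]),
    Trace("work through each step", "42", [0.90, 0.95]),
    Trace("partial derivation", "40", [0.62, 0.66]),
]
pairs = build_preference_pairs(traces)

print(len(pairs))                          # 2
print([(a.answer, b.answer) for a, b in pairs])
```

Pairs like these could then feed a preference-optimization step without human labels, which matches the note's description of RLSF as needing neither human labels nor external verifiers.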

The unexpected turn for a curious reader: the problem isn't that fine-tuned models know less. It's that they stop being able to tell you when they don't know. Related work suggests the better lever is training that rewards reasoning quality rather than token-level correctness. RLAG, an RL method that progressively internalizes coherent knowledge structures, outperforms SFT precisely because it doesn't reduce everything to final-answer matching (see: Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?). Preserve the honesty of the uncertainty signal, and adaptive retrieval keeps working; optimize it away in pursuit of benchmark accuracy, and you blind the system to its own ignorance.

Sources (6 notes)

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a systems analyst auditing the claim that fine-tuning suppresses confidence signals needed for adaptive retrieval. The question remains open: does reasoning-optimized fine-tuning necessarily corrupt uncertainty estimation, or have recent training methods, evaluation harnesses, or model architectures found ways to preserve both reasoning quality AND calibration?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as historical snapshots, not current capability ceilings.
- Supervised fine-tuning raises final-answer accuracy while degrading reasoning faithfulness (~39% loss in inferential content) and confidence calibration, making uncertainty signals unreliable for adaptive retrieval (2024–2025).
- RLHF-based training actively degrades calibration; models become overconfident regardless of actual accuracy, breaking the retrieval gate that triggers on low-confidence dips (2024–2025).
- RLSF (training confidence as an intrinsic reward) restores calibration AND improves reasoning, suggesting standard objectives silently optimize *against* honest uncertainty (2025).
- Model confidence directly predicts robustness and resistance to prompt rephrasing; flattening it removes a genuine stability signal (2025).
- RL from augmented generation embeds domain knowledge more effectively than SFT *because* it doesn't reduce everything to final-answer matching (2025).

Anchor papers (verify; mind their dates):
- arXiv:2411.15382 (2024-11): "On the Impact of Fine-Tuning on Chain-of-Thought Reasoning"
- arXiv:2501.12835 (2025-01): "Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home"
- arXiv:2509.20162 (2025-09): "Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation"
- arXiv:2603.03415 (2026-03): "Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs"

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, check whether newer post-training recipes (e.g., DPO, IPO, or multi-objective RL), confidence-aware evaluation suites, or model-internal probes have since relaxed or overturned the degradation pattern. Separate the durable problem (confidence-as-stability may still be suppressed by standard SFT) from the perishable limitation (RLSF or domain-grounded RL may have solved it). Cite what solved it; state plainly where the constraint still holds.
(2) Surface the strongest *contradicting* or *superseding* work from the last 6 months — papers showing SFT *can* preserve calibration, or adaptive retrieval working *without* confidence signals, or confidence being a poor stability proxy after all.
(3) Propose 2 research questions that *assume the regime has moved*: e.g., "If multi-objective RL can preserve both reasoning and calibration, what is the minimal additional compute cost?"; "Does confidence-aware retrieval outperform uncertainty-free adaptive methods on out-of-distribution queries?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
