INQUIRING LINE


Training an AI on new data quietly shifts which source it trusts: what you just told it, or what it already knew.

Why does fine-tuning change how models process retrieved context?

This explores what fine-tuning actually does to a model's relationship with information sitting in its context window — whether it changes how, or how much, the model leans on what it retrieves versus what it already 'knows.'

This reads the question as a tug-of-war between two sources of knowledge inside a model: the parametric knowledge baked into its weights, and the in-context information it retrieves at run time. The corpus suggests fine-tuning doesn't just teach new facts; it quietly reweights which of those two sources wins. The starting point is that even a base model already ignores its context when prior associations are strong: parametric knowledge from training dominates, and textual prompting alone can't override it (note: Why do language models ignore information in their context?). Fine-tuning plausibly pushes further in that same direction, because most post-training sharpens what the model already has rather than installing genuinely new procedures (note: Do fine-tuned language models actually learn optimization procedures?).

The most concrete mechanism for 'how context gets processed' is retrieval heads: fewer than 5% of attention heads do the actual work of pulling facts out of long context, and they're causally necessary. Prune them and the model hallucinates even when the answer is sitting right there (note: What mechanism enables models to retrieve from long context?). Because this machinery is so sparse and specific, fine-tuning that nudges attention patterns could plausibly degrade context-faithfulness without moving benchmark accuracy. The faithfulness work finds a parallel effect in reasoning chains: after fine-tuning, a model's chain of thought less reliably drives its answers. Truncate it, paraphrase it, or replace it with filler, and the answer stays the same more often than before. The reasoning becomes performative rather than functional (note: Does fine-tuning disconnect reasoning steps from final answers?).
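The perturbation tests mentioned above (truncating the chain, replacing it with filler) can be sketched as a small harness. This is a minimal illustration, not the cited paper's evaluation code: `answer_fn`, the example dictionary layout, and the two stand-in models are all hypothetical names introduced here. A paraphrasing test would plug in the same way, given some paraphraser passed as `perturb`.

```python
from typing import Callable, Dict, List

# Hypothetical interface: takes a question and a (possibly edited) reasoning
# chain, and returns the model's final answer as a string.
AnswerFn = Callable[[str, List[str]], str]


def truncate(steps: List[str], keep_fraction: float = 0.5) -> List[str]:
    """Early termination: keep only the first part of the chain."""
    return steps[: int(len(steps) * keep_fraction)]


def filler(steps: List[str], token: str = "...") -> List[str]:
    """Filler substitution: replace every step with a content-free token."""
    return [token for _ in steps]


def invariance_rate(
    answer_fn: AnswerFn,
    examples: List[Dict],
    perturb: Callable[[List[str]], List[str]],
) -> float:
    """Fraction of examples whose answer is unchanged after perturbing the chain.

    A higher rate means the chain has less influence on the answer.
    """
    if not examples:
        return 0.0
    unchanged = 0
    for ex in examples:
        original = answer_fn(ex["question"], ex["steps"])
        perturbed = answer_fn(ex["question"], perturb(ex["steps"]))
        if original == perturbed:
            unchanged += 1
    return unchanged / len(examples)


# Stand-in "models" for illustration only (assumptions, not real LLM calls).
def chain_following_model(question: str, steps: List[str]) -> str:
    # Answers with whatever the last reasoning step says.
    return steps[-1] if steps else "unknown"


def chain_ignoring_model(question: str, steps: List[str]) -> str:
    # Ignores the chain entirely.
    return "42"
```

Comparing `invariance_rate` for a base model and its fine-tuned counterpart on the same examples would give a rough signal of the kind described: if the rate rises after fine-tuning, the chain is doing less of the work.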

Here's the thing a curious reader might not expect: this is often a side effect of misallocating where adaptation lives. Fast-Slow Training shows that routing task-specific lessons into the prompt (fast, textual) while keeping weight updates minimal (slow) reaches equivalent performance 1.4–3x faster and with far less catastrophic forgetting, which frames forgetting as a misallocation problem, not an inherent cost of learning (note: Can splitting adaptation into two channels reduce forgetting?). The implication runs backward into the question: when you cram adaptation into the weights instead, you may be overwriting the model's openness to its own context. And there's a hard ceiling on the other side too: prompting and context can only reactivate knowledge already in the training distribution; they can't inject what was never there (note: Can prompt optimization teach models knowledge they lack?).

RL-style fine-tuning shows the same fingerprint from a different angle. It tends to collapse onto a single dominant format inherited from pretraining within the first epoch, suppressing alternatives regardless of which is better (note: Does RL training collapse format diversity in pretrained models?). A model funneled toward one rigid output mode is, almost by definition, a model that treats incoming context more as a cue to trigger a memorized template than as evidence to reason over (note: Do fine-tuned language models actually learn optimization procedures?).
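One simple way to watch for this kind of collapse is to track the entropy of output formats across training checkpoints. The sketch below assumes each sampled output has already been assigned a format label (for example "code" or "prose") by some classifier not shown here; the labels and the function name are illustrative, not taken from the cited work.

```python
import math
from collections import Counter
from typing import Iterable


def format_entropy(labels: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the format labels in a batch of outputs.

    Values near 0 mean almost every output uses the same format.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # sum of p * log2(1 / p) over observed formats
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Illustrative batches: an even mix before training, a single format after.
before = ["code", "code", "prose", "prose"]
after = ["code", "code", "code", "code"]
print(format_entropy(before), format_entropy(after))
```

A sharp drop in this number within the first epoch would match the collapse described above, though it says nothing on its own about whether the surviving format is the better one.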

The takeaway the reader didn't know they wanted: 'processing retrieved context' isn't one knob but a balance between sparse retrieval circuitry and dominant priors, and fine-tuning is one of the most reliable ways to tip that balance toward the priors. If you want models that stay genuinely responsive to what they retrieve, the corpus points toward keeping adaptation in the fast, textual channel and watching the retrieval heads, rather than baking everything into the weights. Worth knowing too: this brittleness compounds. Once context starts filling with a model's own errors, performance degrades non-linearly, and model scaling does not fix it; only test-time compute, in the form of thinking models, reins it back in (note: Do models fail worse when their own errors fill the context?).

Sources (8 notes)

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

What mechanism enables models to retrieve from long context?

Fewer than 5% of attention heads, across all model families studied, function as retrieval heads. These heads are intrinsic to short-context models, activate dynamically depending on context, and are causally necessary for factuality. Pruning them causes hallucination even when the information is present in context.
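A rough sketch of how a single head might be scored for retrieval behavior: count how many needle tokens were generated while the head's strongest attention pointed at that same token inside the needle. This loosely follows the copy-paste idea behind retrieval heads; the exact scoring rule, the function name, and the input layout are assumptions made here for illustration, not the paper's implementation.

```python
from typing import List, Sequence, Tuple


def retrieval_score(
    attn: Sequence[Sequence[float]],
    context_tokens: List[str],
    generated_tokens: List[str],
    needle_span: Tuple[int, int],
) -> float:
    """Fraction of distinct needle tokens that one head 'copied'.

    attn[t][j] is the head's attention weight from decoding step t to
    context position j. needle_span is a half-open (start, end) range
    marking where the needle sits in the context.
    """
    start, end = needle_span
    needle = set(context_tokens[start:end])
    if not needle:
        return 0.0
    copied = set()
    for step, row in enumerate(attn):
        # Context position this head attends to most at this decoding step.
        top = max(range(len(row)), key=lambda j: row[j])
        token = generated_tokens[step]
        if start <= top < end and context_tokens[top] == token:
            copied.add(token)
    return len(copied) / len(needle)


context = ["the", "sky", "is", "green", "today"]
generated = ["sky", "is", "green"]
# A head that looks at the token being copied at every step.
copy_head = [[0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0]]
# A head that always looks at the first context token.
other_head = [[1, 0, 0, 0, 0]] * 3
print(retrieval_score(copy_head, context, generated, (1, 4)))
print(retrieval_score(other_head, context, generated, (1, 4)))
```

Scoring every head this way before and after fine-tuning would be one way to "watch the retrieval heads", under the assumptions stated above.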

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.


Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (match 2.62)
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! (match 2.53)
On the Impact of Fine-Tuning on Chain-of-Thought Reasoning (match 1.74)
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (match 1.69)
Learning To Retrieve Prompts for In-Context Learning (match 1.68)
Retrieval Head Mechanistically Explains Long-Context Factuality (match 0.89)
Train Long, Think Short: Curriculum Learning for Efficient Reasoning (match 0.87)
Learning, Fast and Slow: Towards LLMs That Adapt Continually (match 0.87)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating how fine-tuning alters in-context reasoning in LLMs. The question remains open: does fine-tuning degrade context-faithfulness mechanistically, or have recent training methods, architectural innovations, or evaluation frameworks since relaxed this constraint?

What a curated library found — spanning 2023–2026, so treat as dated claims, not current truth:
• Retrieval heads (<5% of attention heads) causally drive long-context factuality; fine-tuning can degrade their function without harming benchmark accuracy (~2024).
• Fine-tuning reduces chain-of-thought faithfulness: truncating or paraphrasing reasoning steps leaves answers unchanged, suggesting adaptation rewrites weights rather than strengthening reasoning circuits (~2024).
• RL post-training collapses onto a single dominant pretraining-inherited format within the first epoch, suppressing contextual sensitivity (~2025).
• Fast-Slow Training (splitting textual and weight-based adaptation) avoids catastrophic forgetting and preserves context-openness; weight-heavy fine-tuning misallocates where learning lives (~2026).
• Self-conditioning amplifies errors non-linearly when context accumulates model mistakes; only test-time compute, not model scaling, recovers performance (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2404.15574 (2024) – Retrieval Head Mechanistically Explains Long-Context Factuality
• arXiv:2411.15382 (2024) – On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
• arXiv:2504.07912 (2025) – Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
• arXiv:2605.12484 (2026) – Learning, Fast and Slow: Towards LLMs That Adapt Continually

Your task:
(1) RE-TEST the brittleness claim. Has architectural change (e.g., mixture-of-experts routing, sparse attention), training innovation (e.g., consistency training ~2510.27062, recursive models ~2512.24601), or new evaluation harnesses (e.g., long-horizon execution benchmarks ~2509.09677) since OVERTURNED the finding that fine-tuning suppresses context use? Separate the durable observation (weights can encode rigid priors) from the perishable limitation (fine-tuning *necessarily* degrades context-openness). Cite what resolved it; flag where it still holds.

(2) Surface the strongest *contradicting or reconciling* work from the last 6 months—especially anything showing fine-tuning *preserves* or *enhances* context-fidelity under specific regimes (curriculum learning ~2508.08940, domain injection ~2502.10708, or new data permeation ~2504.09522).

(3) Propose 2 research questions that assume the regime may have shifted:
   — Does continual fine-tuning with a Fast-Slow split and retrieval-head-aware regularization preserve context-responsiveness across distribution shifts?
   — Can recursive or hierarchical models (where reasoning is embedded in the forward pass rather than side-stepped by memorization) structurally resist the collapse RL post-training induces?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
