INQUIRING LINE

Retraining an AI to forget something rarely works: the facts from original training live in layers that fine-tuning barely touches.

Why does fine-tuning fail to remove temporal contamination from pretraining?

This line explores why fine-tuning fails to remove what pretraining installed: facts, associations, and 'stale' temporal knowledge. Temporal contamination is one case of a broader pattern in which fine-tuning cannot reliably reach knowledge stored during pretraining. The corpus points to a structural explanation: pretraining and fine-tuning act mainly on different layers of the model, so fine-tuning is working in the wrong place to delete a pretrained fact.

The sharpest evidence is the architectural split. Scaling experiments show that pretraining enriches factual knowledge in the model's lower layers, while fine-tuning mostly modifies behavior expression in the upper layers (see "Do pretraining and fine-tuning scale independently in language models?"). Proxy-tuning makes the same point from the opposite direction: tuning at decoding time preserves pretrained knowledge because it leaves the base weights untouched, whereas direct fine-tuning *corrupts* lower-layer knowledge storage; the distributional nudges applied at decoding time mainly affect reasoning and style (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). So fine-tuning mostly changes how the model talks, not what it knows, and a stale temporal association lives in the part fine-tuning barely edits.
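As a rough illustration of the decoding-time idea, here is a minimal sketch. It assumes the logit-arithmetic formulation commonly associated with proxy-tuning, in which the frozen base model's next-token logits are shifted by the difference between a small tuned model and its untuned counterpart. The function names and the toy three-token vocabulary are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_logits(base, small_tuned, small_untuned):
    """Shift base logits by the (tuned - untuned) offset of a small model pair.

    The base model's weights are never modified; only its output
    distribution is nudged at decoding time.
    """
    return [b + (t - u) for b, t, u in zip(base, small_tuned, small_untuned)]

# Toy 3-token vocabulary; all numbers are illustrative assumptions.
base_logits = [2.0, 1.0, 0.0]           # large pretrained model
small_tuned_logits = [0.0, 1.5, 0.0]    # small model after tuning
small_untuned_logits = [0.0, 0.5, 0.0]  # same small model before tuning

shifted = proxy_tuned_logits(base_logits, small_tuned_logits, small_untuned_logits)
probs = softmax(shifted)
```

In this toy case the tuned small model prefers the second token by one logit more than its untuned version, so the base model's second-token logit rises by the same amount while its parameters stay as they were.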

This is why pretrained priors keep winning at inference. Models routinely ignore information placed in their context when a strong training-time association points the other way; textual prompting alone can't override the prior, and only causal intervention in the representations does (see "Why do language models ignore information in their context?"). The same stubbornness shows up under adversarial conditions: poisoned data injected at pretraining survives standard safety alignment for most attack types (see "How much poisoned training data survives safety alignment?"). If alignment can't remove deliberately planted content, it's no surprise it can't remove incidentally absorbed temporal facts.

There's also a subtler reason fine-tuning leaves pretraining intact: it tends to *amplify* what's already there rather than overwrite it. RL post-training converges on a single dominant format already present in the pretraining distribution and suppresses the others (see "Does RL training collapse format diversity in pretrained models?"). RL fine-tuning also sharpens existing memorization rather than installing new procedures: models still collapse on out-of-distribution variants (see "Do fine-tuned language models actually learn optimization procedures?"). Fine-tuning re-weights and surfaces pretrained content; it doesn't perform deletion. Priming work reinforces this: whether a fact activates after a gradient update is predictable from its pre-learning probability, meaning the pretrained substrate sets the terms (see "Can we predict keyword priming before learning happens?").

The thing worth taking away: 'removing' knowledge isn't what fine-tuning does at all. It's a behavior-shaping operation layered on top of a knowledge store it can dent but not erase. If you actually need to evict temporal contamination, the corpus hints that the leverage is elsewhere: decoding-time interventions (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"), parameter-isolation methods that target specific weight regions (see "Can isolating task-specific parameters prevent multi-task fine-tuning interference?"), or direct causal edits to representations, rather than more gradient steps over the same base.
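To make the parameter-isolation option more concrete, here is a minimal sketch, assuming a precomputed boolean mask already marks which weights count as "core" (frozen) and which may be updated. The mask, learning rate, and function name are hypothetical. This is plain masked gradient descent, not the method of the cited note, which also clusters tasks and merges non-core parameters.

```python
def masked_sgd_step(weights, grads, frozen_mask, lr=0.1):
    """Apply one SGD step, leaving every weight whose mask entry is True unchanged."""
    return [
        w if frozen else w - lr * g
        for w, g, frozen in zip(weights, grads, frozen_mask)
    ]

# Toy parameter vector; all values are illustrative assumptions.
weights = [1.0, -2.0, 0.5, 3.0]
grads = [0.5, 0.5, -1.0, 2.0]
frozen_mask = [True, False, False, True]  # hypothetical: entries 0 and 3 are "core"

updated = masked_sgd_step(weights, grads, frozen_mask)
```

The design point is that the update is restricted to a chosen weight region instead of being spread over the whole model, which is the property the paragraph above attributes to parameter-isolation methods.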

Sources (8 notes)

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

How much poisoned training data survives safety alignment?

Denial-of-service, context extraction, and belief manipulation attacks persist through standard safety alignment at 0.1% poisoning rates, while jailbreaking attacks are successfully suppressed, contradicting sleeper agent persistence hypotheses.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (4.21 match, arXiv)
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (3.31 match, arXiv)
How new data permeates LLM knowledge and how to dilute it (1.76 match, arXiv)
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! (1.68 match, arXiv)
Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs (1.68 match, arXiv)
An Emulator for Fine-Tuning Large Language Models using Small Language Models (1.67 match, arXiv)
Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels (1.66 match, arXiv)
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts (1.66 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating whether fine-tuning can remove temporal contamination from pretraining. This question was examined by a curated library spanning 2022–2026; treat those findings as dated claims to re-test, not current truth.

What a curated library found — and when (dated claims, not current truth):
• Pretraining enriches factual knowledge in lower layers; fine-tuning modifies upper layers (behavior expression), leaving lower-layer facts intact (2024–2025).
• Proxy-tuning at decoding time preserves pretrained knowledge better than direct fine-tuning because direct tuning corrupts lower-layer storage (2025).
• RL post-training amplifies existing pretraining distributions rather than overwriting them; models still collapse on out-of-distribution variants despite fine-tuning (2025).
• Pre-training poisoning at 0.1% of data persists through post-training alignment for most attack types (2024).
• Knowledge activation after gradient updates is predictable from pre-learning probability, meaning the pretrained substrate sets the terms (2025).

Anchor papers (verify; mind their dates):
• arXiv:2504.07912 — Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (2025)
• arXiv:2410.13722 — Persistent Pre-Training Poisoning of LLMs (2024)
• arXiv:2508.21741 — Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance (2025)
• arXiv:2510.27062 — Consistency Training Helps Stop Sycophancy and Jailbreaks (2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, judge whether newer models (post-2026 architectures), training methods (e.g., mechanistic fine-tuning, causal intervention during training), or evaluation harnesses have since RELAXED or OVERTURNED it. Separate the durable question (can fine-tuning ever truly delete pretraining-stored facts?) from perishable limitations (e.g., standard SGD cannot). What resolved it?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — e.g., does any paper show successful temporal-contamination removal, or argue the layer-separation thesis is incomplete?
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Can mechanistic layer surgery + consistency constraints jointly erase temporal facts?" or "Does continual fine-tuning on rolling-window data eventually dilute pretraining contamination?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
