INQUIRING LINE


Fine-tuning an AI to say the right things may silently erase some of what it genuinely learned.

How do early layers preserve unbiased information while late layers conform?

This reads the question as: do the deeper, earlier parts of a model hold onto raw pretrained knowledge while a later stage gets pulled toward a single 'approved' output — and the corpus speaks to this more through the stages of training than through literal layer-by-layer probes.

This explores a 'depth vs. conformity' split: the idea that some part of a model keeps the raw, varied information it learned while another part narrows toward one sanctioned answer. The corpus doesn't have a single study that probes layer by layer, but it maps this tension cleanly onto two related distinctions: *lower layers vs. reasoning/style*, and *pretraining vs. post-training*. The clearest direct hit is proxy-tuning, where researchers find that direct fine-tuning corrupts knowledge stored in the lower layers, while a decoding-time approach leaves the base weights untouched and shifts mainly reasoning and style, closing 88-91% of the alignment gap without damaging what the base model knew (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). That's the 'early preserves, late conforms' picture in miniature: the knowledge substrate is fragile and worth protecting, and the conformity lives nearer the output.
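To make the decoding-time idea concrete, here is a minimal sketch of the logit arithmetic usually associated with proxy-tuning: the base model's next-token logits are offset by the difference between a small tuned "expert" and its untuned "anti-expert" counterpart. The notes behind this line only say that a distributional shift is applied at decoding time without touching base weights, so the exact formula, the function names, and the toy three-token vocabulary below are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits):
    """Shift base logits by (expert - anti-expert), then normalize.

    Assumed formulation: the base model's weights are never updated;
    only its output distribution is steered at decoding time.
    """
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)


# Hypothetical next-token logits over a toy 3-token vocabulary.
base = [2.0, 1.0, 0.5]    # large untuned model
expert = [0.5, 2.0, 0.0]  # small tuned model
anti = [1.0, 1.0, 0.0]    # small untuned model

probs = proxy_tuned_probs(base, expert, anti)
print(probs)
```

When the expert and anti-expert agree, the offset is zero and the base distribution comes back unchanged, which is the sense in which the knowledge substrate is left alone.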

The conformity half of the question has a sharp answer in how reinforcement learning behaves. RL post-training doesn't teach a model new formats; it picks one format already present from pretraining and amplifies it within the first epoch while suppressing the alternatives, and which one wins depends on model scale, not necessarily on performance (see "Does RL training collapse format diversity in pretrained models?"). So 'late layers conform' is better read as 'late *training* collapses diversity': the variety was there, and the final stage funnels it into one dominant channel. The flip side is that staying close to the base distribution, that is, low KL drift, is what preserves a model's ability to keep learning; pushed too far from base, models stall when the task domain changes (see "Does staying close to the base model preserve learning ability?").
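"KL drift" here can be read as the KL divergence between the post-trained model's output distribution and the base model's. The sketch below computes it for two hypothetical distributions over three output formats, one mildly shifted and one collapsed onto a single format. The direction KL(tuned || base), the numbers, and the function name are assumptions for illustration, not figures from the cited work.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length.

    eps guards against log(0); terms with p_i == 0 contribute nothing.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical probabilities over three output formats.
base = [0.40, 0.30, 0.30]       # pretrained model: varied formats
mild = [0.50, 0.25, 0.25]       # post-training that stays near base
collapsed = [0.98, 0.01, 0.01]  # post-training that amplifies one format

drift_mild = kl_divergence(mild, base)
drift_collapsed = kl_divergence(collapsed, base)
print(f"mild drift:      {drift_mild:.4f} nats")
print(f"collapsed drift: {drift_collapsed:.4f} nats")
```

In this toy setup the collapsed distribution sits much farther from base than the mild one, which is the kind of gap the low-drift finding is about.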

But here's the twist that should reframe the whole question: 'early = unbiased' is the part the corpus pushes back on hardest. A causal experiment varying random seeds and cross-tuning found that cognitive biases are *planted during pretraining* and merely swayed by instruction tuning: models sharing a backbone show similar bias patterns regardless of finetuning data (see "Where do cognitive biases in language models come from?"). So the deep, 'preserved' information isn't neutral; it's where the bias originates. The conformity you see late is downstream of priors set early, not a corruption of an innocent base.

That reframing connects to why models ignore their own context: when parametric knowledge from training is strong, it dominates in-context information, and textual prompting alone can't override it; you need causal intervention in the internal representations (see "Why do language models ignore information in their context?"). The 'early' knowledge isn't just preserved, it actively overrides fresh signal. And the broader caution: high accuracy is not the same as unbiased reasoning, since 'theory-free' models can mask bias behind strong metrics while making correlation-causation errors (see "Can AI models be truly free from human bias?").

The thing you might not have known you wanted: the corpus inverts the premise. Early structure isn't the unbiased keeper of truth and late structure the conformist — early structure is where bias is *baked in*, and the real engineering lever is keeping post-training close enough to base to preserve plasticity and knowledge, while accepting that 'preserved' never meant 'neutral.'

Sources (6 notes)

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.


Can AI models be truly free from human bias?

Research shows that 'theory-free' AI models mask bigotry behind high accuracy metrics while committing fundamental statistical errors. A 95% accurate criminal justice system would wrongly convict thousands, demonstrating that model sophistication does not validate causal inference.


Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about layer-wise information preservation in LLMs. The question remains open: do early layers genuinely preserve unbiased information while late layers enforce conformity?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat as perishable snapshots:
- Proxy-tuning at decoding time preserves pretrained knowledge in lower layers better than direct fine-tuning, recovering alignment without corrupting the base substrate (~2025).
- RL post-training doesn't create new formats; it amplifies one pretraining distribution within the first epoch, suppressing alternatives based on model scale (~2025).
- Cognitive biases are planted during pretraining and merely modulated by instruction tuning — early layers encode bias, not neutrality (~2025).
- Low KL drift from base model preserves plasticity; high drift causes models to stall on domain shift (~2026).
- Models ignore context when parametric knowledge is strong; textual prompting alone cannot override pretraining associations (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2504.07912 (2025-04) — Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- arXiv:2507.07186 (2025-07) — Planted in Pretraining, Swayed by Finetuning: Cognitive Bias Origins
- arXiv:2605.12484 (2026-05) — Learning, Fast and Slow: Continual Adaptation in LLMs
- arXiv:2510.27062 (2025-10) — Consistency Training and Sycophancy Control

Your task:
(1) RE-TEST THE INVERSION. The library flips the premise: early ≠ unbiased, late ≠ merely conforming. For each finding above, determine whether newer model scales (o1, Claude 3.5, Llama-405B), improved fine-tuning methods (LoRA variants, spectral regularization), or evaluation tools (mechanistic interpretability, causal probing) have since relaxed the bias-in-pretraining constraint or shown that instruction tuning can truly neutralize early-layer priors. Separate: Is the *tension* (early vs. late) still real, or has the regime dissolved it?
(2) Surface the strongest work from ~6 months onward that contradicts the "early bias is destiny" thesis — particularly any showing instruction tuning *does* override pretraining, or early layers *do* decouple from bias under specific conditions.
(3) Propose two questions assuming the regime may have moved: (a) Can layer-wise regularization (e.g., sparsity, orthogonality constraints) decouple early information from its bias without proxy-tuning? (b) Does continual adaptation (as per 2605.12484) allow models to *selectively unlearn* biased early-layer structure while keeping factual knowledge?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
