INQUIRING LINE

An AI's biases are baked in during its first massive training pass — everything done afterward only shifts them, never removes them.

Why do unified models still inherit data-distribution biases from training?

This explores why models that look 'unified' or fully trained still carry the statistical fingerprints of their training data — and where in the training pipeline those biases actually get baked in.

A finished model still carries the statistical leanings of its training data, even after all the alignment and fine-tuning meant to clean it up. The short version the corpus keeps pointing at: the biases are laid down early, during pretraining, and almost everything you do afterward only nudges them rather than removes them. A causal study that varied random seeds and cross-tuned models found that models sharing a pretrained backbone show similar cognitive bias patterns no matter what instruction data you fine-tune them on: fine-tuning sways the bias, but pretraining plants it (see 'Where do cognitive biases in language models come from?'). So 'unified' is a bit of a mirage: the surface behavior is unified, but the underlying distribution was set before you ever touched it.

The later training stages don't just fail to remove the bias; some of them actively concentrate it. Reinforcement learning, rather than broadening a model, tends to pick one dominant format that already existed in pretraining and amplify it within the first epoch while suppressing the alternatives, and which format 'wins' depends on model scale, not necessarily on which one performs best (see 'Does RL training collapse format diversity in pretrained models?'). That's the opposite of correction: it's a winner-take-all collapse onto a pre-existing distribution. Push the training signal too hard and it gets worse: overly difficult RLVR samples teach degenerate shortcuts that then contaminate capabilities the model already had (see 'Do overly hard RLVR samples actually harm model capabilities?').
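The note on overly hard RLVR samples attributes the shortcut learning to group-relative normalization, which treats a rare accidental success as a high-advantage trajectory. A minimal sketch of that arithmetic follows, assuming a GRPO-style mean/std normalization within each group of rollouts; the group sizes, the binary rewards, and the function name `group_relative_advantages` are illustrative assumptions, not taken from the cited work.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of 16 rollouts on a nearly impossible problem:
# a single accidental success, fifteen failures.
hard_group = [1.0] + [0.0] * 15

# Hypothetical group of 16 rollouts on a moderate problem: half succeed.
moderate_group = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard_group)
moderate_adv = group_relative_advantages(moderate_group)

print(f"advantage of the lone success on the hard problem: {hard_adv[0]:.2f}")
print(f"advantage of a success on the moderate problem:    {moderate_adv[0]:.2f}")
```

Under these assumed numbers the lone lucky rollout receives nearly four times the advantage of a success on the moderate problem, so whatever that rollout happened to do, including repeating an answer or skipping computation, is reinforced strongly.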

There's a deeper reason the inherited distribution is sticky: the model trusts what it learned over what you tell it now. Language models routinely generate outputs that contradict their own context because parametric knowledge from training overrides in-context information, and textual prompting alone can't override a strong prior; you have to intervene in the representations themselves (see 'Why do language models ignore information in their context?'). So even at inference time, the training distribution is doing the steering. This is also why the bias survives fine-tuning's blunt instruments: direct weight fine-tuning corrupts knowledge stored in lower layers, which is partly why gentler approaches that leave base weights untouched, such as proxy-tuning at decoding time or intervening on frozen representations, preserve more of the model while shifting behavior (see 'Can decoding-time tuning preserve knowledge better than weight fine-tuning?' and 'Can editing hidden representations beat weight updates for finetuning?'). The bias lives in the distribution, so the methods that touch the distribution most violently are the ones that scramble it without fixing it.
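Proxy-tuning's decoding-time shift can be illustrated with toy logits. The sketch below assumes the usual formulation, in which the difference between a small tuned model's logits and its untuned counterpart's logits is added to the large base model's logits at each decoding step. The four-token vocabulary, the helper names, and all numbers are hypothetical.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_logits(base, tuned_small, untuned_small):
    """Add the small models' logit difference to the base logits; base weights are never updated."""
    return [b + (t - u) for b, t, u in zip(base, tuned_small, untuned_small)]

# Hypothetical next-token logits over a 4-token vocabulary.
base_logits = [2.0, 1.0, 0.5, 0.0]           # large pretrained model
untuned_small_logits = [1.0, 1.0, 1.0, 1.0]  # small model before tuning
tuned_small_logits = [1.0, 2.5, 1.0, 1.0]    # small model after tuning

shifted_logits = proxy_tuned_logits(base_logits, tuned_small_logits, untuned_small_logits)
base_probs = softmax(base_logits)
shifted_probs = softmax(shifted_logits)

base_choice = base_probs.index(max(base_probs))
shifted_choice = shifted_probs.index(max(shifted_probs))
print("greedy token before shift:", base_choice)
print("greedy token after shift: ", shifted_choice)
```

In this toy case the shift changes which token is preferred while the base model's own logits on the untouched tokens are carried through as they were, which is the sense in which the base weights stay intact.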

What you might not expect: this isn't only a within-model problem, it's an across-the-field problem. When 70+ different models were run on 26K open-ended queries, they converged on strikingly similar, sometimes identical, answers: an 'Artificial Hivemind' effect driven by overlapping training corpora and shared alignment recipes (see 'Do different AI models actually produce diverse outputs?'). So the diversity you'd hope to get from ensembling many models is partly an illusion, because they all inherited the same distributional priors. And the bias can compound on itself: recommendation and ranking systems trained on their own past outputs lock into feedback loops, converging on degenerate equilibria that amplify their earlier decisions unless selection bias is modeled explicitly (see 'Why do ranking systems need to model selection bias explicitly?'). The unifying thread, and the thing worth carrying away: a model's biases aren't a bug bolted on at the end that you can sand off. They're the shape of the data distribution itself, set early, amplified by training pressure, trusted over fresh evidence, and shared across the whole ecosystem. That's also why the dream of a 'theory-free,' bias-free model that just reads correlations off the data is a fallacy: high accuracy launders the bias rather than removing it (see 'Can AI models be truly free from human bias?').

Sources 9 notes

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Why do ranking systems need to model selection bias explicitly?

YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.

Can AI models be truly free from human bias?

Research shows that 'theory-free' AI models mask bigotry behind high accuracy metrics while committing fundamental statistical errors. A 95% accurate criminal justice system would wrongly convict thousands, demonstrating that model sophistication does not validate causal inference.
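The arithmetic behind the last note is easy to sketch. The caseload below, and the simplifying assumption that the 5% error rate falls evenly on innocent defendants, are hypothetical and chosen only to show how a small error rate becomes a large absolute count.

```python
def wrongful_convictions(innocent_defendants, error_percent):
    """Count innocent defendants convicted when error_percent of them are misjudged."""
    return innocent_defendants * error_percent // 100

# Hypothetical caseload: 100,000 innocent defendants pass through a
# system that is "95% accurate", i.e. wrong in 5 cases out of 100.
innocent = 100_000
errors = wrongful_convictions(innocent, 5)

print(f"{errors} wrongful convictions out of {innocent} innocent defendants")
```

A headline accuracy figure says nothing about who absorbs the remaining errors, which is the note's point about accuracy masking bias.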

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking whether unified LLM biases remain sticky across newer training methods and model scales. The question: why do models still inherit data-distribution biases despite alignment and fine-tuning?

What a curated library found — and when (dated claims, not current truth; path spans 2024–2026):
• Cognitive biases are planted during pretraining and fine-tuning only nudges them; models sharing a pretrained backbone show similar bias patterns regardless of instruction data (arXiv:2507.07186, ~2025).
• RL post-training converges on a single dominant pretraining distribution format within the first epoch, amplifying one pre-existing path rather than correcting bias (arXiv:2504.07912, ~2025).
• Parametric knowledge from training overrides in-context information; textual prompting alone cannot override strong priors—representation-level intervention required (arXiv:2507.07186, ~2025).
• 70+ models on 26K open-ended queries converge on strikingly similar or identical answers due to overlapping corpora and alignment recipes; ensemble diversity is partly illusory (arXiv:2510.22954, ~2025).
• Gentler methods (proxy-tuning, frozen-representation intervention) preserve pretrained knowledge better than direct weight fine-tuning (arXiv:2404.03592, ~2024; arXiv:2605.28388, ~2026).

Anchor papers (verify; mind their dates):
- arXiv:2507.07186, Planted in Pretraining, Swayed by Finetuning (2025)
- arXiv:2504.07912, Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (2025)
- arXiv:2510.22954, Artificial Hivemind (2025)
- arXiv:2404.03592, ReFT: Representation Finetuning (2024)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, check whether newer models (post-2026-Q2), continual adaptation methods (arXiv:2605.12484), consistency training (arXiv:2510.27062), or mechanistic understanding of sample difficulty (arXiv:2605.28388) have since relaxed or overturned these claims. Separate the durable question—do biases originate in pretraining?—from the perishable claim—RL and fine-tuning cannot override them. Plainly state where constraints still hold and what method has, if any, demonstrated escape.
(2) Surface the strongest work from the last ~6 months that contradicts the 'bias is sticky' thesis or shows a training regime that breaks the pretraining-dominance pattern.
(3) Propose 2 research questions assuming the regime may have shifted: e.g., "Can continual online adaptation dilute pretraining biases faster than current RLVR?" or "Do models trained on synthetic, bias-curated data from the start (rather than retrofit) escape the pretraining trap?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
