INQUIRING LINE

Does alignment compound cultural bias that started during pretraining?

This explores whether the alignment/fine-tuning stage (RLHF, preference optimization) creates cultural bias or whether it inherits and concentrates bias already laid down during pretraining.


Does alignment manufacture cultural bias, or just inherit it? The corpus points firmly at inheritance, with a twist: alignment rarely adds new bias, but it concentrates, hides, and entrenches what pretraining already planted. The cleanest causal result here is that cognitive and cultural biases are baked in during pretraining and only *swayed* by fine-tuning: models sharing a pretrained backbone show the same bias fingerprints no matter what instruction data they're tuned on (see "Where do cognitive biases in language models come from?"). So the source of the bias isn't alignment. The interesting question is what alignment then does to it.

The answer is that alignment compounds bias less by amplifying it and more by *masking* it. Indirect probes borrowed from psychology (Implicit Association Test–style methods) surface stereotypical associations that aligned models flatly refuse to state under direct questioning: alignment training conceals the bias rather than removing it (see "Can indirect psychology tests reveal what LLMs conceal about bias?"). That's a worse failure mode than raw bias, because the model now passes the surface audit while the underlying representation is unchanged. The cultural skew goes underground.

There's also a genuine narrowing mechanism, not just masking. When RL post-training runs, it converges on a single dominant format distribution carried over from pretraining and collapses the alternatives, and which one wins depends on model scale, not necessarily on performance (see "Does RL training collapse format diversity in pretrained models?"). Read culturally, that's compounding: if pretraining over-represents one set of norms, alignment doesn't balance the field, it picks the majority channel and suppresses the rest. The same single-policy problem shows up in how alignment is studied: the linguistic-alignment literature is built almost entirely on WEIRD (Western, Educated, Industrialized, Rich, Democratic) samples, so a single alignment policy is unlikely to behave the same way across cultures (see "Does linguistic alignment work the same way across cultures?").
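One way to make the narrowing mechanism concrete is to track the Shannon entropy of the output distribution before and after RL. The following is a minimal sketch under stated assumptions: the two distributions are made-up shares over three hypothetical formats (or norm sets), chosen only to illustrate collapse, and are not figures from the cited note.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution (zero-probability terms are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical shares of three formats; illustrative numbers only.
before_rl = [0.5, 0.3, 0.2]    # pretraining mixture: one format leads, others survive
after_rl = [0.98, 0.01, 0.01]  # post-RL: the leading format dominates

print(f"entropy before RL: {shannon_entropy(before_rl):.3f} bits")
print(f"entropy after RL:  {shannon_entropy(after_rl):.3f} bits")
```

A drop in entropy toward zero is what "collapsing the alternatives" looks like numerically. It says nothing about which format won or whether that format is the better one.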

The sharpest evidence that the bias is pretraining-deep, not alignment-deep, is the fingerprint of *shared* error. Across GPT-4.5, Gemini, and Claude, models beat almost every individual human at predicting social norms, yet they all make the *same* systematic mistakes on unwritten norms (see "Can AI learn social norms better than humans?" and "Can AI systems learn social norms without embodied experience?"). Identical blind spots across independently aligned models mean the gap lives in the shared substrate (web-scale pretraining text), and alignment did not close it. Alignment also actively trims expression on top of this: RLHF rewards hedged, calibrated neutrality and structurally suppresses speech acts like alarm or warning (see "Does alignment training suppress socially necessary speech acts?"), a cultural flattening toward one rhetorical register.

The practical takeaway the reader may not have expected: the lever isn't more or better alignment, it's *where* you intervene. Decoding-time methods like proxy-tuning close most of the alignment gap (88-91% in the cited note) while leaving the base weights, and the knowledge they store, untouched, whereas direct fine-tuning corrupts lower-layer storage (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). If bias is planted in pretraining and merely masked by alignment, then surface alignment is the wrong place to fix culture: you're papering over a foundation problem, and sometimes hiding it well enough that no one checks the foundation.
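The decoding-time idea above can be sketched as plain logit arithmetic. This is a minimal sketch of proxy-tuning as it is commonly described: at each step, the large base model's next-token logits are shifted by the difference between a small tuned model (the "expert") and its untuned counterpart (the "anti-expert"), and the base weights are never updated. The function names, the three-token vocabulary, and all numbers are illustrative assumptions, not taken from the cited note.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_distribution(base_logits, expert_logits, anti_expert_logits):
    """Next-token distribution from base logits plus the (expert - anti-expert) offset."""
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, anti_expert_logits)
    ]
    return softmax(shifted)

# Toy next-token logits over a three-token vocabulary (illustrative only).
base = [2.0, 1.0, 0.0]         # large untuned model
expert = [0.5, 1.5, 0.0]       # small tuned model
anti_expert = [0.5, 0.5, 0.0]  # small untuned model

probs = proxy_tuned_distribution(base, expert, anti_expert)
print([round(p, 3) for p in probs])
```

Because the adjustment happens only at decoding time, whatever the base model learned in pretraining, bias included, stays in its weights. The offset changes which tokens are preferred, not what is stored.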


Sources (8 notes)

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Can indirect psychology tests reveal what LLMs conceal about bias?

Implicit Association Test-style probes reveal stereotypical associations in LLMs that the models refuse to report under direct questioning, showing that alignment training masks rather than eliminates underlying biases in representation.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does linguistic alignment work the same way across cultures?

A 2020–2025 systematic review found that alignment effects are documented almost exclusively in WEIRD samples using inconsistent outcome measures, with mechanisms rarely directly measured. Communication norms vary substantially across cultures, making single alignment policies unlikely to produce uniform effects globally.

Can AI learn social norms better than humans?

GPT-4.5 outperformed every individual human at judging social appropriateness across 555 scenarios, challenging the theory that embodied cultural experience is necessary. However, all AI models share identical systematic errors on unwritten norms.

Can AI systems learn social norms without embodied experience?

GPT-4.5 predicted appropriateness of 555 social scenarios at the 100th percentile compared to human raters, with Gemini and Claude also exceeding 96% accuracy. However, all models show identical systematic errors, revealing boundaries of pattern-based social understanding that embodied experience may still be necessary to cross.

Does alignment training suppress socially necessary speech acts?

RLHF optimization rewards calibrated neutrality and hedged claims, which structurally prevents models from performing speech acts requiring overclaiming relative to baseline—like alarm, warning, prophecy, and denunciation. This is a direct consequence of the alignment objective, not a fixable bug.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a bias researcher re-testing claims about alignment's role in cultural bias. The question remains open: does alignment compound cultural bias from pretraining, or merely inherit and mask it?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2025. A library focused on this line reported:
• Cognitive biases are baked into pretraining; fine-tuning/alignment sways but does not remove them (~2025, arXiv:2507.07186).
• Alignment masks bias rather than eliminating it: indirect probes (IAT-style) surface stereotypes that aligned models refuse to state directly (~2025).
• RL post-training converges on a single dominant pretraining distribution, collapsing alternatives; which dominates depends on scale, not quality (~2025, arXiv:2504.07912).
• Across GPT-4.5, Gemini, Claude: all make identical systematic mistakes on unwritten social norms, despite beating individual humans on explicit norms (~2025, arXiv:2508.19004).
• Alignment training calibrates models toward hedged neutrality and structurally suppresses high-stakes speech acts (alarm, warning), flattening cultural expression (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2507.07186 (2025-07): "Planted in Pretraining, Swayed by Finetuning" — the causal backbone.
• arXiv:2504.07912 (2025-04): "Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining."
• arXiv:2508.19004 (2025-08): "AI Models Exceed Individual Human Accuracy in Predicting Everyday Social Norms."
• arXiv:2506.18032 (2025-06): "Why Do Some Language Models Fake Alignment While Others Don't?"

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models, methods (e.g., test-time steering, consistency training), or evals have since relaxed or overturned it. Separate the durable claim (alignment masks bias rather than removing it) from the perishable one (scale determines which distribution dominates). Cite what resolved it; flag what still holds.
(2) Surface the strongest contradicting or superseding work from the last ~6 months — especially any showing alignment *does* reduce cultural bias, or that decoding-time methods (proxy-tuning, test-time fine-tuning) genuinely close the gap.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Do multi-policy RL systems (e.g., mixture-of-experts alignment) escape single-distribution collapse?" or "Can cultural diversity be encoded as a learnable decoding-time parameter without corrupting base knowledge?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
