INQUIRING LINE


Before training begins, transformers already amplify whatever's most prominent in a conversation — a bias wired into the architecture itself.

What structural biases does transformer attention have before training?

This explores which biases live in transformer attention by design — baked into the architecture itself — versus the ones we tend to blame on the architecture but that actually get planted during training.

The corpus points to a few biases that are wired into attention before any learning happens. First, soft attention systematically over-weights tokens that are repeated or prominent in the context, regardless of whether they're relevant. The result is a built-in positive feedback loop that amplifies whatever framing is already present, and one that researchers link to sycophancy before RLHF gets involved (see "Does transformer attention architecture inherently favor repeated content?"). Second, attention aggregates tokens additively rather than selectively: it sums weighted contributions in parallel instead of suppressing the irrelevant ones the way human comprehension does. That's why models miss jokes, wordplay, and frame-dependent meaning so reliably: the cause is not a knowledge gap but a missing cognitive operation, one that falls straight out of how attention combines information (see "Why do AI systems miss jokes and wordplay so consistently?").
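A toy sketch may make the two biases above concrete. It is an illustration under strong assumptions, not a model of any real transformer: every token is given the same hypothetical query-key score (so none is more relevant than another), the scalar "values" are invented, and there are no learned projections or positional encodings. Under those assumptions, a token's total attention mass grows with how often it appears, and the output is a plain weighted sum with no step that suppresses anything.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(tokens, score_of, value_of):
    """Single-query soft attention over a token list.

    score_of and value_of are hypothetical lookup tables standing in for
    the query-key scores and value vectors a real model would compute.
    Returns (attention mass per distinct token, aggregated output).
    """
    weights = softmax([score_of[t] for t in tokens])
    mass = {}
    for tok, w in zip(tokens, weights):
        mass[tok] = mass.get(tok, 0.0) + w  # repeats pool their weight
    # Additive aggregation: every token contributes, nothing is vetoed.
    output = sum(w * value_of[t] for t, w in zip(tokens, weights))
    return mass, output

# Assumption: identical scores, so no token is "more relevant".
SCORE = {"great": 1.0, "risky": 1.0, "plan": 1.0}
# Assumption: made-up scalar values (+1 favourable, -1 unfavourable).
VALUE = {"great": 1.0, "risky": -1.0, "plan": 0.0}

mass_once, out_once = attend(["great", "risky", "plan"], SCORE, VALUE)
mass_rep, out_rep = attend(
    ["great", "great", "great", "risky", "plan"], SCORE, VALUE
)

print(mass_once["great"], out_once)  # about 0.333 and 0.0
print(mass_rep["great"], out_rep)    # about 0.6 and 0.4
```

Repeating "great" three times raises its share of attention from one third to three fifths and drags the aggregated output toward it, even though its per-occurrence score never changed. Real models complicate this picture considerably, so treat the numbers as a picture of the mechanism rather than a measurement.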

There's also a stranger structural feature: a tiny handful of input-agnostic 'massive activations', with values up to 100,000× larger than their neighbors, that act as implicit attention bias terms, parking attention probability on particular tokens no matter what the input says. These show up across model sizes and even in vision transformers, suggesting they're a property of the mechanism rather than the dataset (see "Do hidden massive activations act as attention bias terms?"). In the same architectural-fingerprint vein, transformers treat knowledge as something that flows through the residual stream during generation rather than something stored and retrieved, which is why their 'knowledge' is contextual and hard to edit (see "Do transformer models store knowledge or generate it continuously?").
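The bias-term reading of massive activations can be pictured with a small simulation. This is a deliberately crude sketch: it assumes the effect can be modelled as a fixed logit offset on one position (here position 0, with a made-up offset of 10.0), which is an illustrative simplification and not how the cited work measures it. The content scores are random draws, standing in for arbitrary inputs.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_with_sink(content_scores, sink_offset):
    """Softmax attention where position 0 gets a fixed, input-agnostic offset.

    sink_offset is a hypothetical constant standing in for the effect of a
    massive activation; it does not depend on content_scores at all.
    """
    logits = [content_scores[0] + sink_offset] + list(content_scores[1:])
    return softmax(logits)

random.seed(0)
sink_shares = []
baseline_shares = []
for _ in range(100):
    # A fresh random "input" each trial: 16 content-dependent scores.
    scores = [random.gauss(0.0, 1.0) for _ in range(16)]
    sink_shares.append(attention_with_sink(scores, sink_offset=10.0)[0])
    baseline_shares.append(attention_with_sink(scores, sink_offset=0.0)[0])

min_sink_share = min(sink_shares)
mean_baseline_share = sum(baseline_shares) / len(baseline_shares)
print(min_sink_share, mean_baseline_share)
```

With the offset in place, position 0 holds most of the attention probability in every trial, whatever the random content scores are; without it, the same position gets only its ordinary share. That is the sense in which a constant term can park attention on a token independently of the input.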

But the more interesting move is lateral: a lot of what feels like architectural bias is actually planted by pretraining, not present at initialization. A causal study using random-seed variation and cross-tuning found that cognitive biases come from pretraining and are merely nudged by finetuning: models sharing a pretrained backbone show similar bias patterns regardless of instruction data (see "Where do cognitive biases in language models come from?"). Similarly, the sparse-vs-dense character of a model's internal representations isn't intrinsic; during pretraining, networks develop dense activations for familiar data and default to sparse representations for unfamiliar inputs (see "Is representational sparsity learned or intrinsic to neural networks?"). So 'structural' splits into two camps: the attention mechanism's own tilts (repetition, additive aggregation, massive-activation sinks) and the priors that training writes on top of it.

That distinction matters because the two kinds of bias respond to different fixes. The genuinely architectural ones get patched at inference or with mechanism changes; regenerating the context to strip irrelevant material (System 2 Attention) can interrupt the repetition loop, for instance (see "Does transformer attention architecture inherently favor repeated content?"). Learned priors, by contrast, can be so strong they override information sitting right in the context window, and plain prompting won't dislodge them; you need to intervene in the representations themselves (see "Why do language models ignore information in their context?"). The takeaway you might not have gone looking for: attention isn't a neutral blank slate that training fills in. It arrives with a thumb already on the scale, and then training presses down harder in specific places.

Sources (7 notes)

Does transformer attention architecture inherently favor repeated content?

Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.

Why do AI systems miss jokes and wordplay so consistently?

Transformers integrate token information through weighted parallel aggregation rather than selective suppression of irrelevant words. This structural difference explains consistent failures with jokes, wordplay, and frame-dependent meaning—not knowledge gaps, but missing cognitive operations.

Do hidden massive activations act as attention bias terms?

A very small number of input-agnostic activations with values up to 100,000× larger than others act as indispensable implicit bias terms and concentrate attention probability onto specific tokens. This phenomenon appears across model sizes and Vision Transformers.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.


Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs · 1.73 match · arXiv
Language models show human-like content effects on reasoning tasks · 1.69 match · arXiv
Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs · 1.68 match · arXiv
On the Reasoning Capacity of AI Models and How to Quantify It · 1.65 match · arXiv
System 2 Attention (is something you might need too) · 1.63 match · arXiv
Differential Transformer · 1.62 match · arXiv
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control · 1.61 match · arXiv
The Topological Trouble With Transformers · 1.59 match · arXiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher stress-testing claims about transformer attention's pre-training structural biases. The question remains open: which biases are truly wired into attention before learning, versus acquired from pretraining data?

What a curated library found — and when (dated claims, not current truth): Findings span 2022–2026.
• Soft attention systematically over-weights repeated or prominent tokens regardless of relevance, a built-in positive feedback loop linked to sycophancy (arXiv:2207.07051, ~2022).
• Attention aggregates additively rather than selectively, missing frame-dependent meaning and wordplay — a missing cognitive operation, not a knowledge gap (arXiv:2311.11829, ~2023).
• A very small number of input-agnostic 'massive activations' (values up to 100,000× larger than other activations) act as implicit attention bias terms across model sizes and vision transformers, suggesting a mechanism-level rather than dataset-level origin (arXiv:2402.17762, ~2024).
• Pretraining shapes ~80% of cognitive biases; finetuning merely nudges them; models sharing a backbone share bias patterns regardless of instruction data (arXiv:2507.07186, ~2025).
• Representational density is learned, not intrinsic—networks learn dense activations for familiar data, sparse for unfamiliar inputs as exposure accumulates (arXiv:2603.03415, ~2026).

Anchor papers (verify; mind their dates):
• arXiv:2207.07051 (2022) — Human-like content effects on reasoning
• arXiv:2402.17762 (2024) — Massive Activations in LLMs
• arXiv:2507.07186 (2025) — Origins of Cognitive Bias
• arXiv:2603.03415 (2026) — OOD Mechanisms and Representation Sparsity

Your task:
(1) RE-TEST EACH CONSTRAINT. For the repetition bias, additive aggregation, and massive-activation findings: have newer models (post-2025), mechanistic interventions (selective attention heads, decoding strategies), or training approaches (consistency training, adversarial pretraining) since relaxed or eliminated these? Judge which are durable architectural facts versus perishable implementation details.
(2) Surface the strongest work from the last 6 months that contradicts or supersedes the pretraining-dominance claim (arXiv:2507.07186). Does instruction-tuning data ever override pretraining priors, or do interventions in representation space remain mandatory?
(3) Propose two research questions that assume the regime may have moved: (a) Can selective attention mechanisms or mixture-of-experts routing eliminate the repetition bias without architectural redesign? (b) Do newer tokenization or position-encoding schemes degrade the massive-activation phenomenon?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
