What structural biases does transformer attention have before training?
This explores which biases live in transformer attention by design — baked into the architecture itself — versus the ones we tend to blame on the architecture but that actually get planted during training.
The corpus points to a few biases that really are structural. First, soft attention systematically over-weights tokens that are repeated or prominent in the context, regardless of whether they are relevant. This is a built-in positive feedback loop that amplifies whatever framing is already present, and researchers link it to sycophancy that appears before RLHF gets involved (see: Does transformer attention architecture inherently favor repeated content?). Second, attention aggregates tokens additively rather than selectively: it sums weighted contributions in parallel instead of suppressing the irrelevant ones the way human comprehension does. That is why models miss jokes, wordplay, and frame-dependent meaning so reliably. The cause is not a knowledge gap but a missing cognitive operation, one that falls straight out of how attention combines information (see: Why do AI systems miss jokes and wordplay so consistently?).
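Both effects can be seen in a toy softmax, sketched here under strong simplifying assumptions: scores are hand-picked scalars rather than learned query-key products, repeated mentions receive identical scores, and values are single numbers. The function names (`attention_mass`, `attend`) are illustrative and not taken from any library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_mass(scores, labels):
    """Total attention weight received by each distinct piece of content."""
    mass = {}
    for weight, label in zip(softmax(scores), labels):
        mass[label] = mass.get(label, 0.0) + weight
    return mass

def attend(scores, values):
    """Attention output for scalar values: a weighted sum over every token."""
    return sum(w * v for w, v in zip(softmax(scores), values))

# Every token gets the same raw score, so no mention is "more relevant".
once = attention_mass([1.0, 1.0], ["claim", "fact"])
thrice = attention_mass([1.0, 1.0, 1.0, 1.0], ["claim", "claim", "claim", "fact"])

# A low-scoring distractor still leaks into the output, because softmax
# weights are strictly positive: aggregation adds, it never fully suppresses.
clean = attend([2.0], [1.0])
with_distractor = attend([2.0, -2.0], [1.0, 10.0])

print(once, thrice, clean, with_distractor)
```

With equal scores, repeating the claim three times moves its share of attention from one half to three quarters purely through the softmax normalization, and the distractor shifts the output even though its score is far lower. Real models add positional information and learned projections, so this shows the direction of the effect, not its size.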
There is also a stranger structural feature: a tiny handful of input-agnostic 'massive activations', with values up to 100,000× larger than their neighbors, that act as implicit attention bias terms, parking attention probability on particular tokens no matter what the input says. These show up across model sizes and even in vision transformers, suggesting they are a property of the mechanism rather than the dataset (see: Do hidden massive activations act as attention bias terms?). In the same architectural-fingerprint vein, transformers treat knowledge as something that flows through the residual stream during generation rather than something stored and retrieved, which is why their 'knowledge' is contextual and hard to edit (see: Do transformer models store knowledge or generate it continuously?).
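To make "implicit attention bias term" concrete, here is a minimal sketch that models the effect as a single sink token whose attention logit is a constant, independent of the input. That modeling choice is an assumption made for illustration: the source describes massive activations in hidden states, not a literal constant logit, and `SINK_LOGIT` is a hypothetical value, not a measurement.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical constant offset standing in for the effect of a massive
# activation; chosen for illustration only.
SINK_LOGIT = 6.0

def attention_with_sink(content_scores):
    """Prepend a sink token whose logit ignores the input entirely."""
    return softmax([SINK_LOGIT] + list(content_scores))

# Two unrelated inputs with different content scores.
input_a = [0.3, -1.2, 0.8]
input_b = [1.5, 0.1, -0.4]

sink_a = attention_with_sink(input_a)[0]
sink_b = attention_with_sink(input_b)[0]

print(f"sink share on input A: {sink_a:.3f}")
print(f"sink share on input B: {sink_b:.3f}")
```

In this toy, the sink token absorbs most of the attention probability for both inputs, which is the sense in which a fixed, input-agnostic term behaves like a bias on the attention distribution.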
But the more interesting move is lateral: a lot of what feels like architectural bias is actually planted by pretraining, not present at initialization. A causal study using random-seed variation and cross-tuning found that cognitive biases come from pretraining and are merely nudged by finetuning: models sharing a backbone share bias patterns regardless of instruction data (see: Where do cognitive biases in language models come from?). Similarly, the sparse-vs-dense character of a model's internal representations isn't intrinsic; as exposure accumulates, networks learn dense activations for familiar data and stay sparse for unfamiliar inputs (see: Is representational sparsity learned or intrinsic to neural networks?). So what looks 'structural' splits into two camps: the attention mechanism's own tilts (repetition, additive aggregation, massive-activation sinks) and the priors that training writes on top of it.
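The logic of the cross-tuning comparison can be sketched as follows. Every number below is a synthetic placeholder invented to show what "clusters by backbone" would look like; none of it comes from the study, and the distance measure, field names, and `mean_distance_within` helper are assumptions made for this illustration.

```python
import math
from itertools import combinations

# SYNTHETIC placeholder data: one bias-score vector per finetuned model.
# Backbones A/B and instruction sets X/Y are hypothetical labels.
models = [
    {"backbone": "A", "tuning": "X", "bias": [0.80, 0.20, 0.60]},
    {"backbone": "A", "tuning": "Y", "bias": [0.75, 0.25, 0.55]},
    {"backbone": "B", "tuning": "X", "bias": [0.30, 0.70, 0.10]},
    {"backbone": "B", "tuning": "Y", "bias": [0.35, 0.65, 0.15]},
]

def mean_distance_within(models, key):
    """Average Euclidean distance between models sharing the given attribute."""
    dists = [
        math.dist(a["bias"], b["bias"])
        for a, b in combinations(models, 2)
        if a[key] == b[key]
    ]
    return sum(dists) / len(dists)

# If biases are planted in pretraining, models sharing a backbone should sit
# closer together than models sharing only their instruction data.
same_backbone = mean_distance_within(models, "backbone")
same_tuning = mean_distance_within(models, "tuning")

print(f"mean distance, same backbone: {same_backbone:.3f}")
print(f"mean distance, same tuning:   {same_tuning:.3f}")
```

The placeholder vectors were built so that backbone dominates, which is the pattern the study reports. With real measurements, the same comparison would be the thing under test rather than a foregone conclusion.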
That distinction matters because the two kinds of bias respond to different fixes. The genuinely architectural ones get patched at inference or with mechanism changes: regenerating context to strip irrelevant material can interrupt the repetition loop, for instance (see: Does transformer attention architecture inherently favor repeated content?). Learned priors, by contrast, can be so strong they override information sitting right in the context window, and plain prompting won't dislodge them; you need to intervene in the representations themselves (see: Why do language models ignore information in their context?). The takeaway you might not have gone looking for: attention isn't a neutral blank slate that training fills in. It arrives with a thumb already on the scale, and then training presses down harder in specific places.
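The context-regeneration fix mentioned above (the source notes call it System 2 Attention) amounts to a two-pass pipeline: rewrite the context first, then answer from the rewrite only. A minimal sketch, assuming a generic `llm` callable that maps a prompt string to a completion; the prompt wording and the `system2_attention` function are illustrative assumptions, not the original method's text or API.

```python
from typing import Callable, List

# Hypothetical prompt wording, written for this sketch.
REWRITE_PROMPT = (
    "Rewrite the text below, keeping only material that is relevant and "
    "neutral for answering the question it contains. Drop opinions and "
    "repeated framing.\n\nText: {context}"
)
ANSWER_PROMPT = "Context: {context}\n\nAnswer the question in the context."

def system2_attention(context: str, llm: Callable[[str], str]) -> str:
    """Two-pass answer: regenerate the context, then answer from the rewrite."""
    cleaned = llm(REWRITE_PROMPT.format(context=context))
    # The second call never sees the original context, only the rewrite,
    # so repeated or leading material cannot attract attention again.
    return llm(ANSWER_PROMPT.format(context=cleaned))

# Stub standing in for a real model so the sketch runs offline.
calls: List[str] = []

def stub_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"response {len(calls)}"

answer = system2_attention("I think the answer is 5. What is 2 + 2?", stub_llm)
print(answer)
```

The design choice that matters is the isolation between passes: the answering call is built only from the regenerated text. Whether the rewrite actually removes the biasing material depends on the model doing the rewriting.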
Sources (7 notes)
Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.
Transformers integrate token information through weighted parallel aggregation rather than selective suppression of irrelevant words. This structural difference explains consistent failures with jokes, wordplay, and frame-dependent meaning—not knowledge gaps, but missing cognitive operations.
A very small number of input-agnostic activations with values up to 100,000× larger than others act as indispensable implicit bias terms and concentrate attention probability onto specific tokens. This phenomenon appears across model sizes and Vision Transformers.
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.
A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.