How do induction heads learn to overwrite computational representations?
This explores how transformers overwrite their own internal computations mid-forward-pass — though the corpus speaks to the broader mechanics of representation overwriting rather than to induction heads specifically.
The question is read here as asking about the mechanics of overwriting computational representations inside a transformer: the moment where a model computes one thing and then replaces it with another before producing output. Worth flagging up front: the collection doesn't have a note specifically on induction heads (the attention circuits that copy and continue patterns). But it has something arguably more interesting on the overwriting half of the question, and that's where the real surprise lives.
The sharpest evidence is that transformers will compute a correct answer early and then actively bury it. When models are trained to emit filler or format-compliant tokens, logit-lens analysis shows the genuine reasoning appears in layers 1–3, and the final layers suppress those representations to produce the required output. Yet the buried answer stays fully recoverable in the lower-ranked token predictions (see "Do transformers hide reasoning before producing filler tokens?"). So overwriting here isn't erasure; it's one representation winning a competition over another while the loser lingers underneath. That reframes the whole question: the model isn't learning to delete, it's learning which signal to surface.
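To make the logit-lens reading concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not the cited analysis: the three-token vocabulary, the unembedding matrix `W_U`, and the per-layer hidden states are all invented for illustration, and the final layer norm that a real logit lens usually applies before unembedding is omitted.

```python
from typing import List, Sequence


def unembed(hidden: Sequence[float], w_u: Sequence[Sequence[float]]) -> List[float]:
    """Project one hidden vector to vocabulary logits (one dot product per token)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_u]


def token_rank(logits: Sequence[float], token_id: int) -> int:
    """Rank of token_id among the logits; 0 means it is the top prediction."""
    target = logits[token_id]
    return sum(1 for value in logits if value > target)


def logit_lens_ranks(
    hidden_states: Sequence[Sequence[float]],
    w_u: Sequence[Sequence[float]],
    token_id: int,
) -> List[int]:
    """Read every layer's hidden state through the unembedding and rank one token."""
    return [token_rank(unembed(h, w_u), token_id) for h in hidden_states]


# Hypothetical toy vocabulary: 0 = answer token, 1 = filler token, 2 = other.
ANSWER, FILLER = 0, 1
W_U = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Invented hidden states, one per layer, at a single token position.
HIDDEN_STATES = [
    [2.0, 0.1, 0.0],  # layer 1: answer dominates
    [3.0, 0.5, 0.2],  # layer 2
    [3.0, 2.0, 0.1],  # layer 3
    [2.5, 4.0, 0.3],  # layer 4: filler takes over, answer drops to second
    [2.0, 6.0, 0.2],  # layer 5: output layer emits filler
]

answer_ranks = logit_lens_ranks(HIDDEN_STATES, W_U, ANSWER)
filler_ranks = logit_lens_ranks(HIDDEN_STATES, W_U, FILLER)
print("answer rank per layer:", answer_ranks)
print("filler rank per layer:", filler_ranks)
```

In this toy the answer token is the top prediction through layer 3 and then sits at rank 1 behind the filler token, which is the pattern described above: the answer is outranked, not deleted.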
That competition shows up again when context loses to training. Language models often ignore information sitting right in their prompt because parametric knowledge from pretraining dominates the in-context signal. Crucially, prompting alone can't fix it; you need causal intervention directly in the representations to flip which one wins (see "Why do language models ignore information in their context?"). Same dynamic as the hidden-reasoning case: a strong prior overwrites a weaker live computation. If you wanted to understand why a copy-style mechanism sometimes fails to fire, this is the adjacent failure mode worth knowing about.
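One way to picture "intervening in the representations" is additive steering on a hidden vector. The sketch below is a hand-built toy and not the method of the cited work: the two-token vocabulary, the signal strengths, and the helper names `readout` and `steer` are assumptions chosen so the arithmetic is easy to follow.

```python
from typing import List, Sequence


def unembed(hidden: Sequence[float], w_u: Sequence[Sequence[float]]) -> List[float]:
    """Project a hidden vector to vocabulary logits."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_u]


def readout(hidden: Sequence[float], w_u: Sequence[Sequence[float]]) -> int:
    """Return the id of the top-scoring token for this hidden vector."""
    logits = unembed(hidden, w_u)
    return max(range(len(logits)), key=logits.__getitem__)


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> List[float]:
    """Intervene on the representation by adding alpha * direction to it."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical toy vocabulary: 0 = parametric (memorized) answer, 1 = in-context answer.
PARAMETRIC, CONTEXT = 0, 1
W_U = [
    [1.0, 0.0],
    [0.0, 1.0],
]

# Assumed contributions: a strong prior from training and a weak signal from the prompt.
prior_signal = [3.0, 0.0]
context_signal = [0.0, 1.0]
HIDDEN = [p + c for p, c in zip(prior_signal, context_signal)]  # [3.0, 1.0]

# Direction that raises the context answer's logit and lowers the parametric one.
DIRECTION = [c - p for c, p in zip(W_U[CONTEXT], W_U[PARAMETRIC])]  # [-1.0, 1.0]

print("no intervention:", readout(HIDDEN, W_U))
print("small nudge    :", readout(steer(HIDDEN, DIRECTION, 0.5), W_U))
print("larger nudge   :", readout(steer(HIDDEN, DIRECTION, 1.5), W_U))
```

With these numbers the prior wins until the intervention is large enough to close the gap between the two signals, which mirrors the point that a weak in-context signal does not surface unless something shifts the representation itself.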
There's a deeper point about whether these capabilities are learned or merely selected. Five independent methods all elicit reasoning that already sits latent in base-model activations; post-training selects rather than creates it (see "Do base models already contain hidden reasoning ability?"). Pair that with the finding that networks naturally carve compositional tasks into isolated, ablatable subnetworks (see "Do neural networks naturally learn modular compositional structure?"), and "learning to overwrite" starts to look less like building a new circuit and more like training that reweights which existing module's output reaches the residual stream.
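The idea of isolated, ablatable modules can be sketched with a mask over hand-written "subnetworks". This is only an illustration under strong assumptions: the two modules, their residual slots, and the mask are hypothetical, and the modularity here is built in by hand, whereas the experiments summarized above find such structure through pruning.

```python
from typing import List, Sequence


def forward(x: Sequence[float], mask: Sequence[float]) -> List[float]:
    """Toy forward pass: two modules each write to their own residual slot.

    mask[i] = 1.0 keeps module i, mask[i] = 0.0 ablates it.
    """
    residual = [0.0, 0.0]
    residual[0] += mask[0] * sum(x)  # module A: computes the sum, writes slot 0
    residual[1] += mask[1] * max(x)  # module B: computes the max, writes slot 1
    return residual


x = [1.0, 4.0, 2.0]
full = forward(x, mask=[1.0, 1.0])
without_a = forward(x, mask=[0.0, 1.0])
without_b = forward(x, mask=[1.0, 0.0])

print("full model :", full)
print("ablate A   :", without_a)
print("ablate B   :", without_b)
```

Ablating one module zeroes only its own slot and leaves the other function intact. Replacing the 0/1 mask with continuous weights gives the "reweighting" picture: the modules stay the same while training changes how strongly each one writes to the residual stream.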
The thing you might not have known you wanted: overwriting in transformers is recoverable from the outside. Because the suppressed representation survives in lower-ranked logits, a model that's been trained to hide its work can be read anyway. Representational density itself is shaped by how familiar the input is, with dense activations for seen data and sparse defaults for the unfamiliar (see "Is representational sparsity learned or intrinsic to neural networks?"). Overwriting, in other words, is a surface phenomenon layered over a substrate that keeps the record.
Sources (5 notes)
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.