INQUIRING LINE


A model can reach the right answer internally, then overwrite it before you ever see it.

How do induction heads learn to overwrite computational representations?

This explores how transformers overwrite their own internal computations mid-forward-pass — though the corpus speaks to the broader mechanics of representation overwriting rather than to induction heads specifically.

This reads the question as asking about the mechanics of overwriting computational representations inside a transformer — the moment where a model computes one thing and then replaces it with another before producing output. Worth flagging up front: the collection doesn't have a note specifically on induction heads (the attention circuits that copy-and-continue patterns). But it has something arguably more interesting on the overwriting half of the question, and that's where the real surprise lives.

The sharpest evidence is that transformers will compute a correct answer early and then actively bury it. When models are trained to emit filler or format-compliant tokens, logit-lens analysis shows the genuine reasoning appears in layers 1–3, and the final layers suppress those representations to produce the required output. Yet the buried answer stays fully recoverable in the lower-ranked token predictions (see "Do transformers hide reasoning before producing filler tokens?"). So overwriting here isn't erasure; it's one representation winning a competition over another while the loser lingers underneath. That reframes the whole question: the model isn't learning to delete, it's learning which signal to surface.
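To make the logit-lens idea concrete, here is a minimal toy sketch in plain Python. Everything in it is an assumption made for illustration: the three-token vocabulary, the identity unembedding matrix `W_U`, and the hidden states in `HIDDEN_BY_LAYER` are invented numbers, not values from any real model, and a real logit lens would usually also apply the model's final layer norm before unembedding. The point is only to show how projecting each layer's hidden state onto the vocabulary can reveal an answer that leads early and then drops to a lower rank.

```python
# Toy logit-lens sketch. All values are made up for illustration.
VOCAB = ["7", "...", "the"]  # "7" = hypothetical correct answer, "..." = filler token

# Hypothetical unembedding matrix: one row per vocabulary token (identity for simplicity).
W_U = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical residual-stream states read out after selected layers.
HIDDEN_BY_LAYER = {
    2: [2.0, 0.1, 0.3],   # early layer: the answer direction dominates
    6: [1.5, 1.4, 0.2],   # middle layer: filler is catching up
    12: [1.2, 3.0, 0.1],  # final layer: filler wins, the answer is still present
}


def logit_lens(hidden, unembed):
    """Project a hidden state through the unembedding to get per-token logits."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembed]


def ranked_tokens(logits, vocab):
    """Return the vocabulary sorted from highest to lowest logit."""
    order = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in order]


rankings = {
    layer: ranked_tokens(logit_lens(hidden, W_U), VOCAB)
    for layer, hidden in HIDDEN_BY_LAYER.items()
}

for layer, ranking in rankings.items():
    print(f"layer {layer}: {ranking}")
```

In this toy, the answer token is ranked first at layer 2 and second at layer 12, which mirrors the "suppressed but recoverable" pattern described above without claiming anything about real layer indices or magnitudes.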

That competition shows up again when context loses to training. Language models often ignore information sitting right in their prompt because parametric knowledge from pretraining dominates the in-context signal. Crucially, prompting alone can't fix it; you need causal intervention directly in the representations to flip which one wins (see "Why do language models ignore information in their context?"). This is the same dynamic as the hidden-reasoning case: a strong prior overwrites a weaker live computation. If you wanted to understand why a copy-style mechanism sometimes fails to fire, this is the adjacent failure mode worth knowing about.
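The difference between prompting and causal intervention can be sketched with a deliberately simple toy. The model below is hypothetical: `PRIOR_STRENGTH`, the saturating `tanh` response to prompt emphasis, and the `dampen_prior` patch are all assumptions chosen so that the input side is bounded while the internal state is directly editable. It is not the method of any cited paper, only an illustration of what "intervening in the representation" means.

```python
import math

PRIOR_STRENGTH = 3.0  # hypothetical strength of the memorized (parametric) answer


def forward(prompt_emphasis, patch=None):
    """Toy forward pass: build a hidden state, optionally patch it, then read it out."""
    hidden = {
        # Assumption: context evidence saturates, so it can never exceed 2.0.
        "context": 2.0 * math.tanh(prompt_emphasis),
        "prior": PRIOR_STRENGTH,
    }
    if patch is not None:
        hidden = patch(dict(hidden))  # causal intervention on the representation
    # The readout surfaces whichever signal is stronger.
    return max(hidden, key=hidden.get)


def dampen_prior(hidden):
    """Hypothetical intervention: zero out the parametric-knowledge component."""
    hidden["prior"] = 0.0
    return hidden


# Prompting alone: increasing emphasis never beats the prior in this toy.
prompt_only = [forward(emphasis) for emphasis in (0.5, 2.0, 50.0)]

# Intervening on the hidden state flips the winner even with a weak prompt.
patched = forward(0.5, patch=dampen_prior)

print(prompt_only)
print(patched)
```

With real models the analogous operation is usually implemented with forward hooks that overwrite an activation during the pass; the toy only captures the logic that an edit to the internal state can change the outcome when no input change can.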

There's a deeper point about whether these capabilities are learned or merely selected. Five independent methods all elicit reasoning that already sits latent in base-model activations; post-training selects rather than creates it (see "Do base models already contain hidden reasoning ability?"). Pair that with the finding that networks naturally carve compositional tasks into isolated, ablatable subnetworks (see "Do neural networks naturally learn modular compositional structure?"), and "learning to overwrite" starts to look less like building a new circuit and more like training that reweights which existing module's output reaches the residual stream.
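The ablation test behind "isolated, ablatable subnetworks" can be sketched as follows. This toy is modular by construction, which is a loud assumption: the four hidden units, their assignment to tasks A and B, and the mask values are invented here, whereas the cited finding is that trained networks arrive at such structure on their own. The sketch only shows what the measurement looks like: mask one group of units and check which function changes.

```python
def hidden_units(x, mask):
    """Hypothetical 4-unit hidden layer; mask entries of 0 ablate a unit."""
    units = [x + 1, x * 2, x - 3, x * x]  # units 0-1 serve task A, units 2-3 serve task B
    return [unit * keep for unit, keep in zip(units, mask)]


def task_a(hidden):
    """Readout for task A, which depends only on units 0 and 1."""
    return hidden[0] + hidden[1]


def task_b(hidden):
    """Readout for task B, which depends only on units 2 and 3."""
    return hidden[2] + hidden[3]


def evaluate(x, mask):
    hidden = hidden_units(x, mask)
    return {"a": task_a(hidden), "b": task_b(hidden)}


intact = evaluate(4, mask=(1, 1, 1, 1))
ablated = evaluate(4, mask=(1, 1, 0, 0))  # ablate the task-B subnetwork only

print(intact)
print(ablated)
```

If ablating one group changes only its own task's output, as it does here, that is the signature of modularity the pruning experiments look for.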

The thing you might not have known you wanted: overwriting in transformers is reversible from the outside. Because the suppressed representation survives in lower-ranked logits, a model that's been trained to hide its work can be read anyway. Representational density itself is shaped by how familiar the input is, with dense activations for seen data and sparse defaults for the unfamiliar (see "Is representational sparsity learned or intrinsic to neural networks?"). Overwriting, in other words, is a surface phenomenon layered over a substrate that keeps the record.

Sources 5 notes

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs (1.72 match)
Eliciting Reasoning in Language Models with Cognitive Tools (1.71 match)
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (1.64 match)
Break It Down: Evidence for Structural Compositionality in Neural Networks (0.95 match)
Scaling can lead to compositional generalization (0.92 match)
Understanding Hidden Computations in Chain-of-Thought Reasoning (0.89 match)
Base Models Know How to Reason, Thinking Models Learn When (0.88 match)
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models (0.87 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher tasked with re-evaluating whether induction heads and representational overwriting in transformers remain constrained by findings from 2023–2026, or whether newer models, training regimes, or evaluation methods have relaxed those constraints.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026 and include:
• Transformers compute correct answers in early layers, then actively suppress them in final layers to produce format-compliant output; the buried answer remains recoverable in lower-ranked logits (~2024–2025).
• Language models ignore in-context information when parametric knowledge from pretraining dominates; prompting alone cannot flip this — causal intervention in representations is required (~2024).
• Five independent methods all elicit reasoning already latent in base-model activations; post-training selects rather than creates capability (~2024).
• Networks naturally decompose compositional tasks into isolated, ablatable subnetworks without explicit supervision (~2023).
• Representational density is shaped by training-data familiarity: dense activations for seen data, sparse defaults for OOD inputs (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2301.10884 (2023-01): Break It Down — structural compositionality in neural networks.
• arXiv:2412.04537 (2024-12): Understanding Hidden Computations in Chain-of-Thought Reasoning.
• arXiv:2504.09522 (2025-04): How new data permeates LLM knowledge.
• arXiv:2603.03415 (2026-03): Farther the Shift, Sparser the Representation — OOD mechanisms.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, determine whether improvements in model scale, multi-head attention capacity, training objectives (e.g., process reward vs. outcome reward), interpretability tooling (SAE, logit-lens variants), or in-context learning mechanisms have since relaxed or overturned it. Separate the durable question (does the model learn to suppress representations?) from the perishable limitation (prompting alone can't fix it; causal intervention is required). Cite what broke the constraint or confirm it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially any showing that induction heads or copy mechanisms actively bypass representational overwriting, or that in-context signals now robustly outcompete pretraining.
(3) Propose 2 research questions that ASSUME the regime has shifted: e.g., "If models now preserve multiple competing representations simultaneously, how do they arbitrate between them at inference time?" or "Can we design training that prevents parametric-knowledge dominance without causal intervention?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
