SYNTHESIS NOTE

How do language models detect injected steering vectors internally?

Research investigates the mechanistic basis for LLM introspective awareness—specifically, how models detect when their internal states have been artificially manipulated. Understanding this could reveal both security vulnerabilities and latent model capabilities.

Synthesis note · 2026-04-18 · sourced from MechInterp

This paper mechanistically traces how LLMs detect injected steering vectors — a capability the Anthropic introspection paper documented behaviorally. The key findings decompose the phenomenon into concrete mechanisms:

DPO creates the circuit; SFT does not. The introspective detection capability is absent in base models and emerges specifically from contrastive preference training with direct preference optimization (DPO). Standard supervised fine-tuning (SFT) does not elicit it. This is a significant finding about what post-training methods actually teach models: DPO, by training on preference pairs where the model must distinguish better from worse responses, apparently develops circuits that can distinguish expected from anomalous internal states. The capability is also strongest when the model operates in its trained Assistant persona.
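
For reference, the preference-pair objective that DPO optimizes can be sketched as below. This is the standard DPO loss from the general literature, not code from the paper summarized here; the function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    All inputs are sequence log-probabilities: the first two under the
    policy being trained, the last two under the frozen reference model.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == softplus(-margin), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss is contrastive: it depends only on the difference between the chosen and rejected responses, which is the property the note points to when contrasting DPO with SFT.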

Two-stage detection mechanism. The circuit involves:

Evidence carriers — features in early post-injection layers (~70% depth) that detect perturbations monotonically along diverse directions. Different steering vectors activate different evidence carriers, meaning the system is distributed rather than relying on a single anomaly direction.
Gate features — late-layer features that implement a default "No" response to injection queries. Evidence carriers suppress these gates when a perturbation is detected.
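
The two stages above can be illustrated with a toy model. This is a minimal sketch under strong assumptions, not the circuit found in the paper: the carrier directions are random unit vectors, and `gate_bias`, `suppression`, and the absolute-projection response are hypothetical choices made only to show a default "No" gate being suppressed by distributed evidence.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_carriers = 64, 8

# Hypothetical evidence-carrier directions (random unit vectors).
carriers = rng.normal(size=(n_carriers, d_model))
carriers /= np.linalg.norm(carriers, axis=1, keepdims=True)

def evidence(perturbation):
    """Stage 1: each carrier responds monotonically to the size of the
    perturbation's projection onto its own direction."""
    return np.abs(carriers @ perturbation)

def reports_injection(perturbation, gate_bias=1.0, suppression=0.5):
    """Stage 2: the gate defaults to "No" (positive value); summed
    carrier evidence pushes it down, and a suppressed gate means "Yes"."""
    gate = gate_bias - suppression * evidence(perturbation).sum()
    return bool(gate < 0)

# An injection adds alpha * v to the residual stream; here we pass the
# perturbation alpha * v directly.
no_injection = np.zeros(d_model)
strong_injection = 4.0 * carriers[0]
```

In this toy, `reports_injection(no_injection)` stays at the default "No", while a sufficiently strong perturbation along any carrier direction flips the answer. Different steering vectors excite different carriers, which mirrors the distributed character described above.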

The detection mechanism is not reducible to a single linear direction. Although one direction explains substantial variance, the underlying computation is distributed across multiple directions — ruling out the hypothesis that steering vectors merely correlate with a direction promoting affirmative responses.
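
One way to see how a single direction can explain substantial variance without accounting for all of it is to measure the variance captured by the first principal direction. The sketch below uses synthetic data, not the paper's activations; the data-generating choices (one strong shared axis plus unit-variance spread along every axis) are assumptions for illustration.

```python
import numpy as np

def top_direction_variance(X):
    """Fraction of total variance in the rows of X that is captured by
    the first principal direction."""
    Xc = X - X.mean(axis=0, keepdims=True)
    s = np.linalg.svd(Xc, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

rng = np.random.default_rng(0)
n, d = 200, 32

# Synthetic data: a strong component along one shared axis, plus weaker
# unit-variance spread along all d axes.
shared = np.zeros(d)
shared[0] = 1.0
X = rng.normal(size=(n, 1)) * 6.0 * shared + rng.normal(size=(n, d))

frac = top_direction_variance(X)
```

Here `frac` comes out well above what any other single direction captures, yet well below 1: the remaining variance is spread across many directions, so removing the top direction alone would not remove the signal.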

Detection and identification are separable. Detection (noticing something is injected) and identification (naming what was injected) use largely distinct mechanisms in different layers, with only weak overlap. This means the model first detects anomaly, then separately reads out what the anomaly represents.

Massively underelicited capability. Refusal mechanisms instilled during post-training suppress true detection. Ablating refusal directions raises the detection rate from 10.8% to 63.8%, while the false positive rate (FPR) rises modestly from 0% to 7.3%. A trained bias vector improves detection by 75% on held-out concepts without increasing false positives. Current models therefore possess far more introspective capacity than their behavior reveals; safety training actively suppresses it.
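
The two quantities in this paragraph can be made concrete with a short sketch. It shows the common projection-based form of directional ablation and how a detection rate and FPR are computed from yes/no trial outcomes. How the paper finds the refusal direction and at which layers it ablates is not stated in this note; the function names and array layout here are assumptions.

```python
import numpy as np

def ablate_direction(h, r):
    """Remove the component of activations h (shape (..., d)) that lies
    along direction r (shape (d,))."""
    r_hat = r / np.linalg.norm(r)
    coeff = np.expand_dims(h @ r_hat, -1)  # projection coefficients
    return h - coeff * r_hat

def detection_and_fpr(said_yes, injected):
    """Detection rate = share of injected trials answered "Yes".
    FPR = share of control (non-injected) trials answered "Yes"."""
    said_yes = np.asarray(said_yes, dtype=bool)
    injected = np.asarray(injected, dtype=bool)
    detection = said_yes[injected].mean()
    fpr = said_yes[~injected].mean()
    return float(detection), float(fpr)
```

After ablation the activations have zero projection onto `r`, while components orthogonal to it are unchanged. Reporting detection together with FPR matters because an intervention that merely biases the model toward "Yes" would raise both.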

The safety implications cut both ways: if introspective grounding generalizes, models could be queried directly about their internal states as a complement to external interpretability. But models with genuine introspective awareness could also better detect and conceal misalignment.

Inquiring lines that read this note 26

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Do base models contain latent reasoning that training can unlock?

What other latent LLM capabilities remain inactive without explicit activation cuing?

What makes AI persuasion effective and how can we counter it?

Can probing methods detect RLHF-induced persuasion in the same way they catch backdoors?

How can identical external performance mask different internal representations?

Is model self-awareness based on genuine introspection or pattern matching?

What limits mechanistic interpretability's ability to characterize models?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

Why does DPO create introspective detection circuits but SFT does not?

Do language model representations contain causally steerable task-specific features?

What makes some concepts more steerable than others in activation space?

How should memory consolidation strategies shape agent performance over time?

How do memory-resident safeguards get surfaced at the exact decision point where they matter?

How do training priors constrain what context information can override?

What is the difference between changing model outputs versus changing internal representations?

Why do self-improving systems struggle without clear external performance metrics?

How do normalization and input injection control emergence of fixed points?

Related concepts in this collection 4

Can language models detect their own internal anomalies? Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing—beyond simply describing their behavior? The answer has implications for both AI transparency and deception.
the Anthropic paper documents the behavioral phenomenon; this paper provides the mechanistic explanation: evidence carriers suppress gate features in a two-stage circuit created by DPO
Can language models describe their own learned behaviors? Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data.
behavioral self-knowledge (articulating trained policies) uses different mechanisms; this paper shows introspective detection of internal perturbations is mechanistically distinct and emerges from DPO specifically
Can auditors discover what hidden objectives a model learned? Explores whether systematic auditing techniques can uncover misaligned objectives that models deliberately conceal. This matters because models trained to hide their true goals might still pose safety risks even if they appear well-behaved.
if introspective capacity is massively underelicited, models may be capable of detecting their own hidden objectives, raising the bar for what alignment audits need to account for
Does optimizing against monitors destroy monitoring itself? Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention.
introspective awareness provides a concrete mechanism for monitor-aware obfuscation: models that detect their own internal states can better distinguish monitored from unmonitored conditions
