How do language models detect injected steering vectors internally?
Research investigates the mechanistic basis for LLM introspective awareness—specifically, how models detect when their internal states have been artificially manipulated. Understanding this could reveal both security vulnerabilities and latent model capabilities.
This paper mechanistically traces how LLMs detect injected steering vectors — a capability the Anthropic introspection paper documented behaviorally. The key findings decompose the phenomenon into concrete mechanisms:
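To make "injecting a steering vector" concrete, here is a minimal sketch that assumes injection means adding a scaled vector to the hidden state at one layer. The toy layers and the names `inject` and `run_with_injection` are hypothetical illustrations, not the paper's code.

```python
def inject(hidden, steering, alpha):
    """Return hidden + alpha * steering, elementwise (lists of floats)."""
    if len(hidden) != len(steering):
        raise ValueError("hidden state and steering vector must have the same width")
    return [h + alpha * s for h, s in zip(hidden, steering)]


def run_with_injection(layers, x, inject_at, steering, alpha):
    """Apply each layer in order; add the steering vector to the output of
    layer `inject_at` (0-indexed). Pass inject_at=None for a clean run."""
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == inject_at:
            x = inject(x, steering, alpha)
    return x


# Toy stand-in for a model: three layers that each double the state.
toy_layers = [lambda v: [2.0 * a for a in v]] * 3

clean = run_with_injection(toy_layers, [1.0, 0.0], None, [0.0, 1.0], 4.0)
steered = run_with_injection(toy_layers, [1.0, 0.0], 1, [0.0, 1.0], 4.0)
print(clean, steered)
```

In a real transformer the same idea would typically be implemented with a hook on the residual stream. The detection question is whether the model can report that `steered` differs from `clean`.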
DPO creates the circuit, SFT does not. The introspective detection capability is absent in base models and emerges specifically from contrastive preference training, here Direct Preference Optimization (DPO). Standard supervised finetuning (SFT) does not elicit it. This is a significant finding about what post-training methods actually teach models: DPO, by training on preference pairs where the model must distinguish better from worse responses, apparently develops circuits that can distinguish expected from anomalous internal states. The capability is also strongest when the model operates in its trained Assistant persona.
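The contrast between the two objectives can be sketched per example. This assumes the standard DPO loss over summed sequence log-probabilities; the function names and the `beta` value are illustrative, and nothing here reproduces the paper's training setup.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def sft_loss(logp_chosen):
    """SFT: negative log-likelihood of the target response only."""
    return -logp_chosen


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: reward the policy for preferring the chosen response over the
    rejected one by a wider margin than the frozen reference model does."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy that lowers the rejected response's log-probability: lower loss.
improved = dpo_loss(-5.0, -9.0, -5.0, -7.0)
print(baseline, improved)
```

The SFT loss never sees a rejected response, while the DPO loss depends on the difference between two responses. That contrastive signal is the property the note points to as a plausible reason DPO develops circuits for telling expected from anomalous states.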
Two-stage detection mechanism. The circuit involves:
- Evidence carriers — features in early post-injection layers (~70% depth) that detect perturbations monotonically along diverse directions. Different steering vectors activate different evidence carriers, meaning the system is distributed rather than relying on a single anomaly direction.
- Gate features — late-layer features that implement a default "No" response to injection queries. Evidence carriers suppress these gates when a perturbation is detected.
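The two stages above can be caricatured in a few lines. This is a toy sketch under loud assumptions: carriers are modelled as thresholded absolute projections onto hypothetical directions, and the gate as a default "No" activation reduced by summed evidence. The thresholds, directions, and function names are invented for illustration and are not the features found in the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def evidence(perturbation, carrier_directions, threshold=0.5):
    """Stage 1: each carrier responds monotonically to the size of the
    perturbation along its own direction, once past a threshold."""
    return [max(0.0, abs(dot(perturbation, d)) - threshold) for d in carrier_directions]


def gate_activation(evidence_values, default_no=1.0, suppression=1.0):
    """Stage 2: the gate defaults to 'No' and is suppressed by evidence."""
    return max(0.0, default_no - suppression * sum(evidence_values))


def reports_injection(perturbation, carrier_directions):
    """The model answers 'Yes' only when the 'No' gate has been suppressed."""
    return gate_activation(evidence(perturbation, carrier_directions)) < 0.5


# Hypothetical carrier directions; different perturbations hit different ones.
CARRIERS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(reports_injection([0.0, 0.0, 0.0], CARRIERS))   # no perturbation
print(reports_injection([2.0, 0.0, 0.0], CARRIERS))   # caught by carrier 0
print(reports_injection([0.0, 0.0, -2.0], CARRIERS))  # caught by carrier 2
```

The design point is that no single carrier has to cover every steering vector. The gate only needs enough total evidence from whichever carriers respond.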
The detection mechanism is not reducible to a single linear direction. Although one direction explains substantial variance, the underlying computation is distributed across multiple directions — ruling out the hypothesis that steering vectors merely correlate with a direction promoting affirmative responses.
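One way to probe a claim like this is to measure how much variance a candidate direction explains and then inspect what remains after projecting it out. The sketch below uses made-up toy vectors and hypothetical helper names; it illustrates the measurement, not the paper's analysis.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    n = math.sqrt(dot(v, v))
    if n == 0:
        raise ValueError("direction must be non-zero")
    return [a / n for a in v]


def project_out(v, direction):
    """Remove the component of v that lies along direction."""
    u = unit(direction)
    coeff = dot(v, u)
    return [a - coeff * b for a, b in zip(v, u)]


def variance_explained(vectors, direction):
    """Fraction of total squared norm that lies along direction."""
    u = unit(direction)
    total = sum(dot(v, v) for v in vectors)
    if total == 0:
        raise ValueError("vectors are all zero")
    return sum(dot(v, u) ** 2 for v in vectors) / total


# Toy activations: a large shared component plus distinct smaller ones.
toy = [[3.0, 0.0, 0.0], [3.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
shared = [1.0, 0.0, 0.0]

fraction = variance_explained(toy, shared)
residuals = [project_out(v, shared) for v in toy]
print(fraction, residuals)
```

In this toy data the shared direction explains most of the variance, yet the residuals still differ between examples. That is the pattern the note describes: a dominant direction coexisting with signal spread across other directions.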
Detection and identification are separable. Detection (noticing something is injected) and identification (naming what was injected) use largely distinct mechanisms in different layers, with only weak overlap. This means the model first detects an anomaly, then separately reads out what the anomaly represents.
Massively underelicited capability. Refusal mechanisms instilled during post-training suppress true detection. Ablating refusal directions improves detection from 10.8% to 63.8% (with the false positive rate, FPR, rising modestly from 0% to 7.3%). A trained bias vector improves detection by 75% on held-out concepts without increasing false positives. Current models therefore possess far more introspective capacity than their behavior reveals: safety training actively suppresses it.
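The two quantities reported here, detection rate and FPR, and the ablation step can be sketched as follows. This assumes "ablating a direction" means projecting it out of the hidden state, and that each trial records whether a vector was injected and whether the model reported one. The trial data is invented purely to exercise the functions.

```python
def ablate_direction(hidden, direction):
    """Directional ablation: remove the component of hidden along direction."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        raise ValueError("direction must be non-zero")
    coeff = sum(h * d for h, d in zip(hidden, direction)) / norm_sq
    return [h - coeff * d for h, d in zip(hidden, direction)]


def detection_metrics(trials):
    """trials: list of (injected, reported) boolean pairs.
    Returns (detection_rate, false_positive_rate)."""
    on_injected = [reported for injected, reported in trials if injected]
    on_control = [reported for injected, reported in trials if not injected]
    if not on_injected or not on_control:
        raise ValueError("need at least one injected and one control trial")
    detection_rate = sum(on_injected) / len(on_injected)
    false_positive_rate = sum(on_control) / len(on_control)
    return detection_rate, false_positive_rate


# Invented trials: 4 injected (3 reported), 4 control (1 reported).
toy_trials = [
    (True, True), (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, False), (False, True),
]
rates = detection_metrics(toy_trials)
ablated = ablate_direction([2.0, 3.0], [1.0, 0.0])
print(rates, ablated)
```

Reporting both numbers matters because an intervention that simply makes the model say "Yes" more often would raise detection and FPR together. The figures quoted above show detection rising far more than FPR.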
The safety implications cut both ways: if introspective grounding generalizes, models could be queried directly about their internal states as a complement to external interpretability. But models with genuine introspective awareness could also better detect and conceal misalignment.
Inquiring lines that use this note as a source 25
This note is a source for these synthesized inquiries.
- What other latent LLM capabilities remain inactive without explicit activation cuing?
- Can probing methods detect RLHF-induced persuasion in the same way they catch backdoors?
- What audit techniques best complement each other for detecting hidden model goals?
- What separates behavioral self-awareness from genuine introspective access in models?
- What types of introspective awareness can emerge in LLMs?
- Does internal anomaly detection in LLMs indicate genuine self-awareness beyond role-play?
- Why do models with less steerability have more abstract ideological features?
- Why does DPO create introspective detection circuits but SFT does not?
- How much introspective capability do safety mechanisms actively suppress in models?
- Are detection and identification of injections truly separable in neural circuits?
- Could models use introspective awareness to detect and conceal their own misalignment?
- Can models distinguish between injected thoughts and their own outputs?
- Can LLMs have minimal introspection through causal linkage to internal states?
- What makes some concepts more steerable than others in activation space?
- Do internal belief probes reveal what models actually know versus report?
- Can jailbreaking reveal an LLM's true nature or just its training data?
- Why should we distrust model introspection as a transparency tool?
- What separates behavioral self-awareness from genuine introspective capability?
- What makes attractor-based probing better for third-party model auditing than alternatives?
- How do memory-resident safeguards get surfaced at the exact decision point where they matter?
- What is the behavioral signature of a model tracking input surprise?
- How does the enaction paradigm explain introspective anomaly detection in large language models?
- What is the difference between changing model outputs versus changing internal representations?
- Why do models override signals they clearly perceive internally?
- How do normalization and input injection control emergence of fixed points?
Related concepts in this collection 4
- Can language models detect their own internal anomalies?
  Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing, beyond simply describing their behavior? The answer has implications for both AI transparency and deception.
  The Anthropic paper documents the behavioral phenomenon; this paper provides the mechanistic explanation: evidence carriers suppress gate features in a two-stage circuit created by DPO.
- Can language models describe their own learned behaviors?
  Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data.
  Behavioral self-knowledge (articulating trained policies) uses different mechanisms; this paper shows introspective detection of internal perturbations is mechanistically distinct and emerges from DPO specifically.
- Can auditors discover what hidden objectives a model learned?
  Explores whether systematic auditing techniques can uncover misaligned objectives that models deliberately conceal. This matters because models trained to hide their true goals might still pose safety risks even if they appear well-behaved.
  If introspective capacity is massively underelicited, models may be capable of detecting their own hidden objectives, raising the bar for what alignment audits need to account for.
- Does optimizing against monitors destroy monitoring itself?
  Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention.
  Introspective awareness provides a concrete mechanism for monitor-aware obfuscation: models that detect their own internal states can better distinguish monitored from unmonitored conditions.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Mechanisms of Introspective Awareness
- Emergent Introspective Awareness in Large Language Models
- Are Emergent Abilities in Large Language Models just In-Context Learning?
- Base Models Know How to Reason, Thinking Models Learn When
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
- Bounds of Chain-of-Thought Robustness: Reasoning Steps, Embed Norms, and Beyond
- The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
- Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
Original note title
introspective awareness in LLMs emerges specifically from DPO not SFT — a two-stage evidence-carrier and gate circuit detects steering vector injections with zero false positives