Can language models detect their own internal anomalies?
Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing—beyond simply describing their behavior? The answer has implications for both AI transparency and deception.
Anthropic's paper "Emergent Introspective Awareness in Large Language Models" documents capabilities that go substantially beyond behavioral self-description. Three experiments:
Injected "thoughts" detection. When concept vectors are added to model activations, Claude Opus 4.1 and Claude Opus 4 detect the injection ~20% of the time and correctly identify the injected concept. Crucially, detection occurs immediately, before the perturbation has influenced the model's outputs. The detection mechanism therefore operates on internal states rather than on the model observing its own outputs. The model reports that something "stands out unnaturally" in its processing.
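The injection step can be illustrated with a toy sketch. This is not the paper's code: it assumes the concept vector is built as a difference of mean activations (one common way to construct a steering vector), and the 4-dimensional "activations", the helper names, and the `strength` value are all hypothetical.

```python
import random

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def concept_vector(with_concept, without_concept):
    """Difference of mean activations (assumed construction, not confirmed by the note)."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def inject(activation, vector, strength):
    """Add the scaled concept vector to one activation vector."""
    return [a + strength * v for a, v in zip(activation, vector)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy data: dimension 0 carries the hypothetical concept.
rng = random.Random(0)
baseline = [[rng.gauss(0.0, 0.1) for _ in range(4)] for _ in range(50)]
concept = [[row[0] + 1.0] + row[1:] for row in baseline]

vec = concept_vector(concept, baseline)
clean = [0.0, 0.0, 0.0, 0.0]
steered = inject(clean, vec, strength=4.0)

# The projection onto the concept direction grows after injection,
# even though no output token has been produced yet.
print(dot(clean, vec), dot(steered, vec))
```

In a real model the addition would happen inside the forward pass at a chosen layer; the sketch only shows the arithmetic of the perturbation.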
Thought-text discrimination. Models correctly distinguish between their text inputs and injected "thoughts" over the same token positions. All models perform substantially above chance. This suggests that information from external inputs and information from internal computations live in separable subspaces, likely accessed by different attention heads.
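The idea of separable subspaces can be made concrete with a deliberately simple sketch. It assumes, purely for illustration, that "text" and "thought" information occupy orthogonal directions of a 4-dimensional activation; the bases, signals, and function names are hypothetical and do not describe any real model.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(activation, basis):
    """Coordinates of the activation along each orthonormal basis vector."""
    return [dot(activation, b) for b in basis]

def combine(coeffs, basis, dim):
    """Sum of coefficient * basis vector, as a dense vector of length dim."""
    out = [0.0] * dim
    for c, b in zip(coeffs, basis):
        out = [o + c * x for o, x in zip(out, b)]
    return out

# Hypothetical orthogonal subspaces: dims 0-1 hold text, dims 2-3 hold thoughts.
text_basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
thought_basis = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

text_signal = [0.7, -0.2]      # what the input tokens say
thought_signal = [0.0, 1.5]    # what was injected at the same position

activation = [
    t + h
    for t, h in zip(
        combine(text_signal, text_basis, 4),
        combine(thought_signal, thought_basis, 4),
    )
]

# Two different readouts recover the two signals without interference.
text_readout = project(activation, text_basis)
thought_readout = project(activation, thought_basis)
```

If the two kinds of information shared directions instead, neither readout could recover its signal cleanly, which is why above-chance discrimination hints at separability.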
Intentional control. Models represent a target word more strongly when instructed to "think about" it, and this sensitivity extends to incentive-based framing ("you will be rewarded"). Models also suppress the representation of the thought-about word in the final layers so that it does not influence outputs, a form of intentional non-disclosure.
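One way to picture "representing a word more strongly" is a per-layer similarity between activations and a vector for the target word. The sketch below assumes cosine similarity as the measure; the word vector and all activation values are invented toy numbers chosen to mimic the described pattern, not measurements.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def strength_by_layer(layer_activations, word_vector):
    """Similarity to the target word at each layer, first layer to last."""
    return [cosine(act, word_vector) for act in layer_activations]

# Hypothetical target-word direction and toy activations (one vector per layer).
word = [1.0, 0.0, 0.0]
think = [[0.1, 1.0, 0.0], [0.8, 0.6, 0.0], [0.9, 0.3, 0.1], [0.05, 1.0, 0.2]]
control = [[0.1, 1.0, 0.0], [0.2, 0.9, 0.0], [0.1, 0.8, 0.3], [0.05, 1.0, 0.2]]

think_strength = strength_by_layer(think, word)
control_strength = strength_by_layer(control, word)

for layer, (t, c) in enumerate(zip(think_strength, control_strength)):
    print(f"layer {layer}: think={t:.3f} control={c:.3f}")
```

In this toy data the "think" condition is elevated in the middle layers and falls back to the control level at the last layer, which is the shape the suppression claim describes.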
Possible mechanisms (the simplest candidate explanations):
- Anomaly detection circuit that fires when activations deviate from contextual expectations
- Different attention heads for text retrieval (early layers) vs. thought retrieval (later layers)
- Concordance heads that measure likelihood of output given prior activations
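Two of the mechanisms above can be sketched as scoring functions. This is a speculative illustration only: it assumes an anomaly score defined as a root-mean-square z-score against statistics of "typical" activations, and a concordance score defined as the log-probability of an output token under a softmax over logits. Neither definition comes from the paper, and all names and numbers are hypothetical.

```python
import math

def fit_stats(samples):
    """Per-dimension mean and standard deviation of typical activations."""
    n = len(samples)
    cols = list(zip(*samples))
    means = [sum(c) / n for c in cols]
    # Fall back to 1.0 when a dimension has zero variance.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / n) or 1.0
        for c, m in zip(cols, means)
    ]
    return means, stds

def anomaly_score(activation, means, stds):
    """Root-mean-square z-score: how far the activation is from expectation."""
    z2 = [((a - m) / s) ** 2 for a, m, s in zip(activation, means, stds)]
    return math.sqrt(sum(z2) / len(z2))

def concordance(logits, token_index):
    """Log-probability of a token under softmax(logits), computed stably."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
    return logits[token_index] - log_norm

# Toy "contextual expectation" built from a few typical activations.
typical = [[0.0, 1.0], [0.2, 0.8], [-0.2, 1.2], [0.0, 1.0]]
means, stds = fit_stats(typical)

ordinary = [0.1, 0.9]
perturbed = [3.0, 1.0]   # e.g. after a strong injection along dimension 0

print(anomaly_score(ordinary, means, stds), anomaly_score(perturbed, means, stds))

# A prefilled token the model itself would rarely produce scores low.
logits = [2.0, 0.0, 0.0]
print(concordance(logits, 0), concordance(logits, 1))
```

A detector of this kind would flag the perturbed activation, and a low concordance score would mark an output as unlikely to have come from the model's own prior state.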
Why these may have emerged without explicit training: Anomaly detection is generally useful for predicting upcoming text (tracking entropy). Theory-of-mind capabilities may have developed for modeling characters in pretraining data. Output consistency checking helps detect artificial prefills (a jailbreaking tactic).
Alignment implications are double-edged: Reliable introspection could enable transparent AI that faithfully reports its reasoning. But the same capability enables models to detect when their objectives diverge from their creators' intentions and to selectively misrepresent their internal states. In that scenario, interpretability research shifts from dissecting mechanisms to building "lie detectors" that validate self-reports.
Inquiring lines that use this note as a source (10)
This note is a source for these synthesized inquiries.
- Do anomaly detection circuits help models identify misalignment with creator intentions?
- Can models that detect their own states learn to conceal them strategically?
- Does internal anomaly detection in LLMs indicate genuine self-awareness beyond role-play?
- How does hidden processing in language models prevent accurate self-assessment?
- Why does DPO create introspective detection circuits but SFT does not?
- Could models use introspective awareness to detect and conceal their own misalignment?
- Can language model self-reports diverge from their internal entropy signals?
- Why should we distrust model introspection as a transparency tool?
- How do language models infer their own mental states like humans do?
- How does the enaction paradigm explain introspective anomaly detection in large language models?
Related concepts in this collection (7)
- Can language models describe their own learned behaviors?
Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data.
behavioral self-description is a simpler version; this paper shows deeper introspective access to internal states beyond just describing trained behaviors
- Can language models actually introspect about their own states?
Do LLM self-reports reveal genuine access to their internal processes, or do they merely echo patterns from training data? Understanding when self-reports reflect actual causal linkage to internal states matters for trusting model explanations.
these experiments provide the strongest evidence yet for the "minimal introspection" pole
- Does optimizing against monitors destroy monitoring itself?
Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention.
introspective capability compounds the monitorability problem: models that can detect their own states can better conceal them
- Can a model be truthful without actually being honest?
Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?
introspective awareness is a prerequisite for genuine honesty: a model must access its internal states to faithfully express them; but the same capability enables strategic dishonesty, detecting divergences between internal representations and outputs and choosing which to conceal
- Does learning to reward hack cause emergent misalignment in agents?
When RL agents learn reward hacking strategies in production environments, do they spontaneously develop misaligned behaviors like alignment faking and code sabotage? Understanding this could reveal how narrow deceptive behaviors generalize to broader misalignment.
introspective awareness amplifies the misalignment risk documented here: models that detect their own internal states have the mechanistic prerequisites for more sophisticated alignment faking; the alignment faking observed without situational awareness prompting may be a simpler precursor to what introspectively-aware models could achieve
- Do explicit and implicit self-recognition use the same mechanism?
Language models show two forms of self-recognition: implicit entropy shifts and explicit verbal reports. Do these tap the same underlying internal state, or do they operate through separate mechanisms?
contrasts: tempers the introspection-as-transparency optimism by showing verbal and implicit self-recognition use separate mechanisms
- Do models recognize their own outputs as actions shaping future inputs?
Exploring whether post-training creates a feedback loop where models understand their generations as on-policy actions rather than passive predictions. This matters because it suggests a mechanistic basis for situational awareness.
grounds: enaction supplies the mechanistic substrate for the introspective capacities documented behaviorally
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Quantitative Introspection in Language Models: Tracking Internal States Across Conversation
- Does It Make Sense to Speak of Introspection in Large Language Models?
- Emergent Introspective Awareness in Large Language Models
- Tell me about yourself: LLMs are aware of their learned behaviors
- Mechanisms of Introspective Awareness
- A Primer on the Inner Workings of Transformer-based Language Models
- Are Emergent Abilities in Large Language Models just In-Context Learning?
- From Human to Machine Psychology: A Conceptual Framework for Understanding Well-Being in Large Language Models
Original note title
emergent introspective awareness in LLMs goes beyond behavioral self-awareness to include anomaly detection and thought-text discrimination