Do explicit and implicit self-recognition use the same mechanism?
Language models show two forms of self-recognition: implicit entropy shifts and explicit verbal reports. Do these tap the same underlying internal state, or do they operate through separate mechanisms?
Self-recognition in language models is not one capability but two. The implicit form shows up in the output distribution: output entropy is markedly lower when the model continues its own generation (on-policy) than when it continues prefilled text (off-policy), driven by an internal input-surprise signal the model never has to articulate. The explicit form is a verbal report: ask the model whether the preceding context is its own generation or a prefill, and it can answer. The striking finding is that these two abilities do not share a substrate: explicit verbal self-recognition routes through a different mechanism than implicit recognition.
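As a rough illustration of the implicit signal, the entropy comparison can be sketched as follows. This is a minimal sketch, not the measurement procedure behind the finding: the logits are hypothetical toy values over a four-token vocabulary, and the helper names (`softmax`, `entropy`, `mean_entropy`) are chosen here for illustration only.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(logit_rows):
    """Average next-token entropy over a sequence of per-position logits."""
    return sum(entropy(softmax(row)) for row in logit_rows) / len(logit_rows)

# Hypothetical toy logits, NOT real model outputs.
# On-policy rows are sharply peaked; off-policy rows are flatter.
on_policy_logits = [[6.0, 1.0, 0.5, 0.0], [5.0, 0.5, 0.0, 0.0]]
off_policy_logits = [[2.0, 1.5, 1.0, 0.5], [1.5, 1.0, 1.0, 0.5]]

h_on = mean_entropy(on_policy_logits)
h_off = mean_entropy(off_policy_logits)

# A positive gap means the model is less uncertain on its own text.
entropy_gap = h_off - h_on
print(f"on-policy: {h_on:.3f} nats, off-policy: {h_off:.3f} nats, gap: {entropy_gap:.3f}")
```

In a real setting the logit rows would come from the model's next-token distributions at each position of the continuation; nothing in this sketch requires the model to state anything about authorship, which is what makes the signal implicit.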
This dissociation is consequential for interpretability and self-report reliability. It means a model's verbal claim about whether it produced something is not simply reading off the same internal state that already governs its entropy. The two channels can in principle diverge: a model might verbally deny authorship while its distribution betrays on-policy confidence, or vice versa. The clean alignment between "what the model says about itself" and "what the model's internals encode" cannot be assumed.
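One way to look for such divergence at the behavioral level is to cross-tabulate the two channels per trial. The sketch below assumes hypothetical trial records with a measured mean entropy and a verbal report, and an arbitrary entropy threshold for the implicit label; none of these values come from the source. Disagreement in such a table would only show that the channels can come apart in behavior; it would not by itself establish that the mechanisms are separate.

```python
from collections import Counter

def implicit_label(mean_entropy, threshold):
    """Label a trial 'self' when entropy falls below the (assumed) threshold."""
    return "self" if mean_entropy < threshold else "other"

def cross_tabulate(trials, threshold):
    """Count (implicit label, verbal report) pairs across trials."""
    table = Counter()
    for trial in trials:
        key = (implicit_label(trial["mean_entropy"], threshold), trial["verbal_report"])
        table[key] += 1
    return table

def divergence_rate(table):
    """Fraction of trials where the implicit and verbal channels disagree."""
    total = sum(table.values())
    disagree = sum(n for (implicit, verbal), n in table.items() if implicit != verbal)
    return disagree / total if total else 0.0

# Hypothetical trial records, NOT experimental data.
trials = [
    {"mean_entropy": 0.2, "verbal_report": "self"},   # channels agree
    {"mean_entropy": 0.3, "verbal_report": "other"},  # verbal denial, confident distribution
    {"mean_entropy": 1.1, "verbal_report": "other"},  # channels agree
    {"mean_entropy": 1.3, "verbal_report": "self"},   # verbal claim, uncertain distribution
]

table = cross_tabulate(trials, threshold=0.7)  # threshold is an illustrative assumption
rate = divergence_rate(table)
print(dict(table), rate)
```

The off-diagonal cells, `("self", "other")` and `("other", "self")`, correspond to the two divergence cases described above.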
Why it matters: much of the case for using model self-reports as a window into internal states presumes those reports tap the relevant mechanism. This result is a concrete counterexample within a single capability, self-recognition, where the verbal channel and the implicit channel are mechanistically separate. It tempers optimism about introspective self-report as a transparency tool. Even when a model is right about itself, it may be right through a path disconnected from the process actually doing the work, and that is exactly the failure mode that makes self-reports hard to trust as evidence.
Inquiring lines that use this note as a source (7)
This note is a source for these synthesized inquiries.
- Can self-description of internal states influence consciousness attribution?
- Can language model self-reports diverge from their internal entropy signals?
- What separates behavioral self-awareness from genuine introspective capability?
- What distinguishes performative self-reports from genuine introspective access in models?
- Why do verbal self-reports disconnect from implicit recognition in the same system?
- How does continuous implicit memory formation differ from explicit memory encoding?
- How do pairwise self-judgment and internal belief-shift replace verification differently?
Related concepts in this collection (3)
- Why do models produce less uncertain outputs on their own text?
  Post-trained language models show 3-4x lower output entropy when continuing their own generations versus prefilled text. This explores what mechanism drives that confidence gap and whether it reflects genuine self-recognition.
  Relation: the implicit channel whose mechanism the explicit verbal channel does not share.
- Can language models actually introspect about their own states?
  Do LLM self-reports reveal genuine access to their internal processes, or do they merely echo patterns from training data? Understanding when self-reports reflect actual causal linkage to internal states matters for trusting model explanations.
  Relation: the explicit-implicit dissociation is a within-capability instance of self-reports not always tracking the causal internal state.
- Can language models detect their own internal anomalies?
  Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing, beyond simply describing their behavior? The answer has implications for both AI transparency and deception.
  Relation: both probe the boundary between behavioral capability and genuine introspective access.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Large Language Models Report Subjective Experience Under Self-Referential Processing
- Does It Make Sense to Speak of Introspection in Large Language Models?
- Quantitative Introspection in Language Models: Tracking Internal States Across Conversation
- From Simulation to Enaction: Post-trained Language Models Recognize and React to their own Generations
- Emergent Introspective Awareness in Large Language Models
- Mechanisms of Introspective Awareness
- Tell me about yourself: LLMs are aware of their learned behaviors
- How do Transformers Learn Implicit Reasoning?
Original note title
explicit verbal self-recognition routes through a different mechanism than implicit recognition