Can language models actually introspect about their own states?
Do LLM self-reports reveal genuine access to their internal processes, or do they merely echo patterns from training data? Understanding when self-reports reflect actual causal linkage to internal states matters for trusting model explanations.
The question "can LLMs introspect?" has been stuck in a binary: either they have privileged access to their own states (implausible) or their self-reports are pure confabulation (too dismissive). The introspection paper ("Does It Make Sense to Speak of Introspection in Large Language Models?") proposes a third position: a "lightweight conception of introspection" that requires neither consciousness nor immediacy, only a causal process linking an internal state to an accurate self-report.
Two examples make the distinction concrete. When asked to describe the process behind its creative writing, an LLM claims to have "read the poem aloud several times", an action it cannot perform. This self-report reflects the distribution of human self-reports in training data, not any actual internal process. It fails the causal linkage test because the content of the report does not derive from any state of the LLM's actual generation process; it derives from how humans describe writing.
However, when Gemini is asked to estimate whether its sampling temperature is high or low, and given appropriate scaffolding (being told it is an LLM with a temperature parameter), it correctly infers "relatively low" by reasoning about the characteristics of its own recent outputs — consistency, accuracy, focus. The causal chain here is plausible: the model's outputs at low temperature have statistical properties (lower variance, more predictable) that the model can detect in its own generation history and accurately report on.
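The statistical claim in the temperature example can be made concrete with a toy sampler. The sketch below is only an illustration under stated assumptions: the five logits, the `mode_share` diversity measure, and the 0.8 threshold in `guess_temperature_band` are all hypothetical choices, not anything taken from the paper or from Gemini. It shows that dividing logits by a lower temperature concentrates the distribution, so a reader of the resulting samples can guess the temperature band from the samples alone.

```python
import math
import random
from collections import Counter

# Hypothetical next-token logits for a 5-token vocabulary (illustrative only).
LOGITS = [2.0, 1.0, 0.5, 0.0, -1.0]

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; lower means a more predictable distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def sample_tokens(logits, temperature, n, rng):
    """Draw n token indices from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=n)

def mode_share(tokens):
    """Fraction of samples equal to the most frequent token."""
    return Counter(tokens).most_common(1)[0][1] / len(tokens)

def guess_temperature_band(tokens, threshold=0.8):
    """Crude guess from outputs alone; the threshold is an arbitrary assumption."""
    return "low" if mode_share(tokens) >= threshold else "high"

if __name__ == "__main__":
    rng = random.Random(0)
    for t in (0.2, 1.5):
        tokens = sample_tokens(LOGITS, t, 500, rng)
        probs = softmax_with_temperature(LOGITS, t)
        print(t, round(entropy(probs), 3), round(mode_share(tokens), 3),
              guess_temperature_band(tokens))
```

The guess uses only the generated samples, never the temperature value itself, which mirrors the inference-from-outputs route described above rather than any direct read of an internal parameter.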
This conception aligns with "internally-directed theory of mind" accounts of human introspection — where the same theory-of-mind apparatus used to infer others' mental states gets turned back on one's own behavior. The model is not directly accessing its internal states but inferring them from observable consequences, which is also what many philosophers argue humans do.
The practical implication: LLM self-reports should not be uniformly trusted or dismissed. The discriminating question is whether a plausible causal pathway exists between the reported internal state and the generation of the report. Most self-reports about "thinking" or "feeling" fail this test. Some self-reports about detectable operational parameters may pass it.
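One way to picture the discriminating question is as an intervention test: change the internal state and see whether the report changes with it. The sketch below is a toy under loud assumptions. Treating "causal pathway" as counterfactual dependence is this sketch's simplification rather than the paper's definition, and `generate_outputs`, `linked_reporter`, `scripted_reporter`, and the 0.5 cutoff are hypothetical stand-ins, not models of any real LLM.

```python
import random
import statistics

def generate_outputs(temperature, n, rng):
    """Toy stand-in for generation: output spread grows with temperature."""
    return [rng.gauss(0.0, temperature) for _ in range(n)]

def linked_reporter(outputs):
    """Report inferred from the system's own recent outputs (0.5 cutoff is arbitrary)."""
    return "low" if statistics.pstdev(outputs) < 0.5 else "high"

def scripted_reporter(outputs):
    """Report that ignores the outputs, like a stock phrase echoed from training data."""
    return "low"

def report_depends_on_state(reporter, temperatures, n=200, seed=0):
    """Intervene on the hidden temperature and check whether the report changes."""
    rng = random.Random(seed)
    reports = {reporter(generate_outputs(t, n, rng)) for t in temperatures}
    return len(reports) > 1

if __name__ == "__main__":
    for reporter in (linked_reporter, scripted_reporter):
        print(reporter.__name__, report_depends_on_state(reporter, [0.1, 1.5]))
```

Note that `scripted_reporter` happens to be correct whenever the temperature is low. Accuracy on one occasion is therefore not enough; what separates the two reporters is whether the report moves when the state is changed.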
Inquiring lines that use this note as a source (63)
This note is a source for these synthesized inquiries.
- How does self-observation enable experts to verify their own judgment?
- Can LLMs evaluate their own observations without external feedback?
- Can systems lacking inner states express genuine truthfulness claims?
- How does automated transcript analysis compare to patient self-report on engagement?
- Why does self-critiquing actually reduce plan quality in language models?
- Do LLMs genuinely internalize human psychological structure or match surface patterns?
- Can self-description of internal states influence consciousness attribution?
- Why do users attribute consciousness to language models in practice?
- Do models actually self-assess their confidence or just confirm answers?
- What audit techniques best complement each other for detecting hidden model goals?
- Can LLMs infer psychological profiles without explicit user disclosure?
- Can we use folk-psychology without committing to genuine mental states?
- How do implicit world models and self-reflection operationalize consequence-based learning?
- What role does authentic self-expression play in building accurate personality models?
- What separates behavioral self-awareness from genuine introspective access in models?
- What types of introspective awareness can emerge in LLMs?
- Does internal anomaly detection in LLMs indicate genuine self-awareness beyond role-play?
- How does hidden processing in language models prevent accurate self-assessment?
- Does inner subjective experience matter for discourse participation?
- Why does entity recognition act as a self-knowledge mechanism in LLMs?
- Why does optimism bias disappear when LLMs passively observe outcomes?
- Why does DPO create introspective detection circuits but SFT does not?
- Could models use introspective awareness to detect and conceal their own misalignment?
- Can models distinguish between injected thoughts and their own outputs?
- Does behavioral self-awareness depend on genuine introspection or statistical pattern matching?
- Do external perspectives fix the self-evaluation bias in language models?
- How do internal representations compare to human cognitive structures?
- Why do LLMs inherit causal biases from their training data?
- What does zero-shot psychological profiling reveal about language model representations?
- Do causal histories determine what mental states a system can instantiate?
- Can LLMs have minimal introspection through causal linkage to internal states?
- Why does self-reflection during training fail to improve model self-correction?
- Can behavioral self-awareness in LLMs extend to recognizing their own contradictions?
- What behavioral markers distinguish realized quasi-states from pretended ones?
- Does self-reflection help models notice their own constraint violations?
- Where does the LLM interlocutor actually exist in the system?
- Do internal belief probes reveal what models actually know versus report?
- Can language models learn internal world models without explicit environment specifications?
- Why do LLMs succeed at social roles without a stable self?
- Can jailbreaking reveal an LLM's true nature or just its training data?
- Can implicit association tests reveal LLM biases beneath trained responses?
- Why does self-judgment of success or failure work without ground truth labels?
- Can language model self-reports diverge from their internal entropy signals?
- Why should we distrust model introspection as a transparency tool?
- What separates behavioral self-awareness from genuine introspective capability?
- What distinguishes performative self-reports from genuine introspective access in models?
- How do language models infer their own mental states like humans do?
- Why do verbal self-reports disconnect from implicit recognition in the same system?
- Can models detect statistical properties of their own generation in real time?
- Can external retrieval signals outperform internal self-assessment during revision?
- Do models spontaneously develop self-reflection from minimal training signals?
- Why does self-distillation suppress epistemic verbalization in student models?
- Can models detect when their own trajectory is on-policy versus off-policy?
- How does the enaction paradigm explain introspective anomaly detection in large language models?
- Do models verbalize their implicit knowledge when that knowledge influences their output?
- How can we probe LLM representations in channels that training did not target?
- What is the difference between changing model outputs versus changing internal representations?
- Why do models override signals they clearly perceive internally?
- Does external critique guide revision better than internal self-assessment during model training?
- How do pairwise self-judgment and internal belief-shift replace verification differently?
- How does attention sink behavior relate to internal model architecture?
- How do LLM explanations diverge from actual internal reasoning?
- What prevents LLM representations from causally influencing generation outputs?
Related concepts in this collection (4)
- Do LLMs develop the same kind of mind as humans?
  Explores whether LLMs and humans share the intersubjective linguistic training that shapes cognition, and whether that shared training produces equivalent forms of agency and reflexivity.
  The introspection finding adds a specific mechanism: introspective access is possible for operational states but not for experiential ones.
- Can language models describe their own learned behaviors?
  Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data.
  Complementary finding: behavioral self-awareness emerges, but this paper adds the causal-linkage criterion for distinguishing genuine from performative self-reports.
- Do reasoning models actually use the hints they receive?
  This explores whether language models acknowledge reasoning hints in their explanations when those hints causally influence their answers. Understanding this gap matters for evaluating whether chain-of-thought explanations can be trusted for safety monitoring.
  The inverse case: reasoning models fail to report on processes that ARE causally influencing them.
- Do explicit and implicit self-recognition use the same mechanism?
  Language models show two forms of self-recognition: implicit entropy shifts and explicit verbal reports. Do these tap the same underlying internal state, or do they operate through separate mechanisms?
  Extends: a concrete within-capability counterexample where the verbal channel is mechanistically disconnected from the implicit state it reports on.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Does It Make Sense to Speak of Introspection in Large Language Models?
- Quantitative Introspection in Language Models: Tracking Internal States Across Conversation
- Large Language Models Report Subjective Experience Under Self-Referential Processing
- “Understanding AI”: Semantic Grounding in Large Language Models
- Tell me about yourself: LLMs are aware of their learned behaviors
- Mechanisms of Introspective Awareness
- When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models
- Emergent Introspective Awareness in Large Language Models
Original note title
llm self-reports mostly reflect training data distributions not introspection — but minimal introspection is possible when self-reports causally link to internal states