SYNTHESIS NOTE

Can language models actually introspect about their own states?

Do LLM self-reports reveal genuine access to their internal processes, or do they merely echo patterns from training data? Understanding when self-reports reflect actual causal linkage to internal states matters for trusting model explanations.

Synthesis note · 2026-02-22 · sourced from Theory of Mind

The question "can LLMs introspect?" has been stuck in a binary: either they have privileged access to their own states (implausible) or their self-reports are pure confabulation (too dismissive). The introspection paper proposes a third position — a "lightweight conception of introspection" that requires neither consciousness nor immediacy, only a causal process linking an internal state to an accurate self-report.

Two examples make the distinction concrete. When asked to describe the process behind its creative writing, an LLM claims to have "read the poem aloud several times", an action it cannot perform. This self-report reflects the distribution of human self-reports in training data, not any actual internal process. It fails the causal-linkage test because no causal pathway runs from the LLM's actual generation mechanism to the content of the report.

However, when Gemini is asked to estimate whether its sampling temperature is high or low, and given appropriate scaffolding (being told it is an LLM with a temperature parameter), it correctly infers "relatively low" by reasoning about the characteristics of its own recent outputs — consistency, accuracy, focus. The causal chain here is plausible: the model's outputs at low temperature have statistical properties (lower variance, more predictable) that the model can detect in its own generation history and accurately report on.
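
The statistical claim in the temperature example can be illustrated with a toy sketch. It is not the procedure Gemini or the source paper uses; the logits, the function names, and the 0.5 threshold are all assumptions made for illustration. It shows only that samples drawn at a low temperature are measurably less diverse than samples drawn at a high one, so a coarse "low" or "high" label can in principle be inferred from a generation history alone.

```python
import math
import random
from collections import Counter


def softmax_with_temperature(logits, temperature):
    """Turn logits into sampling probabilities, scaled by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_tokens(logits, temperature, n, rng):
    """Draw n token ids from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=n)


def empirical_entropy(tokens):
    """Shannon entropy (in nats) of the observed token frequencies."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def guess_temperature_band(tokens, vocab_size, threshold=0.5):
    """Label a history 'low' or 'high' from its normalized entropy.

    The threshold is an arbitrary illustrative cut-off, not a calibrated value.
    """
    normalized = empirical_entropy(tokens) / math.log(vocab_size)
    return "low" if normalized < threshold else "high"


# Hypothetical next-token logits over a five-token vocabulary.
TOY_LOGITS = [4.0, 2.0, 1.0, 0.5, 0.0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
low_temp_history = sample_tokens(TOY_LOGITS, 0.2, 2000, rng)
high_temp_history = sample_tokens(TOY_LOGITS, 2.0, 2000, rng)

low_band = guess_temperature_band(low_temp_history, len(TOY_LOGITS))
high_band = guess_temperature_band(high_temp_history, len(TOY_LOGITS))
```

The guess is computed only from the sampled history, never from the temperature argument itself. That mirrors the note's point: the report is inferred from observable consequences of the internal state, not read off the state directly.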

This conception aligns with "internally-directed theory of mind" accounts of human introspection — where the same theory-of-mind apparatus used to infer others' mental states gets turned back on one's own behavior. The model is not directly accessing its internal states but inferring them from observable consequences, which is also what many philosophers argue humans do.

The practical implication: LLM self-reports should not be uniformly trusted or dismissed. The discriminating question is whether a plausible causal pathway exists between the reported internal state and the generation of the report. Most self-reports about "thinking" or "feeling" fail this test. Some self-reports about detectable operational parameters may pass it.

Inquiring lines that read this note (65)

This note is a source for the following research framings.

Does self-reflection enable models to reliably correct their errors?

How do evaluation biases undermine LLM quality assessment systems?

Is model self-awareness based on genuine introspection or pattern matching?

Why does self-revision increase model confidence while degrading accuracy?

Do language models develop causal world models or rely on statistical patterns?

Can model confidence signals reliably improve reasoning quality and calibration?

Do models actually self-assess their confidence or just confirm answers?

How can identical external performance mask different internal representations?

What audit techniques best complement each other for detecting hidden model goals?

How can persona representations reduce language model variance and improve task accuracy?

Can LLMs infer psychological profiles without explicit user disclosure?

How do interface design choices shape consciousness attribution?

What makes dialogue-based explanation more successful than monologue?

Does inner subjective experience matter for discourse participation?

How do language models inherit human biases from training data?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

Why does DPO create introspective detection circuits but SFT does not?

What prevents language models from reliably adopting diverse personas?

What does zero-shot psychological profiling reveal about language model representations?

How can conversational AI maintain consistent personas across conversations?

What behavioral markers distinguish realized quasi-states from pretended ones?

How can LLM user simulators model realistic goal-driven conversation?

Where does the LLM interlocutor actually exist in the system?

Can LLM personas constitute genuine psychology or remain linguistic role-play?

Why do LLMs succeed at social roles without a stable self?

How do self-generated feedback mechanisms enable effective model learning?

Why does self-judgment of success or failure work without ground truth labels?

What are the consequences of models training on synthetic data?

Can models detect statistical properties of their own generation in real time?

How should models express uncertainty rather than forced confident answers?

Why does self-distillation suppress epistemic verbalization in student models?

How do training priors constrain what context information can override?

What is the difference between changing model outputs versus changing internal representations?

What constrains reinforcement learning's ability to expand model reasoning?

How do pairwise self-judgment and internal belief-shift replace verification differently?

What structural biases does transformer attention create in language model outputs?

How does attention sink behavior relate to internal model architecture?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How do LLM explanations diverge from actual internal reasoning?

Related concepts in this collection (4)

Do LLMs develop the same kind of mind as humans?
Explores whether LLMs and humans share the intersubjective linguistic training that shapes cognition, and whether that shared training produces equivalent forms of agency and reflexivity.
The introspection finding adds a specific mechanism: introspective access is possible for operational states but not for experiential ones.

Can language models describe their own learned behaviors?
Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data.
Complementary finding: behavioral self-awareness emerges, but this paper adds the causal-linkage criterion for distinguishing genuine from performative self-reports.

Do reasoning models actually use the hints they receive?
This explores whether language models acknowledge reasoning hints in their explanations when those hints causally influence their answers. Understanding this gap matters for evaluating whether chain-of-thought explanations can be trusted for safety monitoring.
The inverse case: reasoning models fail to report on processes that ARE causally influencing them.

Do explicit and implicit self-recognition use the same mechanism?
Language models show two forms of self-recognition: implicit entropy shifts and explicit verbal reports. Do these tap the same underlying internal state, or do they operate through separate mechanisms?
Extends: a concrete within-capability counterexample where the verbal channel is mechanistically disconnected from the implicit state it reports on.
