SYNTHESIS NOTE
Psychology, Society, and Alignment

Can language models actually introspect about their own states?

Do LLM self-reports reveal genuine access to their internal processes, or do they merely echo patterns from training data? Knowing when a self-report is causally linked to the internal state it describes matters for deciding how far to trust model explanations.

Synthesis note · 2026-02-22 · sourced from Theory of Mind

The question "can LLMs introspect?" has been stuck in a binary: either they have privileged access to their own states (implausible) or their self-reports are pure confabulation (too dismissive). The introspection paper proposes a third position — a "lightweight conception of introspection" that requires neither consciousness nor immediacy, only a causal process linking an internal state to an accurate self-report.

Two examples make the distinction concrete. When asked to describe the process behind its creative writing, an LLM claims to have "read the poem aloud several times" — an action it cannot perform. This self-report reflects the distribution of human self-reports in training data, not any actual internal process. It fails the causal linkage test because the content of the report has no pathway to the LLM's actual generation mechanism.

However, when Gemini is asked to estimate whether its sampling temperature is high or low, and is given appropriate scaffolding (being told it is an LLM with a temperature parameter), it correctly infers "relatively low" by reasoning about the characteristics of its own recent outputs: consistency, accuracy, focus. The causal chain here is plausible: low-temperature outputs have statistical properties (lower variance, greater predictability) that the model can detect in its own generation history and report accurately.
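
The statistical signature mentioned above can be illustrated with a small sketch. It is not the method used in the paper; the toy logits, the 0.5 threshold, and the helper names (softmax_with_temperature, guess_temperature_regime) are assumptions made for illustration. It shows only that samples drawn at low temperature have lower entropy, so a coarse "high or low" judgment can in principle be recovered from the outputs alone.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means less predictable outputs."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sample_tokens(logits, temperature, n, rng):
    """Draw n token indices from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=n)


def guess_temperature_regime(samples, vocab_size, threshold=0.5):
    """Hypothetical heuristic: call a run 'low' when the empirical entropy
    of its samples is below `threshold` times the maximum possible entropy."""
    n = len(samples)
    empirical = [samples.count(i) / n for i in range(vocab_size)]
    ratio = entropy(empirical) / math.log(vocab_size)
    return "low" if ratio < threshold else "high"


# Toy logits over a five-token vocabulary (illustrative values only).
TOY_LOGITS = [2.0, 1.0, 0.5, 0.0, -1.0]
rng = random.Random(0)
low_run = sample_tokens(TOY_LOGITS, 0.2, 500, rng)
high_run = sample_tokens(TOY_LOGITS, 2.0, 500, rng)

print(guess_temperature_regime(low_run, len(TOY_LOGITS)))
print(guess_temperature_regime(high_run, len(TOY_LOGITS)))
```

The guess is made from the sampled outputs only, never from the temperature value itself, which mirrors the inferential route described above: the state is estimated from its observable consequences.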

This conception aligns with "internally-directed theory of mind" accounts of human introspection — where the same theory-of-mind apparatus used to infer others' mental states gets turned back on one's own behavior. The model is not directly accessing its internal states but inferring them from observable consequences, which is also what many philosophers argue humans do.

The practical implication: LLM self-reports should not be uniformly trusted or dismissed. The discriminating question is whether a plausible causal pathway exists between the reported internal state and the generation of the report. Most self-reports about "thinking" or "feeling" fail this test. Some self-reports about detectable operational parameters may pass it.
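
One way to make the discriminating question operational is an intervention check: change the internal state and see whether the report changes with it. The sketch below is a toy under stated assumptions, not a procedure taken from the source. The generator, both reporter functions, and the cutoff are hypothetical stand-ins.

```python
import random


def generate(temperature, rng, n=200):
    """Toy generator: the spread of outputs grows with temperature (an assumption)."""
    return [rng.gauss(0.0, temperature) for _ in range(n)]


def confabulating_report(outputs):
    """Ignores its own outputs and repeats a stock answer."""
    return "low"


def inferential_report(outputs, cutoff=1.0):
    """Infers the setting from the observed spread of its own outputs."""
    mean = sum(outputs) / len(outputs)
    spread = (sum((x - mean) ** 2 for x in outputs) / len(outputs)) ** 0.5
    return "low" if spread < cutoff else "high"


def report_tracks_state(reporter, rng):
    """Intervene on the hidden setting and check that the report follows it."""
    settings = {"low": 0.3, "high": 2.0}
    return all(reporter(generate(t, rng)) == label for label, t in settings.items())


rng = random.Random(0)
print(report_tracks_state(confabulating_report, rng))
print(report_tracks_state(inferential_report, rng))
```

Note that the confabulating reporter is correct whenever the setting happens to be low. Accuracy at a single setting therefore does not establish causal linkage; the report has to covary with the state under intervention.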

Original note title

llm self-reports mostly reflect training data distributions not introspection — but minimal introspection is possible when self-reports causally link to internal states