SYNTHESIS NOTE

Do language models understand in fundamentally different ways?

Does mechanistic evidence reveal distinct tiers of understanding in LLMs—from concept recognition to factual knowledge to principled reasoning? And do these tiers coexist rather than replace each other?

Synthesis note · 2026-04-18 · sourced from MechInterp

This paper synthesizes mechanistic interpretability findings into a philosophical framework that moves beyond the binary "does AI understand?" debate. The framework proposes three hierarchical tiers:

Tier 1: Conceptual understanding arises when a model forms "features", directions in latent space that unify diverse manifestations of a single entity or property. This is the representational foundation: the model has learned that different surface forms map to the same underlying concept. Evidence from mechanistic interpretability (MI): sparse autoencoder (SAE) features, linear probing, and representation-geometry studies all support this.
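
As a rough illustration of what "a feature as a direction in latent space" means, the sketch below builds synthetic activations in which a concept shifts the vector along one hidden direction, then recovers that direction with a difference-of-means linear probe. Everything here is an assumption for illustration: the data is random noise rather than real model activations, and the names (sample_activations, true_direction, probe) are hypothetical. Real probing studies typically fit a regularized classifier on activations taken from an actual model.

```python
import math
import random

def sample_activations(direction, n, concept_present, rng, noise=0.5):
    """Draw n synthetic activation vectors: Gaussian noise, plus the
    concept direction whenever the concept is present."""
    rows = []
    for _ in range(n):
        row = [rng.gauss(0.0, noise) for _ in direction]
        if concept_present:
            row = [x + d for x, d in zip(row, direction)]
        rows.append(row)
    return rows

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
dim = 16

# Hypothetical ground-truth concept direction with length 2.
raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
scale = 2.0 / math.sqrt(dot(raw, raw))
true_direction = [scale * x for x in raw]

# "Training" activations with and without the concept.
pos = sample_activations(true_direction, 500, True, rng)
neg = sample_activations(true_direction, 500, False, rng)

# Difference-of-means probe: the direction separating the two class means.
mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
probe = [a - b for a, b in zip(mean_pos, mean_neg)]
threshold = 0.5 * (dot(mean_pos, probe) + dot(mean_neg, probe))

def concept_detected(activation):
    """Project onto the probe and compare with the midpoint threshold."""
    return dot(activation, probe) > threshold

# Evaluate on fresh samples that were not used to fit the probe.
test_pos = sample_activations(true_direction, 200, True, rng)
test_neg = sample_activations(true_direction, 200, False, rng)
correct = sum(concept_detected(x) for x in test_pos)
correct += sum(not concept_detected(x) for x in test_neg)
accuracy = correct / (len(test_pos) + len(test_neg))

print(f"cosine(probe, true direction) = {cosine(probe, true_direction):.3f}")
print(f"held-out accuracy = {accuracy:.3f}")
```

A probe that aligns closely with the planted direction and separates held-out samples is the toy analogue of finding a linearly decodable concept. It does not by itself show that a model uses that direction, which is why probing is usually paired with other evidence.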

Tier 2: State-of-the-world understanding arises when the model learns contingent factual connections between features and dynamically tracks how they change. On this view, "Michael Jordan is a basketball player" is not just a high-probability string but a reflection of an internal model linking the Michael Jordan concept to the basketball player concept. This goes beyond mere association to structured knowledge representation.
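
One simple way to picture a factual link between two features is a linear associative memory, where a weight matrix maps a subject vector to an attribute vector and can be overwritten when the fact changes. The sketch below is a toy under stated assumptions, not a claim about how any particular model stores facts: the one-hot vectors, the FactMemory class, and the placeholder subject "Entity B" are all hypothetical.

```python
# Hypothetical one-hot feature vectors for subjects (keys) and attributes (values).
SUBJECTS = {"Michael Jordan": [1.0, 0.0], "Entity B": [0.0, 1.0]}
ATTRIBUTES = {
    "basketball player": [1.0, 0.0, 0.0],
    "tennis player": [0.0, 1.0, 0.0],
    "coach": [0.0, 0.0, 1.0],
}

class FactMemory:
    """Toy linear associative memory: W maps a subject key to an attribute value."""

    def __init__(self, key_dim, value_dim):
        self.W = [[0.0] * key_dim for _ in range(value_dim)]

    def recall_vector(self, key):
        # Matrix-vector product W @ key.
        return [sum(w * k for w, k in zip(row, key)) for row in self.W]

    def write(self, subject, attribute):
        """Rank-one update W += (target - W k) k^T, so that W k == target.
        This assumes unit-norm keys; orthogonal keys leave other facts intact."""
        key, target = SUBJECTS[subject], ATTRIBUTES[attribute]
        current = self.recall_vector(key)
        for i, row in enumerate(self.W):
            delta = target[i] - current[i]
            for j in range(len(row)):
                row[j] += delta * key[j]

    def read(self, subject):
        """Decode the recalled vector to the best-matching attribute name."""
        value = self.recall_vector(SUBJECTS[subject])
        return max(
            ATTRIBUTES,
            key=lambda name: sum(a * b for a, b in zip(ATTRIBUTES[name], value)),
        )

memory = FactMemory(key_dim=2, value_dim=3)
memory.write("Michael Jordan", "basketball player")
memory.write("Entity B", "tennis player")
before = memory.read("Entity B")

# The world changes: overwrite one fact without disturbing the other.
memory.write("Entity B", "coach")
after = memory.read("Entity B")

print(memory.read("Michael Jordan"), "|", before, "->", after)
```

The point of the sketch is the separation of roles: the concept vectors (Tier 1) stay fixed while the links between them are contingent and editable, which is what distinguishes state-of-the-world knowledge from the concepts it connects.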

Tier 3: Principled understanding arises when the model discovers compact "circuits" that connect facts via general rules rather than memorizing each fact individually. This is the shift from knowing that to knowing why. The grokking literature provides the clearest evidence: models that transition from memorization to generalization develop circuits that implement actual algorithmic rules (e.g., modular addition via discrete Fourier transforms and trigonometric identities).
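
To make the modular addition example concrete, the sketch below computes (a + b) mod p using only sines and cosines at a few frequencies, combined through angle-addition identities, and then picks the candidate answer with the highest score. It is a hand-written illustration of the kind of rule described in the grokking literature, not code extracted from a trained network. The function name fourier_mod_add, the choice of frequencies (1, 2, 3), and the requirement that p be prime are assumptions made for this example.

```python
import math

def fourier_mod_add(a, b, p, freqs=(1, 2, 3)):
    """Return (a + b) mod p using trigonometric identities only.

    For each frequency k with w = 2*pi*k/p, the score for a candidate c is
    cos(w * (a + b - c)), which reaches its maximum of 1 exactly when
    c == (a + b) mod p (assuming p is prime and k is not a multiple of p).
    """
    best_c, best_score = 0, float("-inf")
    for c in range(p):
        score = 0.0
        for k in freqs:
            w = 2.0 * math.pi * k / p
            # Angle addition: build cos/sin of w*(a+b) from the separate inputs.
            cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
            sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
            # Angle subtraction: cos(w*(a+b) - w*c).
            score += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        if score > best_score:
            best_c, best_score = c, score
    return best_c

p = 13
all_correct = all(
    fourier_mod_add(a, b, p) == (a + b) % p
    for a in range(p)
    for b in range(p)
)
print("rule matches every pair:", all_correct)
```

A lookup table for the same task would need p * p entries, one per memorized fact, whereas this rule needs only a handful of frequencies. That contrast is the sense in which a circuit is compact.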

The critical insight is that higher-tier mechanisms coexist with lower-tier heuristics rather than replacing them. A model can have principled understanding of arithmetic in one circuit while relying on pattern-matching heuristics in another. This heterogeneity means understanding is not a single binary property but a patchwork: principled in some domains, merely conceptual in others, and purely heuristic in yet others.

This has direct implications for trust and deployment. Demonstrating principled understanding in one domain does not guarantee that a model operates at the same tier in adjacent domains. The coexistence of tiers also explains why models can be simultaneously impressive and brittle: the principled circuits generalize reliably, while the heuristic patches fail unpredictably.

Inquiring lines that read this note 71

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

What limits mechanistic interpretability's ability to characterize models?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

Is model self-awareness based on genuine introspection or pattern matching?

How should we design LLM systems to maintain alignment and control?

How do language models establish social grounding in human dialogue?

Why do reasoning models fail at systematic problem-solving and search?

Where do humans and language models actually diverge in reasoning ability?

Do language models perform faithful symbolic reasoning independent of semantic grounding?

What factors beyond surface content determine how readers extract meaning differently?

What distinguishes genuine understanding from correct output without coherent principles?

Do language models learn genuine linguistic structure or just surface patterns?

How do language models inherit human biases from training data?

How do neural networks separate factual knowledge from reasoning abilities?

Is embodied interaction necessary for language meaning and genuine agency?

Do accurate-looking LLM outputs hide structural failures in learning and reasoning?

Where do LLMs fail as knowledge systems compared to humans?

Do language models develop causal world models or rely on statistical patterns?

How do LLMs distinguish causal reasoning from temporal and semantic associations?

What critical LLM failures do standard benchmarks hide?

Why do benchmark tests fail to detect LLM comprehension gaps?

How can models identify insufficient information and respond appropriately without guessing?

Can models distinguish between activated knowledge and genuine reasoning?

Why do language models reinforce false assumptions instead of correcting them?

Can language models distinguish between novel insight and unjustified conceptual blending?

Do base models contain latent reasoning that training can unlock?

Can LLM personas constitute genuine psychology or remain linguistic role-play?

Is the distinction between pretense and realization meaningful for LLMs?

Do language models understand semantics or rely on pattern matching?

What semantic information is necessary to preserve for sound LLM reasoning?

How can LLM recommenders match or exceed collaborative filtering performance?

Why do LLMs rely on content knowledge instead of collaborative signals?

How do evaluation biases undermine LLM quality assessment systems?

What capability boundary exists in LLM prediction of effect sizes?

Related concepts in this collection 4

What happens inside models when they suddenly generalize? Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?
grokking is the mechanistic signature of the transition from tier 2 (state-of-world, memorized facts) to tier 3 (principled, circuit-based understanding)

Can LLMs understand concepts they cannot apply? Explores whether large language models can correctly explain ideas while simultaneously failing to use them, and whether that combination reveals something fundamentally different from ordinary mistakes.
Potemkin understanding maps to cases where the model has tier-1 conceptual understanding (can explain) but lacks tier-3 principled understanding (cannot apply)

Can AI pass every test while understanding nothing? Explores whether neural networks can produce perfect outputs while having fundamentally broken internal representations. Asks what performance benchmarks actually measure and whether they can distinguish real understanding from fraud.
FER/imposter intelligence is a case where performance metrics cannot distinguish between tiers of understanding

Can a model be truthful without actually being honest? Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?
the three-tier framework clarifies why: honesty requires tier-2 state-of-world understanding (tracking what the model itself believes), while truthfulness only requires that outputs match facts regardless of internal tier

Original note title

mechanistic interpretability evidence supports three hierarchical varieties of LLM understanding — conceptual then state-of-world then principled — each tied to a distinct computational organization
