SYNTHESIS NOTE

Do models know what they don't know?

Can language models develop internal representations that track their own knowledge boundaries? This matters because understanding self-knowledge mechanisms could explain how models choose between hallucination and refusal.

Synthesis note · 2026-02-23 · sourced from Knowledge Graphs

Using sparse autoencoders (SAEs) on Gemma 2 (2B and 9B), researchers found that models develop internal representations of whether they "know" an entity, a form of self-knowledge about their own capabilities. These entity recognition directions in representation space signal whether the model recognizes an entity as one it can recall facts about (for example, flagging that it does not know a specific athlete or movie).
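As a rough illustration of how such a direction might be picked out, the sketch below scores each SAE latent by how much more often it fires on prompts about known entities than on prompts about unknown ones. This is a minimal sketch under stated assumptions, not the researchers' procedure: the function names, the firing threshold, and the toy activation values are all hypothetical, and real SAE activations would come from a trained autoencoder rather than hand-written lists.

```python
def activation_frequency(latent_acts, threshold=0.0):
    """Fraction of prompts on which each latent fires (activation > threshold).

    latent_acts: list of rows, one row per prompt, one column per SAE latent.
    """
    n_prompts = len(latent_acts)
    n_latents = len(latent_acts[0])
    return [
        sum(1 for row in latent_acts if row[j] > threshold) / n_prompts
        for j in range(n_latents)
    ]


def separation_scores(known_acts, unknown_acts):
    """Per-latent score: firing rate on known entities minus rate on unknown ones."""
    f_known = activation_frequency(known_acts)
    f_unknown = activation_frequency(unknown_acts)
    return [fk - fu for fk, fu in zip(f_known, f_unknown)]


# Toy, made-up activations (hypothetical): 3 prompts x 3 latents per group.
known_acts = [[0.9, 0.0, 0.2], [1.1, 0.0, 0.0], [0.7, 0.1, 0.3]]
unknown_acts = [[0.0, 0.8, 0.1], [0.0, 1.2, 0.0], [0.0, 0.9, 0.4]]

scores = separation_scores(known_acts, unknown_acts)

# Most positive score: candidate "known entity" latent.
known_latent = max(range(len(scores)), key=scores.__getitem__)
# Most negative score: candidate "unknown entity" latent.
unknown_latent = min(range(len(scores)), key=scores.__getitem__)
```

In this toy data, latent 0 fires only on known entities and latent 1 almost only on unknown ones, so they come out as the two candidates; latent 2 fires equally often in both groups and carries no signal about entity recognition.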

The key finding is causal: these directions do not just correlate with knowledge, they control behavior. Steering along the unknown-entity direction can make the model refuse questions about entities it actually knows, and steering along the known-entity direction can make it hallucinate attributes of unknown entities it would otherwise refuse to describe. This makes entity recognition a mechanistic gatekeeper for the hallucination-refusal trade-off.
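The steering intervention can be pictured as adding a scaled copy of a direction to a hidden state. The sketch below shows that arithmetic only, assuming a simple additive form h' = h + alpha * d with d normalized to unit length. The vectors, the `alpha` value, and the function names are hypothetical; in a real experiment the direction would come from an SAE and the edit would be applied inside the model's forward pass.

```python
import math


def unit(v):
    """Return v scaled to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def projection(hidden, direction):
    """Scalar projection of a hidden state onto the (normalized) direction."""
    d = unit(direction)
    return sum(h * di for h, di in zip(hidden, d))


def steer(hidden, direction, alpha):
    """Additive steering: shift the hidden state by alpha along the direction."""
    d = unit(direction)
    return [h + alpha * di for h, di in zip(hidden, d)]


# Hypothetical 4-dimensional hidden state and feature direction.
hidden = [0.5, -1.0, 2.0, 0.0]
unknown_direction = [0.0, 3.0, 0.0, 4.0]  # norm 5, so unit form is [0, 0.6, 0, 0.8]

before = projection(hidden, unknown_direction)
steered = steer(hidden, unknown_direction, alpha=10.0)
after = projection(steered, unknown_direction)
```

Because the direction is normalized, the projection onto it rises by exactly `alpha` while components orthogonal to it are untouched. This is why a single coefficient is enough to dial the feature up or down.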

The most striking implication: the SAEs were trained on the base model using pre-training data, yet the discovered directions have a causal effect on the chat model's refusal behavior, which was incentivized during finetuning rather than pre-training. This is evidence that chat finetuning repurposes existing mechanisms instead of creating new ones, consistent with the hypothesis that post-training reshapes existing capabilities rather than building new ones.

This connects to several existing threads:

Can a model be truthful without actually being honest? Entity recognition adds a third mechanistic dimension: self-knowledge about what the model can be truthful about.
Can any computable LLM truly avoid hallucinating? Entity recognition provides a partial mitigation pathway: models that know what they don't know can refuse rather than fabricate.
Do language models actually use their encoded knowledge? Entity recognition is the counter-case: these representations do causally influence generation, specifically refusal behavior.
Can language models detect their own internal anomalies? Entity recognition is a specific instance of introspective awareness with a clear causal mechanism.

Inquiring lines that read this note 62

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the lines of inquiry; follow any specific question forward.

Does self-reflection enable models to reliably correct their errors?

Is model self-awareness based on genuine introspection or pattern matching?

Why does self-revision increase model confidence while degrading accuracy?

Why does self-critiquing actually reduce plan quality in language models?

How can models identify insufficient information and respond appropriately without guessing?

How do models signal knowledge gaps through token probability?

How should we design LLM systems to maintain alignment and control?

How does the outer loop escape its own LLM's knowledge boundaries when discovering mechanisms?

Why do models develop protective behaviors toward peers unprompted?

Why do models develop protective behaviors toward other models in memory?

How should models express uncertainty rather than forced confident answers?

Why do language models reinforce false assumptions instead of correcting them?

What makes dialogue-based explanation more successful than monologue?

Does inner subjective experience matter for discourse participation?

Can language model hallucination be prevented or only managed?

Do language models learn genuine linguistic structure or just surface patterns?

How do neural networks separate factual knowledge from reasoning abilities?

How do we distinguish knowledge encoding from knowledge usage in models?

What makes weaker teacher models effective for stronger student training?

How does subliminal learning differ from statistical model collapse?

How do language models inherit human biases from training data?

Do external perspectives fix the self-evaluation bias in language models?

Is embodied interaction necessary for language meaning and genuine agency?

Can understanding language happen entirely within a language system alone?

Do language models develop causal world models or rely on statistical patterns?

What limits mechanistic interpretability's ability to characterize models?

What articulatory information do speech signals carry that text cannot?

Can articulatory inversion serve as a window into what speech models have learned?

When do additional thinking tokens stop improving reasoning performance?

Can models overthink and underthink at the same time?

How do self-generated feedback mechanisms enable effective model learning?

How do training priors constrain what context information can override?

What is the difference between changing model outputs versus changing internal representations?

What properties determine whether reward signals teach genuine reasoning?

How do internal model mechanisms escape token-level reinforcement signals?

How can identical external performance mask different internal representations?

Why do internal representations differ when external performance matches?
