SYNTHESIS NOTE

Do models know what they don't know?

Can language models develop internal representations that track their own knowledge boundaries? This matters because understanding self-knowledge mechanisms could explain how models choose between hallucination and refusal.

Synthesis note · 2026-02-23 · sourced from Knowledge Graphs

Using sparse autoencoders (SAEs) on Gemma 2 (2B and 9B), researchers found that models develop internal representations of whether they "know" an entity, a form of self-knowledge about their own capabilities. These entity recognition directions in representation space indicate whether the model recognizes an entity, that is, whether it can recall facts about it (for example, flagging that it does not know a specific athlete or movie).
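One way to picture how such directions might be found is to encode activations with an SAE and look for latents that fire on known entities but not on unknown ones. The following is a toy sketch only: the ReLU encoder, the frequency-difference score, and every weight and activation value are illustrative assumptions, not the researchers' actual procedure or data.

```python
def relu_sae_encode(x, W_enc, b_enc):
    """Encode one activation vector x into SAE latent activations (ReLU encoder)."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_enc, b_enc)
    ]


def firing_frequency(latents_per_example, j):
    """Fraction of examples on which latent j is active (greater than zero)."""
    active = sum(1 for z in latents_per_example if z[j] > 0)
    return active / len(latents_per_example)


def separation_scores(known_acts, unknown_acts, W_enc, b_enc):
    """Per-latent score: firing frequency on known minus frequency on unknown entities."""
    known_z = [relu_sae_encode(x, W_enc, b_enc) for x in known_acts]
    unknown_z = [relu_sae_encode(x, W_enc, b_enc) for x in unknown_acts]
    return [
        firing_frequency(known_z, j) - firing_frequency(unknown_z, j)
        for j in range(len(b_enc))
    ]


# Hypothetical 2-dimensional activations and a 2-latent SAE (made-up numbers).
W_enc = [[1.0, 0.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0]
known_acts = [[1.0, 0.3], [0.8, -0.2]]      # stand-ins for tokens of recognized entities
unknown_acts = [[-0.9, 0.1], [-1.2, 0.4]]   # stand-ins for tokens of unrecognized entities

scores = separation_scores(known_acts, unknown_acts, W_enc, b_enc)
print(scores)
```

In this toy setup, a score near +1 marks a candidate "known entity" latent and a score near -1 a candidate "unknown entity" latent; real SAEs have many thousands of latents and far noisier separation.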

The key finding is causal: these directions do not just correlate with knowledge, they actively control behavior. Activating entity recognition features can steer the model either to refuse questions about entities it actually knows, or to hallucinate attributes of unknown entities when it would otherwise refuse. This makes entity recognition a mechanistic gatekeeper for the hallucination-refusal trade-off.
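The steering idea can be sketched as adding a scaled direction to a hidden activation and observing how a downstream decision changes. This is a minimal illustration under stated assumptions: `known_dir`, the activation vectors, the scale `alpha`, and the `toy_refuses` readout are all hypothetical stand-ins, and a real model's refusal behavior is not a single threshold on one projection.

```python
import math


def unit(v):
    """Normalize a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steer(h, direction, alpha):
    """Activation steering: add alpha times the unit direction to activation h."""
    u = unit(direction)
    return [hi + alpha * ui for hi, ui in zip(h, u)]


def toy_refuses(h, known_dir, threshold=0.0):
    """Hypothetical readout: refuse when the projection onto known_dir is low."""
    return dot(h, unit(known_dir)) < threshold


# Made-up direction and activations for illustration only.
known_dir = [2.0, 0.0]
h_known = [1.0, 0.5]      # stand-in for an entity the model recognizes
h_unknown = [-1.0, 0.5]   # stand-in for an entity the model does not recognize

# Push the known entity away from the direction, and the unknown entity toward it.
steered_known = steer(h_known, known_dir, alpha=-3.0)
steered_unknown = steer(h_unknown, known_dir, alpha=3.0)

print(toy_refuses(h_known, known_dir), toy_refuses(steered_known, known_dir))
print(toy_refuses(h_unknown, known_dir), toy_refuses(steered_unknown, known_dir))
```

In the toy readout, the steered known entity flips to a refusal and the steered unknown entity flips to an answer, mirroring the two interventions described above. In practice the intervention is applied to a model's residual stream during the forward pass rather than to a standalone vector.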

The most striking implication: the SAEs were trained on the base model using pre-training data, yet the discovered directions have a causal effect on the chat model's refusal behavior, which was incentivized during finetuning rather than pre-training. This is evidence that chat finetuning repurposes existing mechanisms rather than creating new ones, consistent with the hypothesis that post-training reshapes capabilities the base model already has rather than building them from scratch.



Original note title

Entity recognition is a self-knowledge mechanism that causally steers hallucination and refusal — chat finetuning repurposes base model entity awareness