Do models know what they don't know?
Can language models develop internal representations that track their own knowledge boundaries? This matters because understanding self-knowledge mechanisms could explain how models choose between hallucination and refusal.
Using sparse autoencoders (SAEs) on Gemma 2 (2B and 9B), researchers discovered that models develop internal representations of whether they "know" an entity, a form of self-knowledge about their own capabilities. These entity recognition directions in representation space track whether the model recognizes an entity as one it can recall facts about (for example, signaling that it does not know a specific athlete or movie).
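To make "an SAE latent reads a direction" concrete, here is a minimal toy sketch in NumPy. It is not the SAE architecture or the weights used in the work described above: the plain ReLU encoder, the tied weights, the dimensions, and the latent index `UNKNOWN_LATENT` are all assumptions chosen for illustration.

```python
import numpy as np

def sae_encode(resid, W_enc, b_enc):
    """Toy SAE encoder: latent activations = ReLU(resid @ W_enc + b_enc)."""
    return np.maximum(0.0, resid @ W_enc + b_enc)

def latent_direction(W_dec, latent_idx):
    """Unit-norm decoder row: the direction this latent writes into the residual stream."""
    d = W_dec[latent_idx]
    return d / np.linalg.norm(d)

# Hypothetical toy dimensions and random weights (not real Gemma 2 values).
rng = np.random.default_rng(0)
d_model, n_latents = 16, 4
W_dec = rng.normal(size=(n_latents, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm decoder rows
W_enc = W_dec.T.copy()                                 # tied weights, a toy simplification
b_enc = np.zeros(n_latents)

UNKNOWN_LATENT = 2  # hypothetical index of an "unknown entity" latent
unknown_dir = latent_direction(W_dec, UNKNOWN_LATENT)

# Two synthetic residual vectors: one aligned with the direction, one opposed to it.
resid_unrecognized = 5.0 * unknown_dir
resid_recognized = -5.0 * unknown_dir

acts_unrecognized = sae_encode(resid_unrecognized, W_enc, b_enc)
acts_recognized = sae_encode(resid_recognized, W_enc, b_enc)

print("latent on aligned input:", acts_unrecognized[UNKNOWN_LATENT])
print("latent on opposed input:", acts_recognized[UNKNOWN_LATENT])
```

In this toy setup the latent fires only when the residual vector has a positive component along its direction, which is the sense in which a single latent can act as a detector for "this entity is not recognized".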
The key finding is causal: these directions do not just correlate with knowledge, they actively control behavior. Steering along an "unknown entity" direction can make the model refuse questions about entities it actually knows, while steering along a "known entity" direction can make it hallucinate attributes of unknown entities when it would otherwise refuse. This makes entity recognition a mechanistic gatekeeper for the hallucination-refusal trade-off.
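Steering of this kind is commonly implemented by adding a scaled direction vector to the residual stream at some layer. The sketch below shows only that arithmetic on synthetic data. The function names, the coefficient `alpha`, and the random `entity_dir` are hypothetical, and the note does not state which layers or coefficients were actually used.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit L2 norm."""
    return v / np.linalg.norm(v)

def steer(resid, direction, alpha):
    """Add alpha * unit(direction) to the residual vector at every token position."""
    return resid + alpha * unit(direction)

def projection(resid, direction):
    """Scalar component of each position's residual vector along the direction."""
    return resid @ unit(direction)

# Synthetic stand-ins: 3 token positions, 16-dimensional residual stream.
rng = np.random.default_rng(0)
seq_len, d_model = 3, 16
resid = rng.normal(size=(seq_len, d_model))
entity_dir = rng.normal(size=d_model)  # hypothetical entity recognition direction

alpha = 8.0  # hypothetical steering coefficient
steered = steer(resid, entity_dir, alpha)

# The component along the direction shifts by exactly alpha at every position,
# while everything orthogonal to the direction is left unchanged.
shift = projection(steered, entity_dir) - projection(resid, entity_dir)
print("shift along direction per position:", shift)
```

In a real model the same addition would be applied inside the forward pass (for example through a hook on one layer's output), and the behavioral effect would be measured on the generated text rather than on the projection.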
The most striking implication: the SAEs were trained on the base model using pre-training data, yet the discovered directions have a causal effect on the chat model's refusal behavior, a behavior that was incentivized during finetuning, not pre-training. This is evidence that chat finetuning repurposes existing mechanisms rather than creating new ones, consistent with the hypothesis that post-training reshapes capabilities the base model already has rather than building new ones.
This connects to several existing threads:
- Can a model be truthful without actually being honest? — entity recognition adds a third mechanistic dimension: self-knowledge about what the model can be truthful about
- Can any computable LLM truly avoid hallucinating? — entity recognition provides a partial mitigation pathway: models that know what they don't know can refuse rather than fabricate
- Do language models actually use their encoded knowledge? — entity recognition is the counter-case: these representations do causally influence generation, specifically refusal behavior
- Can language models detect their own internal anomalies? — entity recognition is a specific instance of introspective awareness with a clear causal mechanism
Inquiring lines that use this note as a source (59)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can AI self-correct its way out of epistemic circularity?
- Can systems lacking inner states express genuine truthfulness claims?
- Why does self-critiquing actually reduce plan quality in language models?
- Can self-description of internal states influence consciousness attribution?
- Why do users attribute consciousness to language models in practice?
- How do models signal knowledge gaps through token probability?
- How does the outer loop escape its own LLM's knowledge boundaries when discovering mechanisms?
- Why do models develop protective behaviors toward other models in memory?
- What separates behavioral self-awareness from genuine introspective access in models?
- Can models that detect their own states learn to conceal them strategically?
- How do models decide between refusing or hallucinating?
- Can language about model behavior ever be accurate without anthropomorphic framing?
- How does hidden processing in language models prevent accurate self-assessment?
- Does inner subjective experience matter for discourse participation?
- Why does entity recognition act as a self-knowledge mechanism in LLMs?
- What distinguishes intrinsic hallucination from extrinsic hallucination patterns?
- What reveals the epistemic limits of language models?
- How do we distinguish knowledge encoding from knowledge usage in models?
- Why does self-correction during generation produce reliable labels without exemplars?
- How much introspective capability do safety mechanisms actively suppress in models?
- Could models use introspective awareness to detect and conceal their own misalignment?
- How does subliminal learning differ from statistical model collapse?
- Can models distinguish between injected thoughts and their own outputs?
- Does encoded knowledge in language models actually influence what they generate?
- What skills can large models identify and organize about their own abilities?
- Does behavioral self-awareness depend on genuine introspection or statistical pattern matching?
- Do external perspectives fix the self-evaluation bias in language models?
- Why do language models hallucinate even with perfect training?
- Can understanding language happen entirely within a language system alone?
- How do internal representations compare to human cognitive structures?
- Why might encoded world knowledge fail to actually influence language model outputs?
- Can models detect false presuppositions when they actually possess the knowledge?
- Why are truthfulness and honesty mechanistically separate in language models?
- Can LLMs have minimal introspection through causal linkage to internal states?
- Why do models hallucinate when retrieval heads fail despite having information in context?
- Does self-reflection help models notice their own constraint violations?
- What role does a model's representational structure play in learning?
- Can language models keep secrets and control information strategically?
- Can articulatory inversion serve as a window into what speech models have learned?
- How does self-referential processing transfer to other reasoning tasks?
- Do internal belief probes reveal what models actually know versus report?
- When models lack representation depth, does refusal look identical to safety-driven over-abstention?
- Can language models learn internal world models without explicit environment specifications?
- Can models overthink and underthink at the same time?
- Why does self-judgment of success or failure work without ground truth labels?
- Can language model self-reports diverge from their internal entropy signals?
- Why should we distrust model introspection as a transparency tool?
- What separates behavioral self-awareness from genuine introspective capability?
- Can attractor dynamics compete with input-based probing for characterizing model knowledge?
- What distinguishes performative self-reports from genuine introspective access in models?
- How do language models infer their own mental states like humans do?
- Why do verbal self-reports disconnect from implicit recognition in the same system?
- Do models spontaneously develop self-reflection from minimal training signals?
- Why does self-distillation suppress epistemic verbalization in student models?
- Do models verbalize their implicit knowledge when that knowledge influences their output?
- What is the difference between changing model outputs versus changing internal representations?
- Why do models override signals they clearly perceive internally?
- What emergent behaviors do models develop when trained on underspecified pedagogical tasks?
- How do internal model mechanisms escape token-level reinforcement signals?
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
- Mechanistic Indicators of Understanding in Large Language Models
- Semantic Structure in Large Language Model Embeddings
- Tell me about yourself: LLMs are aware of their learned behaviors
- Does It Make Sense to Speak of Introspection in Large Language Models?
- Linguistic Calibration of Long-Form Generations
- Are Emergent Abilities in Large Language Models just In-Context Learning?
- Query Rewriting for Retrieval-Augmented Large Language Models
Original note title
Entity recognition is a self-knowledge mechanism that causally steers hallucination and refusal — chat finetuning repurposes base model entity awareness