Do language models actually use their encoded knowledge?
Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This note explores the gap between knowing and doing.
This is one of the more precise and counterintuitive findings in LLM interpretability: a knowledge probe can confirm that a fact is encoded in the model's internal representations — it can be extracted by a linear classifier — while that same fact fails to causally influence downstream generation.
The REMEDI paper is explicit: "even when an LM encodes information in its representations, this information may not causally influence subsequent generation." This has been independently documented by Ravfogel et al. (2020), Elazar et al. (2021), and Ravichander et al. (2021).
The mechanism: LM representations are computed as part of the forward pass, but which aspects of those representations actually influence the token generation at the end depends on attention patterns and downstream computations. A fact can be "stored" in a representation without that storage being on the causal path to the output.
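The mechanism above can be illustrated with a toy sketch. It assumes a hypothetical hidden state in which a binary fact is written along one direction (`u`) while the output head (`w`) reads a different, orthogonal direction. All names, dimensions, and data are invented for illustration; this is not taken from any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 500

# Hypothetical setup: the fact lives along u, the output head reads along w,
# and u is orthogonal to w (so the fact is encoded but off the causal path).
u = np.zeros(d); u[0] = 1.0
w = np.zeros(d); w[1] = 1.0

fact = rng.integers(0, 2, size=n)        # binary fact per example
signal = rng.normal(size=n)              # what actually drives the toy output
H = (0.1 * rng.normal(size=(n, d))
     + np.outer(2.0 * fact - 1.0, u)
     + np.outer(signal, w))

def fit_linear_probe(H, y):
    """Least-squares linear probe with a bias term, targets in {-1, +1}."""
    X = np.hstack([H, np.ones((len(H), 1))])
    coef, *_ = np.linalg.lstsq(X, 2.0 * y - 1.0, rcond=None)
    return coef

def probe_accuracy(coef, H, y):
    X = np.hstack([H, np.ones((len(H), 1))])
    return float(np.mean((X @ coef > 0) == (y == 1)))

probe = fit_linear_probe(H, fact)
acc_before = probe_accuracy(probe, H, fact)

# Intervention: mirror every hidden state along u, flipping the encoded fact.
H_flipped = H - 2.0 * np.outer(H @ u, u)
acc_after = probe_accuracy(probe, H_flipped, fact)

out_before = H @ w
out_after = H_flipped @ w

print("probe accuracy before flip:", acc_before)
print("probe accuracy after flip: ", acc_after)
print("output unchanged by flip:  ", np.allclose(out_before, out_after))
```

In this construction the probe decodes the fact almost perfectly, yet flipping the encoded fact leaves the output untouched, because the readout never looks along `u`. Real models are far less cleanly separated, so this is only a picture of the logical possibility, not evidence about any particular LM.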
This breaks a common assumption in interpretability and evaluation: that probing success implies behavioral relevance. If you can decode that the model "knows" something, you might assume it will generate outputs consistent with that knowledge. But this assumption is empirically false: the model may encode a fact and still fail to use it.
The practical consequence for REMEDI: effective knowledge editing requires finding causal directions — representations that, when modified, actually change the output. Simply finding where knowledge is encoded is not sufficient. This is why REMEDI adds edited fact vectors to specific layers at specific tokens, not just anywhere in the representation.
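Why the location of an edit matters can be sketched with a toy readout. The example assumes a hypothetical single step in which the output attends over token states and then applies a linear head; the function names, shapes, and attention scores are invented, only token position is modeled (not layer), and this is not REMEDI's actual implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def readout(hidden, attn_scores, w_out):
    """Toy final step: attention-weighted mix of token states, then a linear head."""
    attn = softmax(attn_scores)      # shape (seq_len,)
    pooled = attn @ hidden           # shape (d,)
    return float(pooled @ w_out)

def add_edit_vector(hidden, position, delta):
    """Return a copy of hidden with delta added at a single token position."""
    edited = hidden.copy()
    edited[position] += delta
    return edited

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 3))                    # (seq_len, d), hypothetical
attn_scores = np.array([0.0, 5.0, 0.0, -np.inf])    # token 1 dominates, token 3 is masked
w_out = np.array([1.0, 0.0, 0.0])                   # head reads only dimension 0

base = readout(hidden, attn_scores, w_out)
read_dir = np.array([1.0, 0.0, 0.0])     # direction the head reads
unread_dir = np.array([0.0, 1.0, 0.0])   # direction the head ignores

effect_right = readout(add_edit_vector(hidden, 1, read_dir), attn_scores, w_out) - base
effect_wrong_direction = readout(add_edit_vector(hidden, 1, unread_dir), attn_scores, w_out) - base
effect_wrong_position = readout(add_edit_vector(hidden, 3, read_dir), attn_scores, w_out) - base

print("attended token, read direction:  ", effect_right)
print("attended token, unread direction:", effect_wrong_direction)
print("masked token, read direction:    ", effect_wrong_position)
```

Only the first edit moves the output. The same vector added at a token the readout does not attend to, or a vector added along a direction the head does not read, is stored in the representation but has no downstream effect.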
For interpretability broadly: probing is a necessary but not sufficient condition for behavioral inference. Encoding ≠ generation.
Mechanistic interventions that close the gap: two mechanistic interpretability approaches directly address this encoding-generation dissociation.
Inference-Time Intervention (ITI) identifies a subset of attention heads where "truthful" directions can be extracted, then shifts activations along those directions at inference time, raising the truthfulness of Alpaca (an instruction-finetuned LLaMA) from 32.5% to 65.1% on TruthfulQA. The key insight: the model "knows" more than it "says," and the gap can be partially closed by targeting specific attention heads rather than the full representation.
Eliciting Latent Knowledge (ELK) confirms this from a different angle: linear probes in middle layers can report a model's knowledge independently of what the model outputs, even when the model has been finetuned to produce systematically untruthful responses.
Together, ITI and ELK demonstrate that the encoding-generation gap is not absolute: it can be bridged through targeted intervention on the causal pathways between encoded knowledge and generation.
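The ITI-style procedure described above (pick a few heads, find a direction, shift along it) can be sketched on synthetic data. This is a simplified illustration under stated assumptions: the activations are random stand-ins, the direction is a plain difference of class means, head selection uses in-sample accuracy of a threshold on that direction rather than held-out probes, and `alpha`, `top_k`, and all array names are hypothetical. It is not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head, n = 6, 4, 400

# Synthetic stand-in for per-head activations with truthful (1) / untruthful (0) labels.
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, n_heads, d_head))
# Toy assumption: only heads 1 and 4 carry a truthfulness signal.
signal_dir = np.array([1.0, 0.0, 0.0, 0.0])
for h in (1, 4):
    acts[:, h, :] += np.outer(2.0 * labels - 1.0, signal_dir)

def mean_difference_direction(x, y):
    """Unit vector from the untruthful class mean to the truthful class mean."""
    diff = x[y == 1].mean(axis=0) - x[y == 0].mean(axis=0)
    return diff / np.linalg.norm(diff)

def direction_accuracy(x, y, direction):
    """Accuracy of thresholding the projection midway between class means."""
    proj = x @ direction
    threshold = 0.5 * (proj[y == 1].mean() + proj[y == 0].mean())
    return float(np.mean((proj > threshold) == (y == 1)))

directions = np.stack([mean_difference_direction(acts[:, h], labels)
                       for h in range(n_heads)])
scores = np.array([direction_accuracy(acts[:, h], labels, directions[h])
                   for h in range(n_heads)])

top_k = 2
selected = np.argsort(scores)[-top_k:]   # heads where the direction separates best

def intervene(head_acts, alpha=5.0):
    """Shift only the selected heads along their direction, scaled by alpha * std."""
    shifted = head_acts.copy()
    for h in selected:
        sigma = (acts[:, h] @ directions[h]).std()
        shifted[h] += alpha * sigma * directions[h]
    return shifted

print("per-head accuracy:", np.round(scores, 2))
print("selected heads:   ", sorted(selected.tolist()))
```

In a real model the shift would be applied to attention-head outputs during the forward pass, and whether it changes generated text is an empirical question that this sketch does not answer.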
Inquiring lines that use this note as a source 14
This note is a source for these synthesized inquiries.
- How does content-only knowledge in LLMs enable pretraining popularity to leak through?
- Is relevant knowledge encoded in LMs but not causally active in generation?
- What makes knowledge editing different from simply finding where facts are stored?
- How do we distinguish knowledge encoding from knowledge usage in models?
- Does encoded knowledge in language models actually influence what they generate?
- Does encoding information in LM representations guarantee it influences output?
- When does encoded knowledge fail to influence language model generation?
- Why is extracting training data insufficient proof that models memorize?
- How do LLMs infer information that was explicitly censored?
- Why might encoded world knowledge fail to actually influence language model outputs?
- Can linear probing detect all the concepts a language model actually uses?
- Do speech encoders actually learn the physics of how vocal tracts produce sound?
- Do internal belief probes reveal what models actually know versus report?
- Do models verbalize their implicit knowledge when that knowledge influences their output?
Related concepts in this collection 8
- Why do language models ignore information in their context?
  Explores why language models sometimes override contextual information with prior training associations, and whether providing more context can solve this problem.
  a specific case where encoding (the contextual information) fails to influence generation
- Do classical knowledge definitions apply to AI systems?
  Classical definitions of knowledge assume truth-correspondence and a human knower. Do these assumptions hold for LLMs and distributed neural knowledge systems, or do they need fundamental revision?
  this finding further complicates what "knowledge" means in LLMs
- Why does reasoning training help math but hurt medical tasks?
  Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
  mechanistic substrate: layer localization explains the encoding-generation gap, since lower-layer knowledge may fail to causally influence generation when higher-layer reasoning adjustment overrides or misapplies it
- Can a model be truthful without actually being honest?
  Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?
  the RepE framework provides the theoretical basis: truthfulness (output matches facts) and honesty (output matches beliefs) are separable, and the encoding-generation gap is one mechanism that produces their divergence
- Can high-level concepts replace circuit-level analysis in AI?
  Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
  ITI and ELK are both RepE-style interventions that work at the representation level rather than the circuit level
- Do personality traits activate hidden emoji patterns in language models?
  When large language models are fine-tuned on personality traits, do they spontaneously generate emojis that were never in their training data? This explores whether personality adjustment activates latent, pre-existing patterns in model weights.
  a positive counterexample: personality-associated emoji patterns are encoded latently during pre-training and DO causally emerge through fine-tuning, demonstrating that the encoding-generation gap can be closed by targeted parameter-efficient activation of specific neurons
- Why do language models fail to act on their own reasoning?
  LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
  quantified instance: the 87% correct rationales vs 64% correct actions demonstrate the encoding-generation gap in action selection; the reasoning trace is generated through one pathway while action selection draws on shallower habitual computations, and RL fine-tuning partially closes the gap
- Can we trigger reasoning without explicit chain-of-thought prompts?
  This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
  positive counterexample for reasoning: SAE-identified reasoning features ARE on the causal path; steering one feature activates reasoning across 6 model families, demonstrating that for reasoning specifically the encoding-generation gap can be fully closed
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts
- A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Language models show human-like content effects on reasoning tasks
- Word Meanings in Transformer Language Models
- Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
- The Earth is Flat because...: Investigating LLMs' Belief towards Misinformation via Persuasive Conversation
Original note title
information encoded in lm representations may not causally influence generation