SYNTHESIS NOTE

Do language models actually use their encoded knowledge?

Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This note explores the gap between knowing and doing.

Synthesis note · 2026-02-21 · sourced from Discourses

This is one of the more precise and counterintuitive findings in LLM interpretability: a knowledge probe can confirm that a fact is encoded in the model's internal representations — it can be extracted by a linear classifier — while that same fact fails to causally influence downstream generation.

The REMEDI paper is explicit: "even when an LM encodes information in its representations, this information may not causally influence subsequent generation." This has been independently documented by Ravfogel et al. (2020), Elazar et al. (2021), and Ravichander et al. (2021).

The mechanism: LM representations are computed as part of the forward pass, but which aspects of those representations actually influence the token generation at the end depends on attention patterns and downstream computations. A fact can be "stored" in a representation without that storage being on the causal path to the output.
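The dissociation can be illustrated with a deliberately tiny toy, which is a sketch under stated assumptions and not a model of any real LM. The names `encode`, `generate`, and `train_probe` are hypothetical. The "representation" linearly encodes two binary facts, but the toy readout only listens to the first dimension, so a probe decodes the second fact perfectly while intervening on it never changes the output.

```python
from itertools import product

# Toy setup (all names hypothetical): the hidden state linearly encodes two
# binary facts, one per dimension.
def encode(fact_a, fact_b):
    return [float(fact_a), float(fact_b)]

def generate(h):
    # Assumption of this toy: only h[0] is on the causal path to the output.
    return 1 if h[0] > 0.5 else 0

def probe_predict(w, b, h):
    return 1 if w[0] * h[0] + w[1] * h[1] + b > 0 else 0

def train_probe(states, labels, epochs=100):
    """Perceptron probe: a linear classifier trained to read a label off h."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for h, y in zip(states, labels):
            err = y - probe_predict(w, b, h)
            w = [w[0] + err * h[0], w[1] + err * h[1]]
            b += err
    return w, b

facts = list(product([0, 1], repeat=2))
states = [encode(a, b) for a, b in facts]
labels_b = [b for _, b in facts]

# Probing: fact_b is linearly decodable from the representation.
w, bias = train_probe(states, labels_b)
probe_accuracy = sum(
    probe_predict(w, bias, h) == y for h, y in zip(states, labels_b)
) / len(states)

# Causal test: flip one encoded fact and count how often the output changes.
changed_by_b = sum(generate(h) != generate([h[0], 1.0 - h[1]]) for h in states)
changed_by_a = sum(generate(h) != generate([1.0 - h[0], h[1]]) for h in states)

print(probe_accuracy, changed_by_b, changed_by_a)
```

In this toy the probe on `fact_b` succeeds on every state, yet flipping `fact_b` changes no outputs, while flipping `fact_a` changes all of them. Probe accuracy alone cannot distinguish the two cases; only the intervention does.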

This breaks a common assumption in interpretability and evaluation: that probing success implies behavioral relevance. If you can decode that the model "knows" something, you might assume it will generate outputs consistent with that knowledge. The work cited above shows that this assumption does not hold in general: a model may encode a fact and still fail to use it.

The practical consequence for REMEDI: effective knowledge editing requires finding causal directions — representations that, when modified, actually change the output. Simply finding where knowledge is encoded is not sufficient. This is why REMEDI adds edited fact vectors to specific layers at specific tokens, not just anywhere in the representation.
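Why location and direction both matter can be sketched with a toy additive edit. This is an illustration only, not REMEDI's learned editor: `add_edit`, `readout`, the fixed attention weights, and all numbers are assumptions made up for the example. An edit vector changes the output only when it is added at a token the downstream computation reads from, and along a direction the readout is sensitive to.

```python
def add_edit(hidden, token_idx, edit_vec):
    """Return a copy of per-token hidden states with edit_vec added at one token."""
    return [
        [x + d for x, d in zip(h, edit_vec)] if t == token_idx else list(h)
        for t, h in enumerate(hidden)
    ]

def readout(hidden, attention, w_out):
    """Toy downstream computation: attention-weighted pooling, then a linear score."""
    pooled = [
        sum(a * h[i] for a, h in zip(attention, hidden))
        for i in range(len(w_out))
    ]
    return sum(w * p for w, p in zip(w_out, pooled))

# Hypothetical states for 3 tokens at one layer, 2 dimensions each.
hidden = [[0.2, 0.1], [1.0, -1.0], [0.0, 0.3]]
attention = [0.0, 1.0, 0.0]   # assumption: the output reads only token 1 (the "entity")
w_out = [1.0, 0.0]            # assumption: the output is sensitive only to dimension 0

base = readout(hidden, attention, w_out)

# Edit at the attended token, along a direction the readout uses: output changes.
edited_entity = readout(add_edit(hidden, 1, [-2.0, 0.0]), attention, w_out)

# Same vector at a token the output never reads: no effect.
edited_other_token = readout(add_edit(hidden, 2, [-2.0, 0.0]), attention, w_out)

# Right token, but a direction the readout ignores: no effect.
edited_wrong_dir = readout(add_edit(hidden, 1, [0.0, 5.0]), attention, w_out)

print(base, edited_entity, edited_other_token, edited_wrong_dir)
```

Only the first edit lies on the causal path in this toy. The other two modify the representation, and a probe could detect the change, but generation is unaffected.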

For interpretability broadly: probing success is at best a necessary condition for inferring behavior, not a sufficient one. Encoding ≠ generation.

Mechanistic interventions that close the gap: Two mechanistic interpretability approaches directly address this encoding-generation dissociation.

Inference-Time Intervention (ITI) identifies a subset of attention heads where "truthful" directions can be extracted, then shifts activations along those directions at inference time, improving the truthfulness of Alpaca (an instruction-finetuned LLaMA) from 32.5% to 65.1% on TruthfulQA. The key insight: the model "knows" more than it "says," and the gap can be partially closed by targeting specific attention heads rather than the full representation.

Eliciting Latent Knowledge (ELK) confirms this from a different angle: linear probes in middle layers can report a model's knowledge independently of what the model outputs, even when the model has been finetuned to produce systematically untruthful responses.

Together, ITI and ELK demonstrate that the encoding-generation gap is not absolute: the encoded knowledge stays readable even when outputs contradict it, and the gap can be partially bridged through targeted intervention on the causal pathways between encoded knowledge and generation.
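An ITI-style shift can be sketched as follows. This is a minimal illustration, assuming a mean-difference direction per head and a shift of alpha times the standard deviation of activations along that direction. The function names, the head keys such as `(12, 3)`, and all activation values are hypothetical, and head selection (done with probes in ITI) is taken as given.

```python
import math
from statistics import pstdev

def mean_vec(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def truthful_direction(truthful_acts, untruthful_acts):
    """Unit vector from the mean untruthful activation toward the mean truthful one."""
    diff = [t - u for t, u in zip(mean_vec(truthful_acts), mean_vec(untruthful_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def projection_std(acts, direction):
    """Standard deviation of activations projected onto the direction."""
    return pstdev(sum(x * d for x, d in zip(a, direction)) for a in acts)

def shift_heads(head_acts, directions, scales, alpha):
    """Shift only the selected heads: x <- x + alpha * sigma * theta."""
    out = {}
    for head, x in head_acts.items():
        if head in directions:
            step = alpha * scales[head]
            out[head] = [xi + step * d for xi, d in zip(x, directions[head])]
        else:
            out[head] = list(x)   # unselected heads are left untouched
    return out

# Hypothetical activations of one selected head, keyed as (layer, head).
truthful = [[2.0, 1.0], [2.0, -1.0]]
untruthful = [[-2.0, 1.0], [-2.0, -1.0]]

theta = truthful_direction(truthful, untruthful)
sigma = projection_std(truthful + untruthful, theta)

head_acts = {(12, 3): [0.5, 0.5], (12, 4): [0.5, 0.5]}
shifted = shift_heads(head_acts, {(12, 3): theta}, {(12, 3): sigma}, alpha=1.5)
print(theta, sigma, shifted)
```

The design point the sketch is meant to show is selectivity: only the head with an extracted direction is moved, and only along that direction, which mirrors the claim that targeting specific heads works better than perturbing the full representation.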

Inquiring lines that read this note 14

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How should we design LLM systems to maintain alignment and control?

How does content-only knowledge in LLMs enable pretraining popularity to leak through?

Do language models learn genuine linguistic structure or just surface patterns?

Why does finetuning cause catastrophic forgetting of model capabilities?

What makes knowledge editing different from simply finding where facts are stored?

How do neural networks separate factual knowledge from reasoning abilities?

How do we distinguish knowledge encoding from knowledge usage in models?

How does memorization interact with learning and generalization?

Why is extracting training data insufficient proof that models memorize?

How do training priors constrain what context information can override?

How do LLMs infer information that was explicitly censored?

What articulatory information do speech signals carry that text cannot?

Do speech encoders actually learn the physics of how vocal tracts produce sound?

Is model self-awareness based on genuine introspection or pattern matching?

Related concepts in this collection 8

Why do language models ignore information in their context? Explores why language models sometimes override contextual information with prior training associations, and whether providing more context can solve this problem.
a specific case where encoding (the contextual information) fails to influence generation
Do classical knowledge definitions apply to AI systems? Classical definitions of knowledge assume truth-correspondence and a human knower. Do these assumptions hold for LLMs and distributed neural knowledge systems, or do they need fundamental revision?
this finding further complicates what "knowledge" means in LLMs
Why does reasoning training help math but hurt medical tasks? Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
mechanistic substrate: layer localization explains the encoding-generation gap — lower-layer knowledge may fail to causally influence generation when higher-layer reasoning adjustment overrides or misapplies it
Can a model be truthful without actually being honest? Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?
RepE framework provides the theoretical basis: truthfulness (output matches facts) and honesty (output matches beliefs) are separable, and the encoding-generation gap is one mechanism that produces their divergence
Can high-level concepts replace circuit-level analysis in AI? Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
ITI and ELK are both RepE-style interventions that work at the representation level rather than the circuit level
Do personality traits activate hidden emoji patterns in language models? When large language models are fine-tuned on personality traits, do they spontaneously generate emojis that were never in their training data? This explores whether personality adjustment activates latent, pre-existing patterns in model weights.
a positive counterexample: personality-associated emoji patterns are encoded latently during pre-training and DO causally emerge through fine-tuning, demonstrating that the encoding-generation gap can be closed by targeted parameter-efficient activation of specific neurons
Why do language models fail to act on their own reasoning? LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
quantified instance: the 87% correct rationales vs 64% correct actions demonstrates the encoding-generation gap in action selection; the reasoning trace is generated through one pathway while action selection draws on shallower habitual computations, and RL fine-tuning partially closes the gap
Can we trigger reasoning without explicit chain-of-thought prompts? This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
positive counterexample for reasoning: SAE-identified reasoning features ARE on the causal path; steering one feature activates reasoning across 6 model families, demonstrating that for reasoning specifically the encoding-generation gap can be fully closed

Original note title

information encoded in lm representations may not causally influence generation
