SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation · Language, Text, and Discourse · Psychology, Society, and Alignment

Do language models actually use their encoded knowledge?

Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This note explores the gap between knowing and doing.

Synthesis note · 2026-02-21 · sourced from Discourses
What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

This is one of the more precise and counterintuitive findings in LLM interpretability: a knowledge probe can confirm that a fact is encoded in the model's internal representations — it can be extracted by a linear classifier — while that same fact fails to causally influence downstream generation.

The REMEDI paper is explicit: "even when an LM encodes information in its representations, this information may not causally influence subsequent generation." This has been independently documented by Ravfogel et al. (2020), Elazar et al. (2021), and Ravichander et al. (2021).

The mechanism: LM representations are computed as part of the forward pass, but which aspects of those representations actually influence the token generation at the end depends on attention patterns and downstream computations. A fact can be "stored" in a representation without that storage being on the causal path to the output.
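The dissociation described above can be illustrated with a deliberately tiny toy, not a real language model. In this sketch, every name (`hidden_state`, `readout`, `probe`) and the two-dimensional layout are assumptions made for illustration: one dimension stores a "fact" that a linear probe reads perfectly, while the readout assigns that dimension zero weight.

```python
import random

def hidden_state(token_feature, fact):
    # Hypothetical 2-d representation: dim 0 feeds the readout,
    # dim 1 stores the "fact" that a probe can decode.
    return [float(token_feature), float(fact)]

def readout(h):
    # Downstream computation with zero weight on dim 1,
    # so the stored fact is off the causal path to the output.
    weights = [1.0, 0.0]
    score = sum(w * x for w, x in zip(weights, h))
    return 1 if score > 0.5 else 0

def probe(h):
    # Linear probe that thresholds dim 1 to recover the fact.
    return 1 if h[1] > 0.5 else 0

random.seed(0)
examples = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(200)]

# Probing test: how often does the probe recover the fact?
probe_accuracy = sum(
    probe(hidden_state(t, f)) == f for t, f in examples
) / len(examples)

# Causal test: flip the encoded fact and count how often the output changes.
changed = 0
for t, f in examples:
    h = hidden_state(t, f)
    h_flipped = [h[0], 1.0 - h[1]]
    changed += readout(h) != readout(h_flipped)
output_change_rate = changed / len(examples)

print(probe_accuracy, output_change_rate)
```

In this toy the probe succeeds on every example while flipping the encoded fact never changes the output. Real models are far less clean, but the two tests (decode versus intervene) are the distinction the note draws.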

This breaks a common assumption in interpretability and evaluation: that probing success implies behavioral relevance. If you can decode that the model "knows" something, you might assume it will generate outputs consistent with that knowledge. That assumption does not hold in general: a model may encode a fact and still fail to use it when generating.

The practical consequence for REMEDI: effective knowledge editing requires finding causal directions, that is, directions in representation space that actually change the output when modified. Simply finding where knowledge is encoded is not sufficient. This is why REMEDI adds edited fact vectors to specific layers at specific tokens, not just anywhere in the representation.
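Why the layer, token, and direction of an edit all matter can be sketched with a toy forward pass. This is not REMEDI's implementation: the `forward` function, the doubling layer update, and the absence of cross-token mixing are simplifying assumptions chosen so the causal path is easy to see.

```python
def forward(embeddings, edit=None):
    """Toy 2-layer residual stream with no cross-token mixing.

    edit is an optional (layer, token, vector) triple; the vector is added
    to that token's state just before the given layer runs.
    """
    h = [list(vec) for vec in embeddings]
    for layer in range(2):
        if edit is not None and edit[0] == layer:
            _, token, vector = edit
            h[token] = [x + v for x, v in zip(h[token], vector)]
        # Hypothetical layer update: each token's state is doubled independently.
        h = [[2.0 * x for x in vec] for vec in h]
    # Readout uses only dim 0 of the final token.
    return h[-1][0]

# Three tokens, two dimensions each, all zeros.
embeddings = [[0.0, 0.0] for _ in range(3)]

baseline = forward(embeddings)
on_path = forward(embeddings, edit=(1, 2, [1.0, 0.0]))          # final token, dim 0
wrong_token = forward(embeddings, edit=(0, 0, [1.0, 0.0]))      # token the readout never sees
wrong_direction = forward(embeddings, edit=(1, 2, [0.0, 1.0]))  # dim the readout ignores

print(baseline, on_path, wrong_token, wrong_direction)
```

Only the edit placed on the causal path (the right token and the right direction) moves the output; the other two edits change the representation without changing what is generated.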

For interpretability broadly: probing success is a necessary but not sufficient condition for behavioral inference. Encoding ≠ generation.

Mechanistic interventions that close the gap: Two mechanistic interpretability approaches directly address this encoding-generation dissociation.

Inference-Time Intervention (ITI) identifies a subset of attention heads from which "truthful" directions can be extracted, then shifts activations along those directions at inference time. This raises the truthfulness of Alpaca, an instruction-finetuned LLaMA, from 32.5% to 65.1% on TruthfulQA. The key insight: the model "knows" more than it "says," and the gap can be partially closed by targeting specific attention heads rather than the full representation.

Eliciting Latent Knowledge (ELK) supports this from a different angle: linear probes in middle layers can report a model's knowledge independently of what the model outputs, even when the model has been finetuned to produce systematically untruthful responses.

Together, ITI and ELK show that the encoding-generation gap is not absolute: it can be bridged, at least partially, through targeted intervention on the causal pathways between encoded knowledge and generation.
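The head-level shift that ITI performs can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published implementation: `shift_heads`, the `(layer, head)` keys, and the example directions are hypothetical, and the published method also scales the shift by activation statistics, which is omitted here.

```python
def shift_heads(head_acts, directions, selected, alpha):
    """Shift the activations of selected heads along per-head directions.

    head_acts:  dict mapping (layer, head) -> activation vector
    directions: dict mapping (layer, head) -> direction vector
                (assumed to come from a probe; hypothetical here)
    selected:   set of (layer, head) keys to intervene on
    alpha:      intervention strength
    """
    shifted = {}
    for head, act in head_acts.items():
        if head in selected:
            shifted[head] = [a + alpha * d for a, d in zip(act, directions[head])]
        else:
            shifted[head] = list(act)  # untouched heads are copied as-is
    return shifted

head_acts = {(0, 0): [0.0, 1.0], (0, 1): [1.0, 1.0]}
directions = {(0, 0): [1.0, 0.0]}
selected = {(0, 0)}

shifted = shift_heads(head_acts, directions, selected, alpha=2.0)
print(shifted)
```

Only the selected head moves, and only along its own direction. Intervening on a few heads rather than the whole representation is the design choice the paragraph above attributes to ITI.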

Original note title

information encoded in lm representations may not causally influence generation