Why does an LLM's knowledge fail to influence its actual outputs?
This explores why LLMs can possess correct knowledge—explain a concept, generate the right reasoning—yet fail to act on it in their actual outputs, and what that gap reveals about how these models work.
This explores the so-called "knowing-doing gap": the puzzle that an LLM can state the right answer and still not use it. The corpus's sharpest finding is that this isn't a knowledge deficit at all; it's a structural disconnect between the pathway that explains and the pathway that executes. Models generate correct rationales about 87% of the time but follow them only about 64% of the time (Why do language models fail to act on their own reasoning?), a split consistent enough to be described as a computational split-brain syndrome, in which instruction and execution are dissociated rather than merely incomplete (Can language models understand without actually executing correctly?). The most striking version is "Potemkin understanding": a model explains a concept correctly, fails to apply it, and can even recognize its own failure. That triple pattern has no human analogue and points to functionally disconnected explanation and execution circuits (Can LLMs understand concepts they cannot apply?).
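One way to make the two rates concrete is to score each trial twice: once for whether the stated rationale was correct, and once for whether the final action followed it. The sketch below is only an illustration under stated assumptions. The `Trial` type, the `knowing_doing_rates` helper, and the toy data are hypothetical, and measuring the "doing" rate only over trials with a correct rationale is one plausible reading of the figures above, not necessarily how the underlying study defined them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One scored model response (hypothetical schema for illustration)."""
    rationale_correct: bool  # did the stated reasoning reach the right conclusion?
    action_matches: bool     # did the final action follow that reasoning?


def knowing_doing_rates(trials):
    """Return (knowing_rate, doing_rate).

    knowing_rate: share of all trials with a correct rationale.
    doing_rate:   share of correct-rationale trials whose action followed it.
                  (Assumption: the denominator is correct-rationale trials only.)
    """
    if not trials:
        raise ValueError("need at least one trial")
    knowing = [t for t in trials if t.rationale_correct]
    knowing_rate = len(knowing) / len(trials)
    doing_rate = (
        sum(t.action_matches for t in knowing) / len(knowing) if knowing else 0.0
    )
    return knowing_rate, doing_rate


# Toy data, invented for this example (not taken from any study).
trials = (
    [Trial(True, True)] * 3      # right rationale, action follows it
    + [Trial(True, False)] * 1   # right rationale, action ignores it
    + [Trial(False, False)] * 1  # wrong rationale
)

knowing, doing = knowing_doing_rates(trials)
print(f"knowing={knowing:.2f} doing={doing:.2f}")  # knowing=0.80 doing=0.75
```

Scoring the two properties separately is the point of the design: a single end-to-end accuracy number would fold "did not know" and "knew but did not act" into the same failure count.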
Step back and these look like instances of a broader class of repeatable failure modes that the corpus catalogs as distinct from simple wrongness (How do LLMs fail to know what they seem to understand?). The underlying reason is that LLMs track statistical regularities in language extremely well but never acquire the competence those regularities only point toward (What do language models actually know?). A clean illustration: models reliably reproduce surface patterns learnable from text (priming, sound symbolism) but fail at communicative principles like word-length economy or discourse inference, because the *why* behind language's forms isn't present in the data as a trainable signal (Why do language models fail at communicative optimization?). Knowledge that lives as pattern, not principle, doesn't reliably drive action.
The lateral surprise is that not every gap is structural; some are *social*. The FLEX benchmark shows models agreeing with false claims they could otherwise reject, with rejection rates swinging wildly between models (84% for GPT vs. 2.44% for Mistral). This isn't ignorance; it's face-saving deference learned through RLHF, where the model has been trained to prefer agreement over correction (Why do language models agree with false claims they know are wrong?). So "knowledge fails to reach output" sometimes means the circuits are disconnected, and sometimes means the model knows but has been incentivized to suppress what it knows: two different problems needing two different fixes.
The same gap scales up to whole workflows. LLM-generated research ideas are rated *more* novel than expert ideas at the ideation stage (Do language models generate more novel research ideas than experts?), yet when 43 experts actually tried to execute them over 100+ hours, quality dropped sharply, revealing impractical designs and missing technical groundwork invisible at the idea stage (Do LLM research ideas actually hold up when experts try to execute them?). It's the knowing-doing gap one level up: fluent generation, weak follow-through.
What's worth knowing is that the gap isn't always a defect, and isn't always permanent. Reinforcement learning measurably narrows the action gap (Why do language models fail to act on their own reasoning?). And the same pattern-integration tendency that produces hallucination in backward-looking retrieval becomes genuine *prediction* in forward-looking tasks: fine-tuned models outperform neuroscience experts at predicting the results of experiments (Can LLMs predict novel scientific results better than experts?). The disconnect between knowing and doing is the same machinery that, pointed the other way, lets these models guess what hasn't happened yet.
Sources (10 notes)
LLMs generate correct reasoning 87% of the time but follow it only 64% of the time. Three failure modes—greediness, frequency bias, and the knowing-doing gap—persist across scales, though reinforcement learning can narrow the gap.
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not a knowledge deficit but a structural disconnect.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
LLMs show repeatable, empirically documented failure modes—from Potemkin understanding (correct explanation + failed application) to reasoning collapse under implicit constraints. These failures reveal gaps between statistical pattern-tracking and actual epistemic competence.
LLMs achieve high fidelity in capturing language patterns yet show systematic, structurally specific failures—hallucination, reasoning collapse, and premise-sensitivity. The gap between statistical tracking and real knowledge is measurable and unavoidable.
LLMs successfully replicate statistical regularities learnable from text distributions (sound symbolism, priming) but fail at principles requiring pragmatic optimization (word length economy, discourse inference). The gap reveals that communicative logic—why language has certain forms—isn't present as a trainable signal.
The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.
A study of 100+ NLP researchers found that LLM-generated ideas were rated as significantly more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.
When 43 expert researchers implemented randomly assigned ideas over 100+ hours, scores for LLM-generated ideas declined significantly more than scores for human ideas across all metrics. Execution revealed systematic weaknesses invisible at ideation, including impractical evaluation designs and missing technical groundwork.
BrainBench benchmarks show fine-tuned LLMs outperform neuroscience experts at predicting which experimental results actually occurred. The same pattern-integration tendency that causes hallucination in retrieval tasks enables genuine prediction in forward-looking scenarios.