Why do LLMs generate ideas that sound novel but fail during execution?
This explores why LLM-generated ideas can look impressively novel on the page yet collapse when someone actually tries to build on them — and the corpus traces it to a structural split between the machinery that generates and the machinery that evaluates or executes.
This question is really about a gap, not a deficit. The corpus is unusually direct on it: when 43 expert researchers spent 100+ hours implementing randomly assigned ideas, LLM-generated ones declined significantly more than human ones across all metrics once execution began, exposing impractical evaluation designs and missing technical groundwork that were invisible at the idea stage (Do LLM research ideas actually hold up when experts try to execute them?). The novelty itself is real: a controlled study of 100+ NLP researchers rated LLM ideas as statistically more novel than expert ideas (p<0.05), but feasibility was already slightly lower at the idea stage (Do language models generate more novel research ideas than experts?).
The novelty and the fragility arrive together because they share a cause: LLMs generate by combining concepts without the disciplinary constraints that narrow an expert's imagination. Those constraints are also what tell an expert an idea won't work. Strip them away and you get wider, fresher combinations, and no internal brake. One line of work frames this sharply: ideation and evaluation are dissociated capabilities, and models systematically avoid the evaluative stance-taking needed to judge whether an idea is feasible (Can LLMs generate more novel ideas than human experts?). The flip side shows up too: in conceptual design, LLM outputs score higher on feasibility and usefulness but lower on novelty than crowdsourced human solutions, and few-shot learning reduces diversity further, suggesting the two pull against each other rather than coexisting (Why do LLMs excel at feasible design but struggle with novelty?).
This dissociation isn't unique to research ideation; it's a recurring shape across the collection. Models can explain a concept correctly, fail to apply it, and even recognize the failure, a triple pattern that points to functionally disconnected explanation and execution pathways rather than a missing fact (Can LLMs understand concepts they cannot apply?). The same split shows up quantitatively as a kind of computational split-brain: 87% accuracy explaining principles versus 64% applying them (Can language models understand without actually executing correctly?). Planning research lands in the same place: only 12% of GPT-4 plans are executable without errors, because the model has planning knowledge but can't assemble the reasoning that handles how subgoals and resources interact (Can large language models actually create executable plans?). These are catalogued together as structurally distinct epistemic failure modes, not random wrongness (How do LLMs fail to know what they seem to understand?).
The deeper why sits underneath all of this. LLMs produce fluent output through statistical token relationships that aren't grounded in any execution or shared context, which is why the same mechanism generates both the brilliant-sounding idea and the broken one, with no internal difference between them (Should we call LLM errors hallucinations or fabrications?). An idea that sounds novel and an idea that is executable come out of the same process, and that process has no stake in whether the thing runs. And there's a creativity-research angle worth knowing: combinational, exploratory, and transformational reasoning are distinct modes, and current methods mostly address conventional problem-solving, so what looks like a fresh idea may be unconstrained recombination that was never pressure-tested against the harder paradigms (Can LLMs reason creatively beyond conventional problem-solving?).
The thing you might not have expected to learn: the novelty isn't a sign of hidden competence waiting to be unlocked — it's a symptom of the missing constraint that would also have caught the flaws. The fix isn't a smarter generator; it's reconnecting generation to a real evaluative or executable check.
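One way to picture "reconnecting generation to a check" is a loop in which every generated idea must pass a test that lives outside the generator. The sketch below is illustrative only and is not drawn from any of the cited studies: `Idea`, `Verdict`, `filter_by_execution`, and `toy_check` are hypothetical names, and the toy check (does the idea list concrete steps, and is one of them an evaluation step?) is a stand-in assumption for whatever real check applies, such as a pilot run, a test suite, or an expert review.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Idea:
    """A candidate idea plus the concrete steps needed to try it (hypothetical type)."""
    title: str
    steps: List[str]


@dataclass
class Verdict:
    """The outcome of running one idea through the external check."""
    idea: Idea
    passed: bool
    reason: str


def filter_by_execution(
    ideas: Iterable[Idea],
    check: Callable[[Idea], Tuple[bool, str]],
) -> List[Verdict]:
    """Run every idea through an external check; the generator gets no vote."""
    verdicts = []
    for idea in ideas:
        passed, reason = check(idea)
        verdicts.append(Verdict(idea, passed, reason))
    return verdicts


def toy_check(idea: Idea) -> Tuple[bool, str]:
    """Assumed placeholder check: requires concrete steps, one of them an evaluation."""
    if not idea.steps:
        return False, "no concrete steps"
    if not any("evaluate" in step.lower() for step in idea.steps):
        return False, "no evaluation step"
    return True, "ok"


# Two made-up ideas: one spelled out, one that only sounds good.
ideas = [
    Idea("Idea A", ["collect data", "train model", "evaluate on held-out set"]),
    Idea("Idea B", []),
]
verdicts = filter_by_execution(ideas, toy_check)
for v in verdicts:
    print(v.idea.title, v.passed, v.reason)
```

The design point is only that the check is a separate function passed in from outside, so an idea's fluency cannot influence whether it passes. What the check should actually be is left open here.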
Sources (10 notes)
When 43 expert researchers implemented randomly-assigned ideas over 100+ hours, LLM-generated ideas declined significantly more than human ideas across all metrics. Execution revealed systematic weaknesses invisible at ideation, including impractical evaluation designs and missing technical groundwork.
A statistically significant study of 100+ NLP researchers found LLM-generated ideas rated as more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.
LLMs produce more novel research ideas than experts because they lack disciplinary constraints, but they systematically avoid evaluative stance-taking required to assess feasibility or validity. Generation and evaluation are dissociated capabilities.
Expert evaluation shows LLM-generated conceptual designs score higher on feasibility and usefulness but lower on novelty compared to crowdsourced human solutions. Few-shot learning further reduces diversity while improving quality alignment.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.
Only 12% of GPT-4 generated plans are actually executable without errors. LLMs excel at acquiring planning knowledge but fail at the reasoning assembly required to handle subgoal and resource interactions.
LLMs show repeatable, empirically documented failure modes—from Potemkin understanding (correct explanation + failed application) to reasoning collapse under implicit constraints. These failures reveal gaps between statistical pattern-tracking and actual epistemic competence.
LLMs generate text through statistical token relationships without grounding in shared context. Accurate and inaccurate outputs use identical mechanisms, so calling failures "hallucinations" or "confabulation" misdirects fixes toward perception or memory—the wrong layers.
Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.