INQUIRING LINE

Why do LLMs generate novel ideas but struggle to evaluate them?

This explores why LLMs are good at *producing* novel ideas but bad at *judging* whether those ideas are any good — and what the corpus says about treating generation and evaluation as separate capabilities rather than one skill.


The corpus's sharpest answer is that these aren't two ends of one skill; they're dissociated capabilities. LLMs generate novelty precisely *because* they lack the disciplinary constraints an expert carries, so they combine concepts freely and roam wider conceptual territory than humans do [Can LLMs generate more novel ideas than human experts?]. The same studies that confirm this novelty (LLM ideas rated significantly more novel than expert ideas, p<0.05) show the cost: the model has no internal sense of feasibility, and it systematically avoids taking the evaluative stance that judging an idea requires [Do language models generate more novel research ideas than experts?]. Generation rewards unconstrained combination; evaluation demands exactly the constraints generation discarded.

The gap stays invisible until someone tries to *act* on the ideas. When 43 expert researchers spent 100+ hours actually implementing LLM-generated ideas, those ideas dropped far more than human ones on every metric. Execution exposed impractical evaluation designs and missing technical groundwork that no one could see at the ideation stage [Do LLM research ideas actually hold up when experts try to execute them?]. And LLMs can't self-rescue here: their own automated evaluation overestimates idea quality by roughly 60%, so the system that produced the novelty is a poor judge of it [Why do LLMs generate more novel research ideas than experts?].

What you might not expect is that this is a specific case of a broader split running through these models. "Potemkin understanding" is the same fracture at the level of concepts: a model explains an idea correctly, fails to apply it, and even recognizes its own failure. That triple pattern points to functionally disconnected explanation and execution pathways rather than a simple knowledge gap [Can LLMs understand concepts they cannot apply?]. Evaluating an idea is an *application* of judgment, not a recitation of it, so it lands on the weak side of that divide. This sits inside a documented family of epistemic failure modes where statistical pattern-tracking diverges from actual competence [How do LLMs fail to know what they seem to understand?].

There's also a quieter twist hiding in the word "novel." Individually novel ideas turn out to cluster: LLM ideation collapses into narrow generative regions even while each idea scores high on novelty [Why do LLMs generate novel ideas from narrow ranges?]. One reason may be that genuine creative evaluation needs reasoning modes (combinational, exploratory, and transformational) that current methods simply don't implement; they only handle conventional problem-solving [Can LLMs reason creatively beyond conventional problem-solving?]. Evaluation isn't passive scoring; it requires searching a possibility space, and LLMs wander that space unsystematically rather than searching it [Why do reasoning LLMs fail at deeper problem solving?].
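The source note for that last citation describes success probability dropping exponentially with problem depth. A toy model makes the shape concrete. It assumes every step of a solution succeeds independently with the same probability, which is an illustrative simplification rather than the cited paper's own formulation; the function name `success_probability` and the 0.9 per-step figure are hypothetical.

```python
def success_probability(step_success: float, depth: int) -> float:
    """Chance of getting `depth` steps right in a row.

    Assumption (ours, for illustration): each step succeeds independently
    with probability `step_success`, so the chances multiply.
    """
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return step_success ** depth


if __name__ == "__main__":
    # Hypothetical 90% per-step reliability across increasing depths.
    for depth in (1, 5, 10, 20, 40):
        print(f"depth {depth:>2}: {success_probability(0.9, depth):.3f}")
```

Under this assumption a solver that is right 90% of the time per step completes a 10-step problem about a third of the time, and the odds keep shrinking geometrically from there, which matches the "medium problems solvable, deep problems much harder" pattern in the note.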

The hopeful note: the weakness seems to be in *holistic* judgment, not judgment as such. When evaluation is decomposed into explicit steps (extract the claims, retrieve related work, then compare), LLM novelty assessment reaches 86.5% reasoning alignment with human reviewers, outperforming baselines that ask the model to judge an idea whole [Can structured pipelines make LLM novelty assessment reliable?]. That mirrors a finding from a very different setting: LLMs fail at exploration until you hand them external memory and explicit prompts to structure the task [Why do LLMs struggle with exploration in simple decision tasks?]. The pattern across the corpus is consistent: the evaluation capability isn't absent, it just doesn't fire on its own. Scaffold the steps externally and much of the gap closes.
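The three-step decomposition described above (extract claims, retrieve related work, compare) can be sketched as a small pipeline. This is a minimal illustration of the control flow only, not the cited system: word overlap stands in for the LLM and retrieval components, and every name here (`extract_claims`, `retrieve_related`, `compare`, `NoveltyReport`, the 0.5 threshold) is a hypothetical choice made for the example.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class NoveltyReport:
    claim: str
    closest_work: str
    overlap: float  # Jaccard overlap of word sets, 0.0 to 1.0
    looks_novel: bool


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for semantic matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_claims(idea: str) -> list[str]:
    """Step 1: split the idea into individual claims (here: sentences)."""
    return [s.strip() for s in re.split(r"[.!?]+", idea) if s.strip()]


def retrieve_related(claim: str, corpus: list[str]) -> tuple[str, float]:
    """Step 2: find the prior work with the highest word overlap.

    `corpus` must be non-empty.
    """
    claim_tokens = tokens(claim)

    def jaccard(doc: str) -> float:
        doc_tokens = tokens(doc)
        union = claim_tokens | doc_tokens
        return len(claim_tokens & doc_tokens) / len(union) if union else 0.0

    best = max(corpus, key=jaccard)
    return best, jaccard(best)


def compare(idea: str, corpus: list[str], threshold: float = 0.5) -> list[NoveltyReport]:
    """Step 3: judge each claim against its closest prior work."""
    reports = []
    for claim in extract_claims(idea):
        closest, overlap = retrieve_related(claim, corpus)
        reports.append(NoveltyReport(claim, closest, overlap, overlap < threshold))
    return reports
```

The point of the structure is that each claim gets its own explicit comparison against specific prior work, so the final judgment is assembled from inspectable pieces instead of being produced in one holistic pass.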


Sources (11 notes)

Can LLMs generate more novel ideas than human experts?

LLMs produce more novel research ideas than experts because they lack disciplinary constraints, but they systematically avoid evaluative stance-taking required to assess feasibility or validity. Generation and evaluation are dissociated capabilities.

Do language models generate more novel research ideas than experts?

A study of 100+ NLP researchers found LLM-generated ideas rated as significantly more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.

Do LLM research ideas actually hold up when experts try to execute them?

When 43 expert researchers implemented randomly-assigned ideas over 100+ hours, LLM-generated ideas declined significantly more than human ideas across all metrics. Execution revealed systematic weaknesses invisible at ideation, including impractical evaluation designs and missing technical groundwork.

Why do LLMs generate more novel research ideas than experts?

Research shows LLM-generated ideas are rated significantly more novel than expert-produced ideas, but LLMs struggle to evaluate quality: automated evaluation overestimates idea quality by 60%. When executed, LLM ideas drop significantly on all metrics, suggesting novelty without feasibility.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

How do LLMs fail to know what they seem to understand?

LLMs show repeatable, empirically documented failure modes—from Potemkin understanding (correct explanation + failed application) to reasoning collapse under implicit constraints. These failures reveal gaps between statistical pattern-tracking and actual epistemic competence.

Why do LLMs generate novel ideas from narrow ranges?

LLM-generated research ideas are rated individually novel but lack diversity, clustering in narrow generative regions. Combined with LLM self-evaluation failures, this limits the possibility space explored compared to human ideation across different conceptual territories.

Can LLMs reason creatively beyond conventional problem-solving?

Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Can structured pipelines make LLM novelty assessment reliable?

A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.

Why do LLMs struggle with exploration in simple decision tasks?

Across multi-armed bandit environments, only GPT-4 with explicit exploratory hints, external history summarization, and chain-of-thought reasoning achieves satisfactory exploration. Without external summarization, models cannot reliably track and aggregate unstructured interaction history to guide exploratory decisions.
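The "external history summarization" in the last note can be pictured as a small preprocessing step that runs outside the model: raw interaction history is aggregated into per-arm statistics before anything is placed in the prompt. The sketch below is an assumption about what such a step might look like, not the cited study's code; `ArmStats`, `summarize_history`, and `render_summary` are hypothetical names.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ArmStats:
    pulls: int
    mean_reward: float


def summarize_history(history: list[tuple[str, float]]) -> dict[str, ArmStats]:
    """Aggregate raw (arm, reward) pairs into per-arm counts and mean rewards."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for arm, reward in history:
        totals[arm] += reward
        counts[arm] += 1
    return {arm: ArmStats(counts[arm], totals[arm] / counts[arm]) for arm in counts}


def render_summary(summary: dict[str, ArmStats], all_arms: list[str]) -> str:
    """Format the summary as prompt text, flagging arms that were never tried."""
    lines = []
    for arm in all_arms:
        stats = summary.get(arm)
        if stats is None:
            lines.append(f"{arm}: never tried")
        else:
            lines.append(f"{arm}: {stats.pulls} pulls, mean reward {stats.mean_reward:.2f}")
    return "\n".join(lines)
```

Handing the model this compact table, in place of an unstructured log of every past pull, removes the tracking and aggregation burden that the note identifies as the failure point; untried arms are listed explicitly so that the option to explore stays visible.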

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining whether LLMs' dissociation between idea generation and evaluation still holds, or whether newer architectures, training regimes, or scaffolding have begun to bridge it.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. A curated library reports:
• Generation and evaluation are dissociated capabilities: novelty generation (p<0.05 above expert ideas) coexists with 60% overestimation of quality in self-evaluation (~2024–2025).
• Ideation-execution gap is severe: LLM ideas dropped more than human ideas on all implementation metrics across 100+ hours of expert work (~2025).
• "Potemkin understanding" — correct explanation + failed application — reveals disconnected explanation and execution pathways, not simple knowledge gaps (~2025).
• Structured decomposition (extract claims → retrieve related work → compare) recovers 86% alignment with human novelty judgment; unscaffolded holistic judgment fails (~2025).
• Diversity collapse in generation: individual ideas score high on novelty but cluster into narrow regions (~2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2409.04109 (2024-09): 100+ NLP researchers; novelty confirmed, evaluation gap exposed.
• arXiv:2506.20803 (2025-06): Ideation-execution gap quantified across real implementation.
• arXiv:2505.20296 (2025-05): Reasoning LLMs as wandering (not systematic) explorers.
• arXiv:2501.11721 (2025-01): Explain-Query-Test self-evaluation framework.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 60% self-evaluation overestimation, 86% structured-alignment recovery, and diversity collapse: has model scaling, constitutional AI, process supervision, or online RL (e.g., reasoning models trained on evaluation feedback) since relaxed these? Separately identify which findings are perishable limitations vs. durable structural gaps.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Have any papers shown that end-to-end finetuning on evaluation tasks, or multi-agent critic architectures, collapse the generation-evaluation gap entirely?
(3) Propose 2 research questions that assume the regime may have shifted: (a) Can a single model jointly optimize for novelty + feasibility without degrading either? (b) Does training on explicit failures (rejected ideas + their reasoning) narrow the diversity-collapse problem?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
