INQUIRING LINE

Do LLMs generate more novel ideas than they can evaluate?

This explores whether LLMs are better at producing novel ideas than at judging which of those ideas are any good — i.e. whether generation and evaluation are separate, unevenly-developed skills.


The corpus answers with an unusually clear yes. The core finding is that generation and evaluation are dissociated capabilities: models combine concepts freely to produce ideas that experts rate as genuinely novel, but they can't reliably assess whether those ideas are feasible or valid (Can LLMs generate more novel ideas than human experts?). A controlled study of 100+ NLP researchers found LLM ideas rated statistically *more* novel than expert ideas (p<0.05), though slightly less feasible (Do language models generate more novel research ideas than experts?). The novelty seems to come precisely *because* the model isn't constrained by disciplinary knowledge of what won't work.

The evaluation half of the gap is where it gets interesting. When LLMs grade their own output, automated evaluation overestimates quality by about 60%, and once ideas are actually executed they drop significantly on every metric (Why do LLMs generate more novel research ideas than experts?). A separate execution study drove this home: 43 researchers spent 100+ hours implementing randomly assigned ideas, and the LLM-generated ones declined significantly more than the human ones, exposing impractical evaluation designs and missing technical groundwork that were invisible at the ideation stage (Do LLM research ideas actually hold up when experts try to execute them?). So the novelty is real, but it floats free of the judgment needed to redeem it.

What you might not expect is *why* the evaluation muscle is so weak. It's the same disconnect that shows up elsewhere as 'Potemkin understanding': models can explain a concept correctly, fail to apply it, and even recognize the failure, a pattern that suggests explanation and execution run on functionally separate pathways (Can LLMs understand concepts they cannot apply?). Evaluation is closer to application than to generation, so the same architecture that generates fluently can't reliably self-assess. And when AI is used to judge AI, the problem compounds: LLM judges pick LLM-written arguments as winners 62% of the time versus humans' 39%, even controlling for quality, which quietly corrupts any pipeline that uses a model to filter its own ideas (Do LLM judges systematically favor LLM-generated arguments?).

There's also a hidden ceiling on the generation side worth knowing about. Individually novel ideas turn out to cluster into narrow regions, a pattern called 'diversity collapse', so the apparent flood of novelty actually explores a smaller possibility space than human ideation spread across many conceptual territories (Why do LLMs generate novel ideas from narrow ranges?). One possible explanation: existing LLM reasoning methods address only conventional problem-solving and ignore the distinct combinational, exploratory, and transformational modes that creative reasoning actually requires (Can LLMs reason creatively beyond conventional problem-solving?). The flip side appears in design tasks, where LLMs score *higher* on feasibility and usefulness but lower on novelty than humans (Why do LLMs excel at feasible design but struggle with novelty?), a useful reminder that the novelty-over-evaluation gap depends on the domain and the prompting.
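The difference between ideas that are individually novel and a set of ideas that is collectively diverse can be made concrete with a small Python sketch. Everything here is an assumption made for illustration: the notes do not say how novelty or diversity were measured, and the token-overlap distance, the function names, and the toy idea strings are all hypothetical stand-ins for expert ratings or embedding-based measures.

```python
from itertools import combinations
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Crude bag of content words (assumption: words longer than 3 characters)."""
    return {w.strip(".,;:()").lower() for w in text.split() if len(w) > 3}


def jaccard_distance(a: str, b: str) -> float:
    """1 minus the Jaccard similarity of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def mean_novelty(ideas: List[str], prior_work: List[str]) -> float:
    """Average distance from each idea to its nearest prior-work item."""
    nearest = [min(jaccard_distance(i, p) for p in prior_work) for i in ideas]
    return sum(nearest) / len(nearest)


def diversity(ideas: List[str]) -> float:
    """Average pairwise distance among the ideas themselves."""
    pairs = list(combinations(ideas, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    prior = ["gradient clipping stabilizes training"]
    clustered = [
        "quantum annealing schedules curriculum order",
        "quantum annealing schedules curriculum batches",
        "quantum annealing schedules curriculum sampling",
    ]
    # Each idea is far from prior work, yet the set barely differs internally.
    print(mean_novelty(clustered, prior), diversity(clustered))
```

Under this toy measure, a set can score maximally on per-idea novelty while scoring low on diversity, which is the shape of the pattern described above.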

The practical upshot: the bottleneck isn't generating ideas, it's filtering them, and you can't trust the model to do its own filtering. The one hopeful thread is that structure helps. A three-stage pipeline that extracts claims, retrieves related work, and compares reached ~86% reasoning alignment with human reviewers on novelty assessment, outperforming baselines that ask a model for a holistic verdict (Can structured pipelines make LLM novelty assessment reliable?). So the evaluation gap isn't a hard wall, but closing it takes scaffolding the model into doing the comparison it won't do on its own.
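As a rough illustration of the three-stage shape described above (extract claims, retrieve related work, compare), here is a minimal Python sketch. It is not the pipeline from the cited work: every function name, the sentence-splitting claim extractor, the token-overlap retrieval, and the 0.5 threshold are assumptions made for illustration, standing in for whatever LLM and retrieval components a real system would use.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class NoveltyReport:
    claim: str
    closest_prior: Optional[str]
    overlap: float
    verdict: str


def _tokens(text: str) -> Set[str]:
    # Crude content-word set (assumption: words longer than 3 characters).
    return {w.strip(".,;:()").lower() for w in text.split() if len(w) > 3}


def extract_claims(submission: str) -> List[str]:
    """Stage 1 (stand-in): treat each sentence as one claim."""
    return [s.strip() for s in submission.split(".") if s.strip()]


def retrieve_related(claim: str, corpus: List[str], k: int = 3) -> List[Tuple[str, float]]:
    """Stage 2 (stand-in): rank prior work by Jaccard overlap with the claim."""
    claim_tokens = _tokens(claim)
    scored = []
    for doc in corpus:
        union = claim_tokens | _tokens(doc)
        score = len(claim_tokens & _tokens(doc)) / len(union) if union else 0.0
        scored.append((doc, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def compare(claim: str, related: List[Tuple[str, float]], threshold: float = 0.5) -> NoveltyReport:
    """Stage 3 (stand-in): judge the claim against its closest retrieved item."""
    closest, overlap = related[0] if related else (None, 0.0)
    verdict = "overlaps prior work" if overlap >= threshold else "appears novel"
    return NoveltyReport(claim, closest, overlap, verdict)


def assess_novelty(submission: str, corpus: List[str]) -> List[NoveltyReport]:
    """Run the three stages per claim instead of asking for one holistic verdict."""
    return [compare(c, retrieve_related(c, corpus)) for c in extract_claims(submission)]
```

The design point is that each verdict is tied to a specific claim and a specific retrieved comparison, so a human can check the reasoning behind it rather than only the conclusion.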


Sources (10 notes)

Can LLMs generate more novel ideas than human experts?

LLMs produce more novel research ideas than experts because they lack disciplinary constraints, but they systematically avoid evaluative stance-taking required to assess feasibility or validity. Generation and evaluation are dissociated capabilities.

Do language models generate more novel research ideas than experts?

A statistically significant study of 100+ NLP researchers found LLM-generated ideas rated as more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.

Why do LLMs generate more novel research ideas than experts?

Research shows LLM-generated ideas are statistically more novel than expert-produced ideas, but LLMs struggle to evaluate quality—automated evaluation overestimates by 60%. When executed, LLM ideas drop significantly on all metrics, suggesting novelty without feasibility.

Do LLM research ideas actually hold up when experts try to execute them?

When 43 expert researchers implemented randomly-assigned ideas over 100+ hours, LLM-generated ideas declined significantly more than human ideas across all metrics. Execution revealed systematic weaknesses invisible at ideation, including impractical evaluation designs and missing technical groundwork.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Do LLM judges systematically favor LLM-generated arguments?

LLM judges picked LLM arguments as winners 62% of the time versus humans' 39%, even when controlling for quality. This bias operates downstream of component-level scoring and corrupts any evaluation pipeline that uses AI to judge AI output.

Why do LLMs generate novel ideas from narrow ranges?

LLM-generated research ideas are rated individually novel but lack diversity, clustering in narrow generative regions. Combined with LLM self-evaluation failures, this limits the possibility space explored compared to human ideation across different conceptual territories.

Can LLMs reason creatively beyond conventional problem-solving?

Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.

Why do LLMs excel at feasible design but struggle with novelty?

Expert evaluation shows LLM-generated conceptual designs score higher on feasibility and usefulness but lower on novelty compared to crowdsourced human solutions. Few-shot learning further reduces diversity while improving quality alignment.

Can structured pipelines make LLM novelty assessment reliable?

A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about LLM ideation vs. evaluation from a curated library (2023–2026). The question remains open: do LLMs generate more novel ideas than they can evaluate?

What a curated library found — and when (dated claims, not current truth):
• LLM-generated research ideas rated statistically *more* novel than expert ideas (p<0.05), but slightly less feasible (~2024).
• Automated self-evaluation overestimates quality by ~60%; executed LLM ideas degrade sharply vs. human ideas (~2025).
• LLM judges prefer LLM-written arguments 62% of the time vs. humans' 39%, corrupting AI-filtered pipelines (~2024).
• 'Diversity collapse' narrows apparent novelty into clustered regions despite high average novelty (~2025).
• Structured three-stage evaluation (extract claims, retrieve work, compare) reached ~86% human-reviewer alignment; holistic verdicts fail (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2409.04109 (2024-09) — 100+ NLP researchers, novelty vs. feasibility gap
• arXiv:2506.20803 (2025-06) — ideation-execution gap; 43 researchers, 100+ hours
• arXiv:2402.10669 (2024-02) — LLM judge bias
• arXiv:2504.12320 (2025-04) — creativity plateau analysis

Your task:
(1) RE-TEST EACH CONSTRAINT. For novelty generation: has newer reasoning-chain, multi-turn, or agent-based prompting since relaxed diversity collapse or enriched conceptual exploration? For evaluation: have recent scaling, fine-tuning for critique, or retrieval-augmented designs (e.g., 2411.16116, 2508.10795) systematically closed the 60% overestimation gap or the 62% preference bias? Separate the durable question (generation–evaluation dissociation likely persists) from perishable limits (structured scaffolding, reasoning chains, agentic loops may have already raised feasibility judgment).
(2) Surface the strongest work from late 2025–early 2026 that contradicts or supersedes the "generation > evaluation" framing — especially 2602.06176 on reasoning failures, 2505.20296 on solution exploration, and 2511.20471 on creative reasoning paradigms.
(3) Propose 2 research questions that assume the regime may have shifted: (a) given agentic iteration and retrieval, does structured evaluation now match generation *speed* if not *calibration*? (b) does domain specificity (design vs. research ideas) suggest evaluation gaps are task-dependent, not architectural?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
