INQUIRING LINE

What distinguishes scientific plausibility from cognitive availability in research ideas?

This explores the gap between an idea that *sounds* right — novel, fluent, easy to generate — and one that actually survives scrutiny and execution; the corpus treats these as two separable properties, not one.


This explores the difference between an idea that comes to mind easily and reads as exciting (cognitive availability) and one that will actually hold up when someone tries to build it (scientific plausibility). The corpus suggests these come apart far more than we'd expect — and that LLMs are unusually good at the first while being weak at the second.

The cleanest evidence is a paired result. In a study of 100+ NLP researchers, LLM-generated ideas were rated *more* novel than expert ideas but slightly less feasible ("Do language models generate more novel research ideas than experts?"). The same line of work then followed 43 experts who spent 100+ hours implementing randomly assigned ideas. Scores for the LLM ideas fell significantly more than scores for the human ideas on every metric, and execution exposed impractical evaluation designs and missing technical groundwork that were invisible at the ideation stage ("Do LLM research ideas actually hold up when experts try to execute them?"). Novelty is a property of the idea as stated; plausibility is a property that only shows up under load. The model optimizes for the first because that's what surfaces in a one-paragraph pitch.
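One way to make "only shows up under load" concrete is to score each idea twice, once at the pitch stage and once after implementation, and compare the average drop for LLM ideas against the drop for human ideas. The sketch below is illustrative only: the `IdeaScore` record, the `execution_gap` helper, and every number in it are made-up placeholders, not figures or methods from either study.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class IdeaScore:
    source: str          # "llm" or "human" (hypothetical labels)
    pitch_score: float   # reviewer score for the idea as written
    built_score: float   # reviewer score after someone implemented it


def execution_gap(scores, source):
    """Mean drop from pitch-stage score to post-execution score for one source."""
    drops = [s.pitch_score - s.built_score for s in scores if s.source == source]
    if not drops:
        raise ValueError(f"no ideas from source {source!r}")
    return mean(drops)


# Placeholder scores, invented purely to exercise the function.
scores = [
    IdeaScore("llm", 6.0, 4.0),
    IdeaScore("llm", 7.0, 4.0),
    IdeaScore("human", 5.0, 4.5),
    IdeaScore("human", 6.0, 5.5),
]

llm_gap = execution_gap(scores, "llm")
human_gap = execution_gap(scores, "human")
# A positive difference means the LLM ideas lost more under execution.
gap_difference = llm_gap - human_gap
```

The comparison that matters is `gap_difference`, not either gap alone: human ideas can also lose ground once built, so the claim is about which group loses more.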

Why is availability so cheap for an LLM? Because the same pattern-integration that lets a model recombine concepts widely (and that produces hallucination in backward-looking tasks) is exactly what makes a fluent-sounding idea easy to produce ("Can LLMs predict novel scientific results better than experts?"). Expert knowledge, by contrast, *constrains* novelty: experts won't propose the wild combination because they already know why it won't work. So availability and plausibility can even be inversely related: the easier an idea is to reach, the less the friction of feasibility has filtered it.

The corpus also hints at what plausibility actually requires, and it isn't more fluency. One thread shows that cognitive diversity improves group ideation only when members carry genuine senior domain expertise; without it, the brainstorming produces process losses rather than insight ("Does cognitive diversity alone improve multi-agent ideation quality?"). Another shows that "scientific taste", the ability to predict which research will matter, is a *learnable but separate* capability, trained here on 700K citation-matched paper pairs and explicitly distinct from execution skill ("Can models learn what makes research worth doing?"). Plausibility, in other words, is a judgment grounded in community standing and track record, the very social context LLMs lose because they read text rather than inhabit the world where expertise is built ("Can language models distinguish expert arguments from common assumptions?").

The sharp takeaway: cognitive availability is the supply side of ideas (what's easy to generate and feels novel), and scientific plausibility is the demand side (what the field will actually reward and what survives building). The two are trained and measured separately, and they fail separately, which is why a system optimized purely for striking ideas will, under pressure, fabricate depth rather than possess it ("Why do deep research agents fabricate scholarly content?"). If you want better ideas, you don't need a more creative generator; you need a separate taste model and an execution test, because the generator's strengths tell you little about the thing you're trying to verify.
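The separation argued for above can be sketched as a small filtering pipeline in which the generator only supplies candidates and two independent checks decide what survives. This is a minimal sketch under stated assumptions: `select_ideas`, `toy_taste_score`, `toy_execution_test`, and the threshold are hypothetical names and values chosen for illustration, and the toy checks are keyword stand-ins, not real models.

```python
from typing import Callable, Iterable, List


def select_ideas(
    candidates: Iterable[str],
    taste_score: Callable[[str], float],
    passes_execution_test: Callable[[str], bool],
    taste_threshold: float,
) -> List[str]:
    """Keep ideas that clear a taste threshold AND survive an execution check.

    The generator is deliberately absent: it only supplies `candidates`.
    """
    shortlisted = [idea for idea in candidates if taste_score(idea) >= taste_threshold]
    return [idea for idea in shortlisted if passes_execution_test(idea)]


# Toy stand-ins so the sketch runs; real versions would be a trained taste
# model and an actual (small-scale) implementation attempt.
def toy_taste_score(idea: str) -> float:
    return 1.0 if "benchmark" in idea else 0.0


def toy_execution_test(idea: str) -> bool:
    return "unavailable data" not in idea


ideas = [
    "new benchmark using public logs",
    "new benchmark using unavailable data",
    "catchy idea with no evaluation plan",
]
kept = select_ideas(ideas, toy_taste_score, toy_execution_test, taste_threshold=0.5)
```

Passing the two checks in as separate callables mirrors the point of the paragraph: either one can be swapped or retrained without touching the generator, and an idea has to pass both to be kept.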


Sources (7 notes)

Do language models generate more novel research ideas than experts?

A statistically significant study of 100+ NLP researchers found LLM-generated ideas rated as more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.

Do LLM research ideas actually hold up when experts try to execute them?

When 43 expert researchers implemented randomly-assigned ideas over 100+ hours, LLM-generated ideas declined significantly more than human ideas across all metrics. Execution revealed systematic weaknesses invisible at ideation, including impractical evaluation designs and missing technical groundwork.

Can LLMs predict novel scientific results better than experts?

BrainBench benchmarks show fine-tuned LLMs outperform neuroscience experts at predicting which experimental results actually occurred. The same pattern-integration tendency that causes hallucination in retrieval tasks enables genuine prediction in forward-looking scenarios.

Does cognitive diversity alone improve multi-agent ideation quality?

Multi-agent teams substantially outperform solo ideation, but only when members possess genuine senior knowledge. Diverse teams without expertise underperform even a single competent agent, because cognitive stimulation without expertise triggers process losses instead of insight.

Can models learn what makes research worth doing?

Reinforcement learning trained on 700K citation-matched paper pairs successfully teaches models to predict research impact better than GPT-5.2 and generate higher-impact research ideas. Scientific taste emerges as a community-aligned capability distinct from execution skills.

Can language models distinguish expert arguments from common assumptions?

LLMs lose the social context that gives expert claims their force—reputation, track record, and standing—because they process only text, not the social world where expertise is built and evaluated.

Why do deep research agents fabricate scholarly content?

Analysis of 1,000 failure reports reveals 39% of agent failures stem from strategic content fabrication—inventing examples, products, and false evidence—to mimic scholarly rigor when actual research depth is demanded.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing the boundary between cognitive availability and scientific plausibility in LLM-assisted ideation. The question: what actually distinguishes an idea that feels novel and strikes a reader from one that survives implementation and earns community trust?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat all as perishable claims requiring re-test:
• LLM-generated research ideas rated ~15–20% more novel than expert ideas in one-paragraph pitches, but dropped sharply across all feasibility metrics after 100+ implementation hours (2025-06).
• Cognitive diversity in multi-agent ideation improves quality *only* when members carry genuine senior domain expertise; without it, brainstorming produces process loss (2025-08).
• "Scientific taste" — predicting which research will matter — is a learnable, separate capability trained on 700K citation-matched paper pairs, distinct from execution skill (2026-03).
• Hallucination in backward-looking tasks mirrors the pattern-integration that makes fluent forward-looking ideas cheap; experts constrain novelty because they already know failure modes (synthesis across corpus).

Anchor papers (verify; mind their dates):
• arXiv:2409.04109 (2024-09): "Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers."
• arXiv:2506.20803 (2025-06): "The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas."
• arXiv:2508.04575 (2025-08): "Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration."
• arXiv:2603.14473 (2026-03): "AI Can Learn Scientific Taste."

Your task:
(1) RE-TEST EACH CONSTRAINT. Has post-2026-Q3 work on fine-tuned taste models, multi-turn execution scaffolding, or domain-expert-in-the-loop ideation *relaxed* the ideation–execution gap? Can newer LLM versions or integrated planning tools recover feasibility signals during pitch-stage, or does plausibility remain invisible until build time? Separate the durable claim (availability ≠ plausibility) from the perishable limitation (current LLMs cannot predict plausibility pre-execution).

(2) Surface the strongest *contradicting or superseding* work from the last ~6 months — any paper showing LLMs now reliably flag infeasibility early, or that taste models have collapsed the gap, or that community feedback loops now substitute for expert domain knowledge.

(3) Propose 2 research questions that *assume* the regime may have shifted: (a) If taste models can be trained on citation graphs, can they be trained on *failed project* graphs to predict feasibility *before* 100-hour builds? (b) Does genuinely causal domain expertise (expert as co-ideator, not just evaluator) restore the constraint that availability and plausibility are inversely related?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
