INQUIRING LINE

AI-generated ideas that experts rated more novel than human ones fell apart the moment someone actually tried to build them.

Can proxy evaluation of ideas accurately predict their quality without implementation?

This explores whether we can judge an idea's quality from the idea alone — proxy evaluation by humans or LLMs — or whether real quality only shows up once someone actually builds it.

This reads the question as: can a cheap stand-in for execution (an expert skim, an LLM judge, a novelty score) tell us whether an idea is actually good? The corpus's sharpest answer is a warning. When 43 expert researchers spent 100+ hours implementing randomly assigned research ideas, the AI-generated ideas that had scored as *more* novel than human ones at the proposal stage declined significantly more than the human ideas on every metric once executed. Execution surfaced impractical evaluation designs and missing technical groundwork that were simply invisible at the idea stage (see "Do LLM research ideas actually hold up when experts try to execute them?"). So the headline finding is that proxy evaluation can be confidently, systematically wrong, and wrong in the optimistic direction.

Why does the gap open? A recurring theme is that evaluators reward *form* over *substance*. Imitation-trained models fool human judges by adopting a confident, fluent style while closing no real capability gap (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). Logically invalid chain-of-thought exemplars perform nearly as well as valid ones on BIG-Bench Hard, because the model (and any evaluator watching it) latches onto the look of reasoning rather than genuine inference (see "Does logical validity actually drive chain-of-thought gains?"). Models fine-tuned on labeled quality examples learn surface patterns instead of principled criteria and fail to transfer to new argument types (see "Can models learn argument quality from labeled examples alone?"). The pattern across all three: cheap proxies measure surface signals that correlate with quality on familiar cases and break exactly where you most need them, on the genuinely novel.

But the corpus doesn't say proxy evaluation is hopeless. It says *holistic* proxy evaluation is the problem, and structure is the fix. Decompose the judgment and reliability returns. A three-stage novelty pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment with human reviewers on 182 ICLR submissions, beating holistic LLM baselines (see "Can structured pipelines make LLM novelty assessment reliable?"). Breaking instruction quality into verifiable sub-criteria via checklists reduces overfitting to superficial artifacts (see "Can breaking down instructions into checklists improve AI reward signals?"), and prompt quality itself turns out to have six measurable dimensions rather than one vibe (see "Can we measure prompt quality independent of model outputs?"). Agentic evaluation that actively collects evidence cut judge shift from 31% to 0.27%, a roughly 100x reduction relative to a single LLM judge, though its memory module cascaded errors, a reminder that the evaluator can introduce its own failure modes (see "Can agents evaluate AI outputs more reliably than language models?").

The deeper cross-cutting insight comes from the self-improvement work: there's a structural *generation–verification gap*. Pure self-improvement stalls because a system's ability to judge an idea is fundamentally weaker than its ability to produce one, and reliable methods only work by smuggling in external anchors: past versions, third-party judges, user corrections, tool feedback (see "Can models reliably improve themselves without external feedback?"). Implementation is the ultimate external anchor. This also reframes where ideation effort should go: multi-agent ideation only beats solo work when the agents carry real senior domain expertise; diversity without grounded knowledge underperforms a single competent agent (see "Does cognitive diversity alone improve multi-agent ideation quality?"). Expertise is, in effect, an internalized proxy for what execution would reveal.

What you didn't know you wanted to know: even simulated *humans* follow this reliability gradient. AI persona panels replicated 76% (84 of 111) of published experimental main effects, but their success correlated with the strength of the original p-value: they nailed the strong, obvious effects and turned unreliable exactly on the marginal ones (see "Can AI personas reliably replicate human experiment results?"). That's the through-line of the whole corpus: proxy evaluation is trustworthy in proportion to how obvious the answer already was. It predicts quality well where you needed it least, and fails where the idea is novel, marginal, or untested, which is precisely the territory where you were hoping the proxy would save you the cost of building.

Sources 11 notes

Do LLM research ideas actually hold up when experts try to execute them?

When 43 expert researchers implemented randomly-assigned ideas over 100+ hours, LLM-generated ideas declined significantly more than human ideas across all metrics. Execution revealed systematic weaknesses invisible at ideation, including impractical evaluation designs and missing technical groundwork.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Can structured pipelines make LLM novelty assessment reliable?

A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.
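
The three stages named above (extract claims, retrieve related work, compare) can be sketched as a small pipeline. This is only an illustration of the decomposition, not the method from the underlying paper: the function names, the sentence-splitting claim extractor, the word-overlap similarity, and the 0.5 threshold are all assumptions made here, standing in for whatever LLM calls and retrieval system a real pipeline would use.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class NoveltyReport:
    claim: str
    closest_prior: str
    overlap: float
    novel: bool


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a real text encoder."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return {w for w in words if w}


def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def extract_claims(submission: str) -> list[str]:
    """Stage 1 (toy version): treat each sentence as one claim."""
    return [s.strip() for s in submission.split(".") if s.strip()]


def retrieve_related(claim: str, corpus: list[str]) -> tuple[str, float]:
    """Stage 2 (toy version): most similar prior-work snippet and its overlap.

    Assumes a non-empty corpus.
    """
    best_score, best_doc = max((jaccard(tokens(claim), tokens(d)), d) for d in corpus)
    return best_doc, best_score


def compare(claim: str, prior: str, overlap: float, threshold: float) -> NoveltyReport:
    """Stage 3 (toy version): a claim counts as novel if overlap is below threshold."""
    return NoveltyReport(claim, prior, overlap, novel=overlap < threshold)


def assess_novelty(submission: str, corpus: list[str], threshold: float = 0.5) -> list[NoveltyReport]:
    """Run the three stages per claim instead of one holistic judgment."""
    reports = []
    for claim in extract_claims(submission):
        prior, overlap = retrieve_related(claim, corpus)
        reports.append(compare(claim, prior, overlap, threshold))
    return reports


corpus = ["We train a transformer on code", "Checklists improve reward models"]
submission = "We train a transformer on code. We grow crystals with lasers"
for report in assess_novelty(submission, corpus):
    print(report.novel, "|", report.claim, "| closest:", report.closest_prior)
```

The point of the structure is that each claim gets its own retrieved evidence and its own verdict, so a reviewer can inspect why something was called novel rather than trusting a single overall score.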

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.
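
A minimal sketch of the decomposition idea, assuming a hypothetical checklist of yes/no checks. It is not the RLCF or RaR implementation; the checklist items, the keyword-based checks, and the `checklist_score` helper are invented here purely to show how a fluent but empty answer fares once quality is split into verifiable sub-criteria.

```python
from typing import Callable

# Hypothetical checklist: each item pairs a description with a yes/no check.
# Real systems would use an LLM or a verifier program per item; plain
# substring tests are used here only to keep the sketch self-contained.
ChecklistItem = tuple[str, Callable[[str], bool]]


def checklist_score(response: str, checklist: list[ChecklistItem]) -> tuple[float, list[str]]:
    """Return the fraction of items passed and the descriptions of failed items."""
    failed = [desc for desc, check in checklist if not check(response)]
    passed = len(checklist) - len(failed)
    return passed / len(checklist), failed


checklist: list[ChecklistItem] = [
    ("mentions a dosage", lambda r: "mg" in r.lower()),
    ("advises seeing a doctor", lambda r: "doctor" in r.lower()),
    ("stays under 50 words", lambda r: len(r.split()) <= 50),
]

fluent_but_empty = "Great question! There are many excellent options worth considering."
score, failed = checklist_score(fluent_but_empty, checklist)
print(f"score={score:.2f}, failed={failed}")
```

Because every item is checked separately, confident style alone cannot raise the score, and the list of failed items says which criteria were missed.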

Can we measure prompt quality independent of model outputs?

Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.
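
As a quick check on the size of that gap, the two judge-shift figures quoted above can be compared directly. The variable names are illustrative; only the two percentages come from the note.

```python
agentic_shift_pct = 0.27    # judge shift for the eight-module agentic evaluator
llm_judge_shift_pct = 31.0  # judge shift for LLM-as-a-Judge

# How many times smaller the agentic evaluator's judge shift is.
reduction_factor = llm_judge_shift_pct / agentic_shift_pct
print(f"about {reduction_factor:.0f}x lower judge shift")
```

The ratio is a little over 100, which is where a "roughly 100x" summary of this result comes from.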

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Does cognitive diversity alone improve multi-agent ideation quality?

Multi-agent teams substantially outperform solo ideation, but only when members possess genuine senior knowledge. Diverse teams without expertise underperform even a single competent agent, because cognitive stimulation without expertise triggers process losses instead of insight.

Can AI personas reliably replicate human experiment results?

Viewpoints AI reproduced 84 of 111 main effects from Journal of Marketing experiments with replication success strongly correlated to original p-value strength. Marginal effects showed unreliable performance with both false positives and negatives.
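
The headline replication rate follows from the two counts in the note. A short check, with variable names chosen here for illustration:

```python
replicated_effects = 84   # main effects reproduced by the persona panels
total_effects = 111       # main effects tested

replication_rate = replicated_effects / total_effects
print(f"{replication_rate:.1%} of main effects replicated")
```

That works out to just under 76%. The aggregate hides the gradient the note describes, since the successes are concentrated among effects with strong original p-values.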

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether proxy evaluation (expert judgment, LLM scoring, novelty metrics) can predict idea quality without building it. A curated library of AI/LLM papers (2023–2025) found systematic failures; your job is to separate durable constraints from possibly-resolved ones.

What a curated library found — and when (findings span 2023–2025, but treat as dated claims):
• Proxy evaluators systematically reward *form* over substance: confident fluency, logically invalid reasoning, and surface patterns all score nearly as high as genuine quality, collapsing once ideas are actually implemented (2025).
• Holistic LLM judgment and human expert scoring fail exactly on novel ideas; a 43-expert implementation study found AI-generated proposals rated as *more* novel at the proposal stage dropped across every metric post-execution (2025).
• Structured decomposition works: three-stage novelty pipelines achieve 86% alignment with human ICLR reviewers; checklists and multi-dimensional scoring beat single holistic judgments (2025).
• A fundamental generation–verification gap: systems are weaker at judging ideas than at producing them; self-improvement without external anchors (implementation, third-party judges, user feedback) stalls (2025).
• Proxy evaluation predicts quality in inverse proportion to novelty: AI persona panels replicated 76% of published effects but succeeded only on strong, obvious results, failing on marginal ones (2024).

Anchor papers (verify; mind their dates):
• arXiv:2506.20803 (2025) — The Ideation-Execution Gap: direct evidence that proxy scoring diverges from execution outcomes.
• arXiv:2412.02674 (2025) — Mind the Gap: the generation–verification asymmetry and why external anchors are mandatory.
• arXiv:2507.18624 (2025) — Checklists Are Better Than Reward Models: structured decomposition outperforms holistic LLM judges.
• arXiv:2408.16073 (2024) — LLM Persona Simulations: the reliability gradient tied to effect strength.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above—especially the claim that proxy judges fail on novel ideas and that structured decomposition helps—determine whether newer models (o1, o3, Claude 4+), improved retrieval/RAG, better chain-of-thought, or multi-agent verifiers have relaxed the generation–verification gap or improved post-hoc judgment without implementation. Where has execution been replaced or approximated? Where does the constraint still hold? Cite what resolved it.
(2) **Surface the strongest contradicting or superseding work from the last 6 months.** Are there papers showing that iterative refinement, tool-use, or simulation can close the ideation–execution gap? Do any report successful proxy prediction on genuinely novel ideas?
(3) **Propose 2 research questions that assume the regime may have moved:** e.g., "Can agentic verifiers with access to external knowledge bases and simulation engines predict execution quality better than static LLM judges?" or "Does structured decomposition + iterative feedback from weak execution simulators close the gap?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
