INQUIRING LINE

Why do backward-looking benchmarks underestimate LLM scientific value?

This explores why evaluations built on already-known answers (backward-looking benchmarks) systematically miss the forward-looking value LLMs offer to science — and the corpus has a sharp answer that turns the usual 'hallucination' complaint on its head.


The cleanest statement of the problem comes from work showing that the behavior we call a failure in one frame is the capability in another. A model's tendency to integrate patterns and 'fill in' plausible content is scored as hallucination when the task is to recall an established fact, but that same tendency becomes genuine prediction when the task is to anticipate a result nobody has looked up yet (see "Can LLMs predict novel scientific results better than experts?"). On BrainBench, fine-tuned LLMs out-predicted human neuroscientists at guessing which experimental outcomes actually occurred. A benchmark that only rewards matching the known answer cannot see this: it measures memory of the past and calls deviation error, when deviation toward the not-yet-known is precisely the scientific value.

The underestimation runs deeper than one reframing, because benchmarks are also built to be clean, and science is not. Standard NLP benchmarks systematically drop the instances where human annotators disagree, which removes exactly the ambiguous, contested cases that scientific frontier work lives in (see "Do standard NLP benchmarks hide LLM ambiguity failures?"). That filtering cuts both ways: it hides failures (a 32% vs. 90% accuracy gap that surfaces only when ambiguous examples are tested), but it also means the benchmark never tests the model on the genuinely open questions where a useful conjecture matters more than a correct lookup. The evaluation is curated toward settled territory, so it can only certify settled-territory competence.
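A minimal sketch of the filtering effect, using invented toy items rather than data from any cited study. Each item records whether annotators agreed on a label and whether a hypothetical model answered correctly; the `Item` class and all counts are assumptions made for illustration. Dropping the disagreement items raises the reported accuracy without the model changing at all.

```python
from dataclasses import dataclass

@dataclass
class Item:
    annotators_agree: bool  # did human annotators converge on one label?
    model_correct: bool     # did the (hypothetical) model match the reference?

def accuracy(items):
    """Fraction of items the model got right; 0.0 for an empty list."""
    if not items:
        return 0.0
    return sum(i.model_correct for i in items) / len(items)

# Illustrative toy data, not drawn from any benchmark.
items = (
    [Item(True, True)] * 9 + [Item(True, False)] * 1      # clear cases: 9 of 10 right
    + [Item(False, True)] * 3 + [Item(False, False)] * 7  # contested cases: 3 of 10 right
)

filtered = [i for i in items if i.annotators_agree]       # what a cleaned benchmark keeps
contested = [i for i in items if not i.annotators_agree]  # what the cleaning discards

acc_filtered = accuracy(filtered)
acc_contested = accuracy(contested)
acc_all = accuracy(items)
```

In this toy setup the cleaned benchmark reports the accuracy on the clear cases only, so the weaker performance on contested items never enters the headline number.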

There's a second axis the backward-looking frame misses: time and horizon. Short, single-turn benchmarks simply don't predict how models behave over long, sustained scientific workflows. Models that rank similarly on one-shot tasks diverge dramatically over extended relays (see "Do short benchmarks predict how models perform over long workflows?"), and over those long chains errors can compound silently rather than plateau (see "Do frontier LLMs silently corrupt documents in long workflows?"). So the benchmark is doubly miscalibrated: it can under-rate the generative value on the upside and over-rate reliability on the downside, because both effects only show up in the messy, multi-step, ambiguous conditions the benchmark was designed to exclude.
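A back-of-the-envelope sketch of why compounding is invisible to a one-shot test. It assumes, as a simplification that the cited work does not claim, that every round-trip preserves the same constant fraction of a document; the function names are hypothetical. Under that assumption, the roughly 25% degradation over 50 round-trips reported in the sources corresponds to a very small loss per round.

```python
def retained_after(rounds: int, per_round_retention: float) -> float:
    """Fraction of content intact after `rounds` round-trips, assuming a
    constant, independent per-round retention rate (a simplification)."""
    return per_round_retention ** rounds

def per_round_retention_for(total_retained: float, rounds: int) -> float:
    """Constant per-round retention that would yield `total_retained`
    after `rounds` round-trips."""
    return total_retained ** (1.0 / rounds)

# Assumption: ~25% degradation over 50 round-trips means ~75% retained.
r = per_round_retention_for(0.75, 50)
one_round_loss = 1.0 - r             # what a single-turn test could observe
check = retained_after(50, r)        # recovers the 50-round figure
```

Under this constant-rate assumption the per-round loss is well under one percent, which is the kind of difference a single-turn benchmark would struggle to distinguish from noise, even though it accumulates into a large loss over the full relay.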

Worth knowing for the skeptic: 'forward-looking value' is not a blank check. The same corpus is blunt that LLMs cannot actually execute iterative procedures: they recognize a problem as template-similar and emit plausible-but-wrong numbers rather than computing (see "Do large language models actually perform iterative optimization?"). It is equally blunt that models can explain a concept correctly while failing to apply it, a disconnect with no human analogue (see "Can LLMs understand concepts they cannot apply?"). The honest reading is that the pattern-integration engine is genuinely valuable for prediction and conjecture, and genuinely untrustworthy for execution and proof. A good benchmark would have to separate those, and the cleanliness of backward-looking benchmarks collapses them.

The twist the reader may not have expected: the fix isn't 'better benchmarks' in the obvious sense, because some of the alternatives we'd reach for are themselves corruptible. LLM-as-judge evaluations can be swayed by fake citations and rich formatting independent of content quality (see "Can LLM judges be tricked without accessing their internals?"). Measuring scientific value forward means rewarding calibrated prediction under ambiguity and over long horizons, which is far harder to score than checking an answer key. That difficulty, not any lack of capability, is a large part of why the backward-looking number comes in low.
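One standard way to score calibrated prediction is the Brier score: the mean squared difference between a forecast probability and the binary outcome, where lower is better. The sketch below uses invented forecasts and is only an illustration of the idea; it is not the scoring method of any paper cited here, and it does not address the long-horizon or judge-corruption problems.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.
    Lower is better; always answering 0.5 scores 0.25."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Invented example: four experimental outcomes (1 = result occurred).
outcomes = [1, 0, 1, 1]
hedged = [0.8, 0.3, 0.7, 0.6]          # graded confidence per prediction
overconfident = [1.0, 1.0, 1.0, 1.0]   # asserts every result as certain

hedged_score = brier_score(hedged, outcomes)
overconfident_score = brier_score(overconfident, outcomes)
```

Unlike an answer key, this kind of score requires the outcomes to be unknown at prediction time and resolved afterward, which is part of why forward-looking evaluation is harder to run.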


Sources (7 notes)

Can LLMs predict novel scientific results better than experts?

BrainBench benchmarks show fine-tuned LLMs outperform neuroscience experts at predicting which experimental results actually occurred. The same pattern-integration tendency that causes hallucination in retrieval tasks enables genuine prediction in forward-looking scenarios.

Do standard NLP benchmarks hide LLM ambiguity failures?

By filtering out examples where annotators disagree, benchmarks remove test cases that would reveal LLM failures at ambiguity recognition. Research using ambiguous examples shows a 32% vs. 90% accuracy gap invisible to standard evaluation.

Do short benchmarks predict how models perform over long workflows?

DELEGATE-52 evaluated models across 50-round-trip relays and found short-interaction performance does not predict sustained delegation accuracy. Models ranking similarly on single-turn tasks diverged dramatically by relay 25, revealing degradation curves invisible to standard benchmarks.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can LLM judges be tricked without accessing their internals?

Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether backward-looking benchmarks still underestimate LLM scientific value — a claim made across 2023–2026 curated work.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. The core tension:
• Hallucination in recall tasks = generalization in prediction tasks; models out-predicted human neuroscientists on unknown outcomes (BrainBench, ~2024).
• Standard NLP benchmarks drop ambiguous disagreements, hiding both LLM failures AND frontier-relevant conjecture capability (~2024).
• Short single-turn benchmarks don't predict long-horizon workflow divergence; errors compound silently over extended chains (~2025).
• LLMs recognize problems as templates and emit plausible-wrong numbers rather than compute iterative procedures (~2026).
• LLM-as-judge evaluations are swayed by fake citations and formatting, independent of content quality (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2403.03230 (2024-03): Large language models surpass human experts in predicting neuroscience results
• arXiv:2509.09677 (2025-09): The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
• arXiv:2402.10669 (2024-02): Humans or LLMs as the Judge? A Study on Judgement Biases
• arXiv:2604.15597 (2026-04): LLMs Corrupt Your Documents When You Delegate

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, does newer tooling (longer-context, retrieval-augmented generation, multi-step verification, self-correction harnesses), training (reasoning-focused pretraining, RL on execution), or evaluation methodology (forward-looking benchmark design, constraint satisfaction metrics) now RELAX or OVERTURN it? Separate: the durable question (likely still open: *can benchmarks measure forward scientific value?*) from perishable limitations (e.g., *do single-turn benchmarks still fail to predict long-horizon behavior?* — cite if this has been solved or hardened).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — papers arguing benchmarks ARE adequate, or that prediction skill generalizes to execution, or that long-horizon LLM workflows are now reliable.
(3) Propose 2 research questions that ASSUME the evaluation regime may have shifted: e.g., *given improved long-context and constraint reasoning, do backward-looking benchmarks now better correlate with forward scientific utility?* or *can forward-looking evaluations be made non-corruptible by LLM judges?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
