SYNTHESIS NOTE

Can LLMs predict novel scientific results better than experts?

Do language models excel at forecasting experimental outcomes in neuroscience when given only method descriptions? If they do, it challenges the assumption that LLMs are mere knowledge retrievers rather than pattern integrators.

Synthesis note · 2026-03-28 · sourced from Evaluations

BrainBench (Luo et al., 2024) is a forward-looking benchmark: the task is to predict neuroscience experimental results from methods descriptions. Each item presents two versions of an abstract, one with the real results and one with altered results, and tests whether the model can identify which results actually occurred.
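One way to make the two-version test concrete is to score each version by perplexity and pick the less surprising one. The sketch below assumes that scoring rule, which this note does not specify. The function names and the idea of passing in per-token log-probabilities (obtained from whatever model is under test) are illustrative and are not taken from the benchmark's code.

```python
import math
from typing import Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity: exp of the mean negative log-probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def choose_abstract(
    logprobs_a: Sequence[float], logprobs_b: Sequence[float]
) -> Tuple[str, float]:
    """Pick the abstract version the model finds less surprising.

    Assumption: lower perplexity means "the model expects this version".
    Returns the chosen label and a confidence score, defined here
    (hypothetically) as the absolute perplexity gap between versions.
    """
    ppl_a = perplexity(logprobs_a)
    ppl_b = perplexity(logprobs_b)
    choice = "A" if ppl_a <= ppl_b else "B"
    confidence = abs(ppl_a - ppl_b)
    return choice, confidence


# Made-up log-probabilities: version A is less surprising than version B.
choice, confidence = choose_abstract([-1.0, -1.0], [-2.0, -2.0])
```

Defining confidence as the perplexity gap is also an assumption. It yields a per-item score that can later be compared against correctness.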

The finding: LLMs surpass human neuroscience experts at this task. BrainGPT, an LLM fine-tuned on the neuroscience literature, performs better still. As with human experts, LLM predictions are more likely to be correct when the model indicates high confidence.
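The confidence claim can be checked with a simple calibration split: sort predictions by confidence, cut them into groups, and compare accuracy across groups. This is a minimal sketch, assuming each prediction is recorded as a (confidence, correct) pair. The function name and the example records are made up for illustration and are not BrainBench data.

```python
from typing import List, Sequence, Tuple


def accuracy_by_confidence(
    records: Sequence[Tuple[float, bool]], n_bins: int = 2
) -> List[float]:
    """Accuracy per confidence group, lowest-confidence group first.

    records: (confidence, was_correct) pairs, one per benchmark item.
    The items are sorted by confidence and split into n_bins groups of
    near-equal size. A well-calibrated predictor should show accuracy
    rising from the first group to the last.
    """
    if n_bins < 1 or len(records) < n_bins:
        raise ValueError("need at least one record per bin")
    ordered = sorted(records, key=lambda r: r[0])
    accuracies = []
    for i in range(n_bins):
        start = i * len(ordered) // n_bins
        end = (i + 1) * len(ordered) // n_bins
        group = ordered[start:end]
        correct = sum(1 for _, ok in group if ok)
        accuracies.append(correct / len(group))
    return accuracies


# Hypothetical records: the two high-confidence answers are both right.
example = [(0.1, False), (0.2, True), (0.8, True), (0.9, True)]
result = accuracy_by_confidence(example, n_bins=2)  # -> [0.5, 1.0]
```

Equal-size groups are used here rather than fixed confidence thresholds so that the split does not depend on the scale of the confidence score.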

The conceptual reframe is the real contribution. Most LLM benchmarks are backward-looking: they test whether models can retrieve or reason about known information. On backward-looking tasks, the model's tendency to "mix and integrate information from large and noisy datasets" is a failure mode — it produces hallucinations. But on forward-looking tasks — predicting novel outcomes — this same tendency becomes a virtue. Integration across noisy, interrelated findings IS what prediction requires.

This means hallucination and prediction may be mechanistically identical: both involve generating outputs that go beyond the literal input by drawing on patterns across training data. The difference is entirely in the task framing. When we ask "what did the paper find?" and the model generates a plausible-but-wrong answer, we call it hallucination. When we ask "what will this experiment find?" and the model generates a plausible-and-right answer, we call it prediction. The underlying computation may be the same.

This has implications for the fabrication/hallucination terminology debate. In light of the related note "Should we call LLM errors hallucinations or fabrications?", the BrainBench finding suggests fabrication has a productive mode: fabrication in the service of prediction. The model fabricates (generates content not grounded in the input) in both cases, but one fabrication happens to be correct because it aligns with real-world patterns the model has internalized.

The practical implication: evaluating LLMs solely on backward-looking benchmarks systematically underestimates their value for forward-looking scientific tasks. The "practice of science and the pace of discovery would radically change" if LLMs are treated as prediction engines rather than knowledge retrieval systems.

Inquiring lines that read this note (26)

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do language models inherit human biases from training data?

What properties determine whether reward signals teach genuine reasoning?

Why does combining natural language with numerical scores improve prediction accuracy?

How do evaluation biases undermine LLM quality assessment systems?

Why can LLMs generate ideas better than they evaluate them?

How can LLMs evaluate their own creative outputs for utility and novelty?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

Can AI-generated outputs constitute genuine knowledge or valid claims?

How do expert priors constrain human researchers from exploring novel concepts?

How do language models establish social grounding in human dialogue?

Does social integration of LLMs increase their capacity to influence technological futures?

Do language models develop causal world models or rely on statistical patterns?

Why do LLM research ideas score high on novelty yet collapse into low diversity?

How can identical external performance mask different internal representations?

Why do rare cases in medicine and science require models that preserve tail distributions?

Can next-token prediction alone produce genuine language understanding?

Can prompting strategies overcome LLM biases without model fine-tuning?

Do monolithic prompts underutilize LLM strengths in forecasting workflows?

What determines success in training models on multiple tasks?

Do interaction effects between research mechanisms depend on the task domain?

How does memorization interact with learning and generalization?

Can experimental outcomes be reliably distilled into reusable insights?

What structural factors drive popularity bias in recommendation systems?

Can ranking by coherence while minimizing author-community coverage find novel research?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Do newer language model generations improve forecasting ability without additional training?

Does AI fluency substitute for verifiable accuracy in human judgment?

How much does domain expertise actually improve human forecasting under uncertainty?

Related concepts in this collection (4)

Should we call LLM errors hallucinations or fabrications?
Does the language we use to describe LLM failures shape the technical solutions we build? Examining whether perceptual and psychological frameworks misdiagnose what's actually happening.
fabrication has a productive mode when task is forward-looking

Can any computable LLM truly avoid hallucinating?
Explores whether formal theorems prove hallucination is mathematically inevitable for all computable language models, regardless of their design or training approach.
formal inevitability may be a feature for prediction tasks, not just a bug

Why do LLMs struggle to connect unrelated entities speculatively?
LLMs reliably organize and summarize evidence but fail when asked to speculate about connections between dissimilar entities. Understanding this failure could reveal fundamental limits in how models handle complex analytical reasoning.
BrainBench suggests predictive organization CAN succeed where speculative connection fails

Do foundation models learn world models or task-specific shortcuts?
When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
prediction success doesn't require world models; heuristic integration of patterns suffices

Original note title

what is hallucination in a backward-looking task is generalization in a forward-looking task — LLMs surpass human experts at predicting neuroscience results
