Can LLMs predict novel scientific results better than experts?
Do language models excel at forecasting experimental outcomes in neuroscience when given only method descriptions? If so, that challenges the assumption that LLMs are mere knowledge retrievers rather than pattern integrators.
BrainBench (Luo et al., 2024) creates a forward-looking benchmark where the task is predicting neuroscience experimental results from methods descriptions. Two versions of an abstract — one with real results, one with altered results — test whether the model can identify which results actually occurred.
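The two-version comparison can be read as a forced choice between abstracts. The sketch below is a minimal illustration, not the benchmark's actual implementation. It assumes a scoring function (the hypothetical `token_logprobs`) that returns per-token log-probabilities from some language model, and it assumes the model's "choice" is whichever version has lower perplexity. A toy scorer and an invented abstract pair stand in for a real model and real data so the example runs on its own.

```python
import math
from typing import Callable, Sequence, Tuple

# Hypothetical interface: maps a text to per-token log-probabilities from an LLM.
LogProbFn = Callable[[str], Sequence[float]]

def perplexity(logprobs: Sequence[float]) -> float:
    """Perplexity is exp of the mean negative log-probability per token."""
    return math.exp(-sum(logprobs) / len(logprobs))

def choose_version(original: str, altered: str, token_logprobs: LogProbFn) -> str:
    """Pick the version the model finds less surprising (assumed decision rule)."""
    ppl_original = perplexity(token_logprobs(original))
    ppl_altered = perplexity(token_logprobs(altered))
    return "original" if ppl_original < ppl_altered else "altered"

def accuracy(pairs: Sequence[Tuple[str, str]], token_logprobs: LogProbFn) -> float:
    """Fraction of (original, altered) pairs where the real abstract is chosen."""
    correct = sum(
        choose_version(orig, alt, token_logprobs) == "original"
        for orig, alt in pairs
    )
    return correct / len(pairs)

def toy_logprobs(text: str) -> Sequence[float]:
    """Toy stand-in for a model: treats the word 'decreased' as unlikely."""
    return [-3.0 if word == "decreased" else -1.0 for word in text.split()]

# Invented example pair, for illustration only.
pairs = [
    ("stimulation increased firing rates", "stimulation decreased firing rates"),
]
print(accuracy(pairs, toy_logprobs))
```

With a real model, `toy_logprobs` would be replaced by a function that tokenizes the abstract and returns the model's log-probability for each token.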
The finding: LLMs surpass human neuroscience experts at this task. BrainGPT, an LLM fine-tuned on the neuroscience literature, performs better still. Like human experts, when LLMs indicate high confidence, their predictions are more likely to be correct.
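One way to check the confidence claim is to bin predictions by confidence and compare accuracy across bins. The sketch below assumes each trial has already been reduced to a `(confidence, correct)` pair. How confidence is derived (for instance, from the gap between the two versions' scores) is an assumption and is not specified in this note. The trial values are invented for illustration.

```python
from typing import List, Tuple

def accuracy_by_confidence(
    trials: List[Tuple[float, bool]], n_bins: int = 2
) -> List[float]:
    """Sort trials by confidence, split them into bins from low to high,
    and return the accuracy within each bin."""
    if len(trials) < n_bins:
        raise ValueError("need at least one trial per bin")
    ranked = sorted(trials, key=lambda trial: trial[0])
    size = len(ranked) // n_bins
    accuracies = []
    for i in range(n_bins):
        start = i * size
        # The last bin absorbs any remainder so no trial is dropped.
        end = len(ranked) if i == n_bins - 1 else start + size
        chunk = ranked[start:end]
        accuracies.append(sum(correct for _, correct in chunk) / len(chunk))
    return accuracies

# Invented (confidence, correct) pairs, for illustration only.
trials = [
    (0.1, False), (0.2, True), (0.3, False),
    (0.7, True), (0.8, True), (0.9, True),
]
print(accuracy_by_confidence(trials))
```

If confidence is informative, accuracy should rise from the low-confidence bin to the high-confidence bin, as it does in this toy data.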
The conceptual reframe is the real contribution. Most LLM benchmarks are backward-looking: they test whether models can retrieve or reason about known information. On backward-looking tasks, the model's tendency to "mix and integrate information from large and noisy datasets" is a failure mode — it produces hallucinations. But on forward-looking tasks — predicting novel outcomes — this same tendency becomes a virtue. Integration across noisy, interrelated findings IS what prediction requires.
This means hallucination and prediction may be mechanistically identical: both involve generating outputs that go beyond the literal input by drawing on patterns across training data. The difference is entirely in the task framing. When we ask "what did the paper find?" and the model generates a plausible-but-wrong answer, we call it hallucination. When we ask "what will this experiment find?" and the model generates a plausible-and-right answer, we call it prediction. The underlying computation may be the same.
This has implications for the fabrication/hallucination terminology debate. In light of the question "Should we call LLM errors hallucinations or fabrications?", the BrainBench finding suggests fabrication has a productive mode: fabrication in the service of prediction. The model fabricates (generates non-input-grounded content) in both cases, but one fabrication happens to be correct because it aligns with real-world patterns the model has internalized.
The practical implication: evaluating LLMs solely on backward-looking benchmarks systematically underestimates their value for forward-looking scientific tasks. The "practice of science and the pace of discovery would radically change" if LLMs were treated as prediction engines rather than knowledge retrieval systems.
Inquiring lines that use this note as a source (26)
This note is a source for these synthesized inquiries.
- How do constrained versus unconstrained domains flip LLM novelty patterns?
- Why does combining natural language with numerical scores improve prediction accuracy?
- Why do LLM outputs match researcher priors without solving tasks correctly?
- How can LLMs evaluate their own creative outputs for utility and novelty?
- Why does LLM knowledge fail to influence their actual outputs?
- Why do users experience LLMs as peers rather than statistical tools?
- How do expert priors constrain human researchers from exploring novel concepts?
- Does social integration of LLMs increase their capacity to influence technological futures?
- Do LLMs rely on surface statistical patterns instead of causal structure?
- Can LLMs generate more novel research ideas than human experts?
- Which LLM backends produce the most executable research ideas?
- What makes LLMs media rather than tools that deliver intelligence?
- Why do rare cases in medicine and science require models that preserve tail distributions?
- Why do backward-looking benchmarks underestimate LLM scientific value?
- Do LLMs need world models to make accurate predictions?
- Does sequence prediction accuracy prove an underlying world model exists?
- Do monolithic prompts underutilize LLM strengths in forecasting workflows?
- Do interaction effects between research mechanisms depend on the task domain?
- What distinguishes scientific plausibility from cognitive availability in research ideas?
- Can experimental outcomes be reliably distilled into reusable insights?
- Can ranking by coherence while minimizing author-community coverage find novel research?
- Do newer language model generations improve forecasting ability without additional training?
- Can language models match competitive crowd forecasters on real future events?
- How much does domain expertise actually improve human forecasting under uncertainty?
- Why does LLM performance improve when forecasting tasks include organized reasoning?
- What capability boundary exists in LLM prediction of effect sizes?
Related concepts in this collection (4)
- Should we call LLM errors hallucinations or fabrications?
  Does the language we use to describe LLM failures shape the technical solutions we build? Examining whether perceptual and psychological frameworks misdiagnose what's actually happening.
  fabrication has a productive mode when task is forward-looking
- Can any computable LLM truly avoid hallucinating?
  Explores whether formal theorems prove hallucination is mathematically inevitable for all computable language models, regardless of their design or training approach.
  formal inevitability may be a feature for prediction tasks, not just a bug
- Why do LLMs struggle to connect unrelated entities speculatively?
  LLMs reliably organize and summarize evidence but fail when asked to speculate about connections between dissimilar entities. Understanding this failure could reveal fundamental limits in how models handle complex analytical reasoning.
  BrainBench suggests predictive organization CAN succeed where speculative connection fails
- Do foundation models learn world models or task-specific shortcuts?
  When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
  prediction success doesn't require world models; heuristic integration of patterns suffices
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Large language models surpass human experts in predicting neuroscience results
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
- Logical Reasoning in Large Language Models: A Survey
- Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
- Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
- From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
- Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
Original note title
what is hallucination in a backward-looking task is generalization in a forward-looking task — LLMs surpass human experts at predicting neuroscience results