SYNTHESIS NOTE

Can LLMs predict novel scientific results better than experts?

Do language models excel at forecasting experimental outcomes in neuroscience when given only method descriptions? This challenges the assumption that LLMs are mere knowledge retrievers rather than pattern integrators.

Synthesis note · 2026-03-28 · sourced from Evaluations
What kind of thing is an LLM really? What do language models actually know?

BrainBench (Luo et al., 2024) creates a forward-looking benchmark where the task is predicting neuroscience experimental results from methods descriptions. Two versions of an abstract — one with real results, one with altered results — test whether the model can identify which results actually occurred.
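A minimal sketch of how one two-version item could be scored. The note does not say how the model's choice is read out, so the rule used here (pick the version with lower perplexity, and treat the perplexity gap as confidence) is an assumption, and the names `perplexity` and `choose_version` are hypothetical. The per-token log-probabilities are assumed to come from whatever model is being evaluated.

```python
import math
from typing import Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a text given its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def choose_version(
    logprobs_a: Sequence[float], logprobs_b: Sequence[float]
) -> Tuple[str, float]:
    """Pick the abstract version the model finds less surprising.

    Assumed decision rule: lower perplexity wins. The absolute
    perplexity gap is returned as a rough confidence score.
    """
    ppl_a = perplexity(logprobs_a)
    ppl_b = perplexity(logprobs_b)
    choice = "a" if ppl_a <= ppl_b else "b"
    return choice, abs(ppl_a - ppl_b)


# Toy log-probabilities for illustration only (not benchmark data).
version_a = [math.log(0.5)] * 4   # model assigns 0.5 to every token
version_b = [math.log(0.25)] * 4  # model assigns 0.25 to every token
choice, confidence = choose_version(version_a, version_b)
```

Benchmark accuracy is then the fraction of items where the chosen version is the one with the real results.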

The finding: LLMs surpass human neuroscience experts at this task. BrainGPT, an LLM fine-tuned on the neuroscience literature, performs better still. As with human experts, LLM predictions are more likely to be correct when the model indicates high confidence.
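The confidence claim can be checked with a simple calibration pass: sort predictions by confidence, split them into equal-count bins, and compare accuracy across bins. The sketch below is one way to do that, not the procedure used in the paper; `accuracy_by_confidence` is a hypothetical helper and the sample values are made up for illustration.

```python
from typing import List, Sequence, Tuple


def accuracy_by_confidence(
    results: Sequence[Tuple[float, bool]], n_bins: int = 2
) -> List[float]:
    """Accuracy per confidence bin, ordered from lowest to highest confidence.

    `results` holds (confidence, was_correct) pairs. Items are sorted by
    confidence and split into `n_bins` equal-count bins; any remainder
    goes into the last (highest-confidence) bin.
    """
    if n_bins < 1 or len(results) < n_bins:
        raise ValueError("need at least one item per bin")
    ranked = sorted(results, key=lambda pair: pair[0])
    size = len(ranked) // n_bins
    accuracies = []
    for i in range(n_bins):
        start = i * size
        end = len(ranked) if i == n_bins - 1 else start + size
        chunk = ranked[start:end]
        accuracies.append(sum(1 for _, ok in chunk if ok) / len(chunk))
    return accuracies


# Illustrative values only, not results from the benchmark.
sample = [(0.9, True), (0.1, False), (0.8, True), (0.2, True)]
bins = accuracy_by_confidence(sample, n_bins=2)
```

If confidence is informative, accuracy should rise from the first bin to the last.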

The conceptual reframe is the real contribution. Most LLM benchmarks are backward-looking: they test whether models can retrieve or reason about known information. On backward-looking tasks, the model's tendency to "mix and integrate information from large and noisy datasets" is a failure mode — it produces hallucinations. But on forward-looking tasks — predicting novel outcomes — this same tendency becomes a virtue. Integration across noisy, interrelated findings IS what prediction requires.

This means hallucination and prediction may be mechanistically identical: both involve generating outputs that go beyond the literal input by drawing on patterns across training data. On this reading, the difference lies entirely in the task framing. When we ask "what did the paper find?" and the model generates a plausible-but-wrong answer, we call it hallucination. When we ask "what will this experiment find?" and the model generates a plausible-and-right answer, we call it prediction. The underlying computation may be the same.

This has implications for the fabrication/hallucination terminology debate. In light of the related note "Should we call LLM errors hallucinations or fabrications?", the BrainBench finding suggests fabrication has a productive mode: fabrication in the service of prediction. The model fabricates (generates content not grounded in the input) in both cases, but one fabrication happens to be correct because it aligns with real-world patterns the model has internalized.

The practical implication: evaluating LLMs solely on backward-looking benchmarks systematically underestimates their value for forward-looking scientific tasks. The "practice of science and the pace of discovery would radically change" if LLMs are treated as prediction engines rather than knowledge retrieval systems.

Original note title

what is hallucination in a backward-looking task is generalization in a forward-looking task — LLMs surpass human experts at predicting neuroscience results