SYNTHESIS NOTE

Can live benchmarks prevent contamination in prediction tasks?

Real-time benchmarks that continuously gather questions and verify outcomes could solve the data contamination problem in forecasting evaluation. This matters because leaked training data makes it impossible to know if models truly predict or merely retrieve memorized answers.

Synthesis note · 2026-06-03 · sourced from Evaluations

Future prediction is a hard agent task: it demands analytical thinking, information gathering, and decision-making under uncertainty. Until FutureX there was no large-scale benchmark for it, largely because real-time updates and timely answer retrieval are hard to operate. FutureX's key design choice is that it is a live benchmark: it continuously collects questions from 195 trusted sites, gathers model predictions at each event's start date, and automatically checks actual outcomes. Being live is not a convenience; it is the contamination defense, because a benchmark whose answers do not exist yet cannot leak into training data.
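The collect, predict, then verify ordering described above can be sketched in a few lines. This is a minimal illustration, not FutureX's actual implementation: the names `Question`, `LiveBenchmark`, `submit`, and `resolve`, the separate `resolution_date` field, and the exact-match scoring are all assumptions made for the example. The point it shows is that the contamination defense reduces to a date check, since a prediction is only accepted while the outcome cannot yet exist.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Question:
    """A forecasting question whose answer does not exist yet when it is asked."""
    qid: str
    text: str
    start_date: date        # predictions are collected on this date
    resolution_date: date   # assumed field: the outcome becomes knowable on this date
    outcome: Optional[str] = None  # filled in only after resolution


@dataclass
class LiveBenchmark:
    """Hypothetical live benchmark: questions in, predictions locked, outcomes checked later."""
    questions: dict = field(default_factory=dict)
    predictions: dict = field(default_factory=dict)  # (qid, model) -> answer

    def add_question(self, q: Question) -> None:
        if q.resolution_date <= q.start_date:
            raise ValueError("question must resolve after predictions are collected")
        self.questions[q.qid] = q

    def submit(self, qid: str, model: str, answer: str, today: date) -> None:
        q = self.questions[qid]
        # Contamination guard: reject any prediction made once the outcome could be known.
        if today >= q.resolution_date:
            raise ValueError("prediction arrived after the outcome could be known")
        self.predictions[(qid, model)] = answer

    def resolve(self, qid: str, outcome: str, today: date) -> None:
        q = self.questions[qid]
        if today < q.resolution_date:
            raise ValueError("outcome is not available yet")
        q.outcome = outcome

    def accuracy(self, model: str) -> Optional[float]:
        # Exact-match scoring over resolved questions only (an assumption for this sketch).
        scored = [
            answer == self.questions[qid].outcome
            for (qid, m), answer in self.predictions.items()
            if m == model and self.questions[qid].outcome is not None
        ]
        return sum(scored) / len(scored) if scored else None
```

In this sketch a model has no score until at least one of its questions resolves, which mirrors the operational cost the note mentions: a live benchmark has to wait for the world before it can grade anything.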

The capability finding across 25 models is equally clean: strong base models (e.g., DouBao-Seed1.6) handle straightforward questions, but hard open-ended prediction requires built-in search and reasoning, with deep-research and Think&Search agents (Grok-4, GPT-o4-mini) leading on the hardest tasks. Hard, open-ended forecasting is therefore an agentic capability, not a base-model one.

This pairs directly with Batch 1's evaluation thread. In light of Do automated benchmarks hide what frontier AI systems can really do?, FutureX is a concrete open-world instrument whose live-updating mechanism operationalizes the contamination-free, real-task ideal. It also complements Can frontier exams really measure cutting-edge AI capability?: where HLE restores discrimination on static knowledge, FutureX restores it on dynamic prediction. Finally, it grounds Can LLMs actually forecast time series better than we think?: in both, the gain comes from the search-and-reason workflow.

Inquiring lines that read this note 5

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Why do benchmark improvements fail to reflect actual reasoning quality?

Can single-axis benchmarks accurately predict agent deployment success?

What real-world tasks most clearly expose gaps between benchmark performance and actual capability?

How can identical external performance mask different internal representations?

Why do benchmarks become saturated so quickly after initial launch?

Related concepts in this collection 3

Do automated benchmarks hide what frontier AI systems can really do? Benchmarks optimize for auto-gradable, short, cheap tasks. But real AI capability emerges in long-horizon, messy, open-ended work. How much capability are we missing—or wrongly inflating—by relying on benchmark scores alone?
FutureX is a live, contamination-free instance of the open-world evaluation ideal
Can frontier exams really measure cutting-edge AI capability? Popular benchmarks like MMLU saturate quickly, hiding real capability differences. Can expert-designed closed-ended exams like Humanity's Last Exam discriminate at the frontier, and what would high scores actually tell us about AI systems?
complementary: static frontier exams vs dynamic prediction
Can LLMs actually forecast time series better than we think? Explores whether language models possess stronger forecasting ability than current benchmarks suggest, and what role workflow design plays in revealing or hiding that capability.
both find forecasting gains come from agentic workflow not base model

Original note title

future-prediction benchmarks must be live and contamination-free and open-ended forecasting requires search-and-reasoning agents not base models
