SYNTHESIS NOTE
Agentic Systems and Tool Use

Can live benchmarks prevent contamination in prediction tasks?

Real-time benchmarks that continuously gather questions and verify outcomes could solve the data contamination problem in forecasting evaluation. This matters because leaked training data makes it impossible to know whether models truly predict or merely retrieve memorized answers.

Synthesis note · 2026-06-03 · sourced from Evaluations

Future prediction is a hard agent task: it demands analytical thinking, information gathering, and decision-making under uncertainty. Until FutureX there was no large-scale benchmark for it, largely because real-time question updates and timely answer retrieval are hard to operate. The design choice worth keeping is that FutureX is a live benchmark: it continuously collects questions from 195 trusted sites, gathers model predictions at each event's start date, and automatically checks the actual outcomes. Being live is not a convenience; it is the contamination defense, because a benchmark whose answers do not exist yet cannot leak into training data.
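
The collect, predict, resolve cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not FutureX's actual implementation: the `Question` class, the function names, the exact-match scoring, and the sample question are all hypothetical. The one point it tries to show is the contamination guard, where a prediction is accepted only if it is timestamped before the outcome exists.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple


@dataclass
class Question:
    """Hypothetical record for one live forecasting question."""
    qid: str
    text: str
    start_date: datetime            # predictions are gathered from this moment
    resolve_date: datetime          # the real-world outcome becomes known here
    outcome: Optional[str] = None   # stays None until the event has resolved
    predictions: Dict[str, Tuple[str, datetime]] = field(default_factory=dict)


def record_prediction(q: Question, model: str, answer: str, now: datetime) -> None:
    """Store a prediction only if it arrives before the outcome can exist."""
    if now >= q.resolve_date:
        raise ValueError("prediction arrived after resolution; it may be contaminated")
    q.predictions[model] = (answer, now)


def resolve(q: Question, outcome: str, now: datetime) -> None:
    """Attach the verified outcome once the resolution date has passed."""
    if now < q.resolve_date:
        raise ValueError("outcome is not available yet")
    q.outcome = outcome


def score(q: Question) -> Dict[str, float]:
    """Exact-match score per model (an assumed metric, chosen for simplicity)."""
    if q.outcome is None:
        raise ValueError("question is unresolved")
    return {m: float(ans == q.outcome) for m, (ans, _) in q.predictions.items()}


# Illustrative run with made-up dates and model names.
utc = timezone.utc
q = Question(
    qid="q1",
    text="Will event X happen this week?",
    start_date=datetime(2025, 7, 1, tzinfo=utc),
    resolve_date=datetime(2025, 7, 8, tzinfo=utc),
)
record_prediction(q, "model_a", "yes", datetime(2025, 7, 1, tzinfo=utc))
record_prediction(q, "model_b", "no", datetime(2025, 7, 1, tzinfo=utc))
resolve(q, "yes", datetime(2025, 7, 9, tzinfo=utc))
scores = score(q)
print(scores)  # {'model_a': 1.0, 'model_b': 0.0}
```

Because the timestamp check sits in `record_prediction`, no scored answer can postdate its own ground truth, which is the property a static benchmark cannot guarantee.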

The capability finding across 25 models is equally clean: strong base models (e.g., DouBao-Seed1.6) handle straightforward questions, but hard open-ended prediction requires built-in search and reasoning, with deep-research and Think&Search agents (Grok-4, GPT-o4-mini) leading on the hardest tasks. Forecasting is therefore an agentic capability, not a base-model one.

This pairs directly with Batch 1's evaluation thread. In relation to "Do automated benchmarks hide what frontier AI systems can really do?", FutureX is a concrete open-world instrument whose live-updating mechanism operationalizes the contamination-free, real-task ideal. It also complements "Can frontier exams really measure cutting-edge AI capability?": where HLE restores discrimination on static knowledge, FutureX restores it on dynamic prediction. Finally, it grounds "Can LLMs actually forecast time series better than we think?": the forecasting gain comes from the search-and-reason workflow rather than from the base model alone.

Original note title

future-prediction benchmarks must be live and contamination-free and open-ended forecasting requires search-and-reasoning agents not base models