How do AI researcher forecasts compare across different timeline question phrasings?
This reads two ways, so let me flag the gap up front. If you mean surveys where human AI researchers give wildly different timelines depending on how the question is worded (the classic 'machine intelligence' vs. 'automate all human jobs' framing effect), the corpus doesn't hold that material. It does have a surprisingly sharp answer to the deeper thing you're circling: how much a forecast bends to the way its question is phrased.
To be specific about the mismatch: this library is about machines that forecast, not about polling human AI researchers on when AGI arrives, so it won't tell you whether experts gave 2040 to one phrasing and 2100 to another. What it does have is a strong, repeated finding that forecasts, and AI outputs generally, are highly sensitive to how the question is set up. That is closely analogous to survey framing effects, just measured on models instead of people.
The clearest version comes from work showing that LLMs are far better forecasters than they look, but only when the task is decomposed correctly (Can LLMs actually forecast time series better than we think?). Ask a model to forecast in one monolithic prompt and the capability stays hidden; split the task so numerical reasoning and contextual reasoning happen separately and the same model performs markedly better (Can decomposing forecasting into stages unlock numerical and contextual reasoning?). The forecast didn't change because the model got smarter; it changed because the framing of the task did. That's framing-sensitivity in its purest form.
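To make the monolithic-versus-decomposed contrast concrete, here is a minimal sketch of the two prompting shapes. It is an illustration under stated assumptions, not the method used in the cited notes: the `Model` type, both function names, and all prompt wording are hypothetical, and the three stages only loosely mirror the "numerical, contextual, synthesis" split described in the sources.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a text reply.
Model = Callable[[str], str]


def monolithic_forecast(model: Model, series: List[float], context: str) -> str:
    """One prompt that asks for numerical and contextual reasoning at once."""
    prompt = (
        f"Context: {context}\n"
        f"Series: {series}\n"
        "Forecast the next value."
    )
    return model(prompt)


def staged_forecast(model: Model, series: List[float], context: str) -> str:
    """Three prompts: numbers only, context only, then a synthesis step."""
    # Stage 1 sees the numbers but not the context.
    numeric = model(
        f"Series: {series}\n"
        "Extrapolate the next value from the numbers alone."
    )
    # Stage 2 sees the context but not the numbers.
    contextual = model(
        f"Context: {context}\n"
        "Describe how these events should shift the trend."
    )
    # Stage 3 combines the two intermediate answers.
    return model(
        f"Numerical baseline: {numeric}\n"
        f"Contextual adjustment: {contextual}\n"
        "Combine these into one final forecast."
    )
```

The same model and the same inputs go into both functions; only the way the question is posed differs, which is the variable the paragraph above is pointing at.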
Zoom out and the corpus argues this isn't a quirk of forecasting but a property of the medium. AI outputs are described as fundamentally mutable: they shift with sampling, prompt wording, and even the audience reading them, which makes them resist the kind of fixed, repeatable answer you'd want from a forecast (Why does AI output change with every prompt and context?). So if you asked an LLM the same timeline question three different ways, you should expect three different numbers, for structural reasons, not because any one is 'wrong.'
There's a twist that cuts the other way, though. When you ask many different models the same open-ended question, they don't diverge. They converge, producing strikingly similar answers because of overlapping training data and alignment, an effect dubbed the 'Artificial Hivemind' (Do different AI models actually produce diverse outputs?). So phrasing moves a single model around a lot, while swapping one model for another moves the answer far less than you'd expect. The lesson for anyone treating AI forecasts as a crowd of independent estimates: the diversity is likely to be mostly illusory, and more of the variance lives in the question than in the forecaster.
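One way to check where the variance lives is to lay forecasts out as a models-by-phrasings table and compare the spread along each axis. The sketch below does that with invented numbers: the table, the model labels, and the function names are all hypothetical, and the values are placeholders chosen for the arithmetic, not results from any of the cited notes. It is a crude comparison of means, not a formal analysis of variance.

```python
from statistics import mean, pvariance

# Invented numbers for illustration only; these are NOT results from any source.
# Each row is a hypothetical model; each position in a row is a hypothetical
# phrasing of the same timeline question (answers given as years).
forecasts = {
    "model_a": [2040, 2060, 2090],
    "model_b": [2042, 2058, 2092],
    "model_c": [2038, 2062, 2088],
}


def variance_across_phrasings(table):
    """Variance of the per-phrasing means (each averaged over models)."""
    per_phrasing = [mean(column) for column in zip(*table.values())]
    return pvariance(per_phrasing)


def variance_across_models(table):
    """Variance of the per-model means (each averaged over phrasings)."""
    per_model = [mean(row) for row in table.values()]
    return pvariance(per_model)


phrasing_var = variance_across_phrasings(forecasts)
model_var = variance_across_models(forecasts)
print(f"across phrasings: {phrasing_var:.2f}")
print(f"across models:    {model_var:.2f}")
```

With a table shaped like this one, the spread across phrasings dwarfs the spread across models, which is the pattern the paragraph describes. Whether real forecasts look the same is an empirical question that this toy example does not answer.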
The thing you may not have known you wanted: retrieval-augmented systems can already forecast real future events at near-parity with competitive human forecasters (Can retrieval-augmented language models forecast like human experts?), which means the framing question is no longer academic. If a machine forecaster is competitive with human crowds, then how you phrase its question becomes a first-class source of error, exactly as it is in the human expert surveys you were originally asking about.
Sources (5 notes)
LLMs have stronger intrinsic forecasting ability than recognized, but only when workflows separate numerical reasoning from contextual reasoning. Monolithic prompting obscures this capability; structured decomposition surfaces it.
Nexus outperforms pure TSFM and LLM baselines on real-world datasets by decomposing forecasting into contextualization, dual-resolution macro/micro outlook, and synthesis stages. Separating numerical extrapolation from event-driven contextual reasoning avoids forcing one model to handle both simultaneously.
AI outputs exhibit essential mutability—they vary with sampling, prompt wording, and audience interpretation. This is not a defect but a defining feature of tokens as media, making them fundamentally different from fixed commodities and resistant to traditional quality assurance.
INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.
A retrieval-augmented LM system achieved near-parity with competitive human forecasters on real forecasting questions published after model training cutoffs, sometimes surpassing human crowds. Newer model generations naturally improved forecasting without domain-specific tuning.