INQUIRING LINE

How do AI researcher forecasts compare across different timeline question phrasings?

This reads two ways, so let me flag the gap up front: if you mean surveys where human AI researchers give wildly different timelines depending on how the question is worded (the classic 'machine intelligence' vs. 'automate all human jobs' framing effect), the corpus doesn't hold that material — but it has a surprisingly sharp answer to the deeper thing you're circling, which is how much a forecast bends to the way its question is phrased.


Let me be direct about the mismatch first: this library is about machines that forecast, not about polling human AI researchers on when AGI arrives, so it won't tell you whether experts gave 2040 to one phrasing and 2100 to another. What it does have is a strong, repeated finding that forecasts — and AI outputs generally — are extraordinarily sensitive to how the question is set up. That turns out to be the same underlying phenomenon as survey framing effects, just measured on models instead of people.

The clearest version comes from work showing that LLMs are far better forecasters than they look, but only when the question is decomposed correctly (see "Can LLMs actually forecast time series better than we think?"). Ask a model to forecast in one monolithic prompt and the capability stays hidden; split the task so numerical reasoning and contextual reasoning happen separately and the same model performs well (see "Can decomposing forecasting into stages unlock numerical and contextual reasoning?"). The forecast didn't change because the model got smarter; it changed because the framing of the question did. That's framing sensitivity in its purest form.
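To make the idea of separating numerical from contextual reasoning concrete, here is a minimal sketch. It is not the pipeline from the cited notes: the function names, the drift-based extrapolation, and the keyword-based `keyword_judge` (a stand-in for a language-model call) are all assumptions chosen for illustration, and the input numbers are made up.

```python
from statistics import mean
from typing import Callable, Sequence


def numeric_outlook(history: Sequence[float], horizon: int) -> list[float]:
    """Stage 1 (numbers only): extend the average recent step forward."""
    steps = [b - a for a, b in zip(history, history[1:])]
    drift = mean(steps) if steps else 0.0
    return [history[-1] + drift * (h + 1) for h in range(horizon)]


def contextual_adjustment(notes: Sequence[str], judge: Callable[[str], float]) -> float:
    """Stage 2 (context only): fold each note's judged effect into one factor."""
    factor = 1.0
    for note in notes:
        factor *= 1.0 + judge(note)
    return factor


def synthesize(baseline: Sequence[float], factor: float) -> list[float]:
    """Stage 3: combine the numeric baseline with the contextual factor."""
    return [round(x * factor, 2) for x in baseline]


def keyword_judge(note: str) -> float:
    # Hypothetical stand-in for a language-model judgment about one note.
    return 0.10 if "promotion" in note else 0.0


# Illustrative inputs only, not data from any cited study.
baseline = numeric_outlook([100.0, 104.0, 108.0], horizon=2)
forecast = synthesize(baseline, contextual_adjustment(["promotion next week"], keyword_judge))
```

The design point is only that each stage sees one kind of input: the numeric stage never reads text, and the contextual stage never extrapolates numbers.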

Zoom out and the corpus argues this isn't a quirk of forecasting but a property of the medium. AI outputs are described as fundamentally mutable: they shift with sampling, prompt wording, and even the audience reading them, which makes them resist the kind of fixed, repeatable answer you'd want from a forecast (see "Why does AI output change with every prompt and context?"). So if you asked an LLM the same timeline question three different ways, you should expect three different numbers, for structural reasons, not because any one is 'wrong.'

There's a twist that cuts the other way, though. When you ask many different models the same open-ended question, they don't diverge; they converge, producing strikingly similar answers because of overlapping training data and alignment (the "Artificial Hivemind"; see "Do different AI models actually produce diverse outputs?"). So phrasing moves a single model around a lot, while swapping one model for another moves the answer far less. The lesson for anyone treating AI forecasts as a crowd of independent estimates: much of the apparent diversity is illusory, and more of the variance lives in the question than in the forecaster.
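One way to check where the variance lives is to collect forecasts on a phrasing-by-model grid and compare the spread along each axis. The sketch below assumes such a grid already exists; the numbers in `grid` are invented for illustration and are not results from any cited study, and the helper names are hypothetical.

```python
from statistics import mean, pvariance

# Hypothetical forecasts: one row per phrasing, one column per model.
grid = {
    "phrasing_a": [40.0, 41.0, 39.0],
    "phrasing_b": [60.0, 61.0, 59.0],
    "phrasing_c": [80.0, 81.0, 79.0],
}


def spread_by_phrasing(grid: dict[str, list[float]]) -> float:
    """Variance of the per-phrasing means (averaging over models)."""
    return pvariance([mean(row) for row in grid.values()])


def spread_by_model(grid: dict[str, list[float]]) -> float:
    """Variance of the per-model means (averaging over phrasings)."""
    columns = zip(*grid.values())
    return pvariance([mean(col) for col in columns])


phrasing_var = spread_by_phrasing(grid)
model_var = spread_by_model(grid)
print(f"between-phrasing variance: {phrasing_var:.2f}")
print(f"between-model variance:    {model_var:.2f}")
```

If the between-model variance is small relative to the between-phrasing variance, averaging across models adds little independent information, which is the situation the paragraph above describes.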

The thing you may not have known you wanted: retrieval-augmented systems can already forecast real future events at near-human-expert levels (see "Can retrieval-augmented language models forecast like human experts?"), which means the framing question is no longer academic. If a machine forecaster is competitive with human crowds, then how you phrase its question becomes a first-class source of error, exactly as it is in the human expert surveys you were originally asking about.


Sources (5 notes)

Can LLMs actually forecast time series better than we think?

LLMs have stronger intrinsic forecasting ability than recognized, but only when workflows separate numerical reasoning from contextual reasoning. Monolithic prompting obscures this capability; structured decomposition surfaces it.

Can decomposing forecasting into stages unlock numerical and contextual reasoning?

Nexus outperforms pure TSFM and LLM baselines on real-world datasets by decomposing forecasting into contextualization, dual-resolution macro/micro outlook, and synthesis stages. Separating numerical extrapolation from event-driven contextual reasoning avoids forcing one model to handle both simultaneously.

Why does AI output change with every prompt and context?

AI outputs exhibit essential mutability—they vary with sampling, prompt wording, and audience interpretation. This is not a defect but a defining feature of tokens as media, making them fundamentally different from fixed commodities and resistant to traditional quality assurance.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Can retrieval-augmented language models forecast like human experts?

A retrieval-augmented LM system achieved near-parity with competitive human forecasters on real forecasting questions published after model training cutoffs, sometimes surpassing human crowds. Newer model generations naturally improved forecasting without domain-specific tuning.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher evaluating whether framing effects on model forecasts remain a constraint or have been structurally relaxed. The question: Do different phrasings of the same timeline question produce materially different AI forecasts, and does this limit their reliability as independent estimates?

What a curated library found — and when (dated claims, not current truth):

Findings span 2023–2026. Key constraints identified:
• LLM forecasts are highly sensitive to question decomposition: monolithic prompts hide capability, and performance improves markedly when numerical and contextual reasoning are handled in separate steps (2024-02).
• Single-model outputs shift substantially with sampling, prompt wording, and audience context — outputs are "fundamentally mutable" rather than fixed (inferred from 2025-10).
• Different LLMs independently converge on similar answers to identical questions, attributed to overlapping training data and alignment; swapping models therefore adds little variance compared with rephrasing the question (Artificial Hivemind, 2025-10).
• Retrieval-augmented systems forecast real future events near human-expert levels (2024-02), elevating phrasing error from academic to operational concern.
• Multi-agent debate and uncertainty-aware planning methods (2023-05, 2024-02) show structural gains, but unclear whether they dissolve sensitivity or merely reduce it.

Anchor papers (verify; mind their dates):
• 2402.18563 – Approaching Human-Level Forecasting with Language Models (retrieval-augmented forecasting finding)
• 2510.22954 – Artificial Hivemind: The Open-Ended Homogeneity of Language Models (2025-10)
• 2605.14389 – Nexus: An Agentic Framework for Time Series Forecasting (2026-05)
• 2604.07981 – A Decomposition Perspective to Long-context Reasoning for LLMs (2026-04)

Your task:
(1) RE-TEST EACH CONSTRAINT. Cite what relaxed each constraint, or state plainly where it still applies.
• Decomposition sensitivity: have newer orchestration layers (agentic frameworks, memory systems, caching) made framing transparent or automatic, or does manual prompt engineering still dominate real deployments?
• Mutable outputs: do recent inference methods (speculative decoding, deterministic sampling) or fine-tuning strategies (DPO, RLHF variants post-2025) anchor outputs more reliably, or does sensitivity persist?
• Hivemind finding: does emerging model diversity (open-weight variants, domain-specific finetunes, multi-modal fusion) fracture the convergence, or does it hold?
(2) Surface the strongest contradicting or superseding work from the last ~6 months on forecast reproducibility, adversarial prompting, or controllable generation that reframes the sensitivity question.
(3) Propose 2 research questions assuming the regime has moved: (a) If decomposition or agentic scaffolding has automated away framing sensitivity, what NEW source of variance has emerged in competitive forecasting systems? (b) If the Hivemind effect still dominates, can you engineer INTENTIONAL divergence (heterogeneous prompting, diversity-penalty training) to recover independent estimates, and does that improve ensemble forecast quality?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
