INQUIRING LINE

Do newer language model generations improve forecasting ability without additional training?

This explores whether forecasting skill rides along for free as labs ship newer base models — and the corpus says yes, but with a twist that's more interesting than the question expects.


The short answer the corpus gives is yes, but it buries a more useful finding underneath. The clearest direct evidence comes from a retrieval-augmented forecasting system that reached near-parity with competitive human forecasters on real questions published *after* the models' training cutoffs, sometimes beating the crowd; in that same system, newer model generations improved forecasting accuracy without any domain-specific tuning (Can retrieval-augmented language models forecast like human experts?). So generational lift is real: the same scaffold gets sharper as the underlying model gets better.

But the corpus keeps pointing past raw model strength toward *how you structure the task*. One line of work finds that LLMs have far stronger intrinsic forecasting ability than benchmarks suggest, but only when the workflow separates numerical reasoning from contextual reasoning; ask a model to do both at once in a single prompt and the capability stays hidden (Can LLMs actually forecast time series better than we think?). A related system, Nexus, beats both pure time-series models and plain LLMs by splitting forecasting into distinct stages: contextualize, then form a macro/micro outlook, then synthesize (Can decomposing forecasting into stages unlock numerical and contextual reasoning?). The implication: a lot of what looks like "this generation can't forecast" is actually "nobody decomposed the task." Architecture can unlock more than a model upgrade.
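The staged split described above can be sketched in a few lines. This is a minimal illustration of separating numerical extrapolation from contextual reasoning, not the Nexus implementation: every name here (`Outlook`, `numeric_outlook`, `contextualize`, `synthesize`, `keyword_scorer`) is hypothetical, the macro/micro outlook is reduced to two slopes, and the keyword scorer is a stand-in for what would be a model call in a real system.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass
class Outlook:
    macro: float  # per-step slope over the whole history
    micro: float  # per-step slope over the most recent window


def numeric_outlook(history: Sequence[float], window: int = 3) -> Outlook:
    """Numerical stage: extrapolate from the numbers alone, no text involved."""
    if len(history) < 2 or window < 2:
        raise ValueError("need at least two points and window >= 2")
    recent = history[-window:]
    macro = (history[-1] - history[0]) / (len(history) - 1)
    micro = (recent[-1] - recent[0]) / (len(recent) - 1)
    return Outlook(macro=macro, micro=micro)


def contextualize(events: Sequence[str],
                  score_event: Callable[[str], float]) -> float:
    """Contextual stage: reduce event text to one fractional adjustment.

    `score_event` is an assumed text -> float function standing in for a model.
    """
    scores = [score_event(e) for e in events]
    return mean(scores) if scores else 0.0


def synthesize(history: Sequence[float], outlook: Outlook,
               context_adj: float) -> float:
    """Synthesis stage: combine the numeric baseline with the adjustment."""
    slope = 0.5 * outlook.macro + 0.5 * outlook.micro  # assumed equal blend
    baseline = history[-1] + slope
    return baseline * (1.0 + context_adj)


def keyword_scorer(event: str) -> float:
    """Toy scorer (hypothetical): a real system would query a model here."""
    return -0.1 if "strike" in event.lower() else 0.0


history = [10.0, 12.0, 14.0, 16.0]
events = ["Port strike announced"]
outlook = numeric_outlook(history)
forecast = synthesize(history, outlook, contextualize(events, keyword_scorer))
print(round(forecast, 2))
```

The point of the structure is that the numeric stage never sees text and the contextual stage never sees the series, so each can be swapped or evaluated on its own.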

There's also a domain dependence the question hides. In fields where human experts only modestly beat chance (venture capital founder-success prediction, for instance), even raw, untuned LLM capability clears the human bar, with one model hitting 6× the market-index precision (Can language models beat human venture capital experts?). And in forward-looking scientific prediction, the very pattern-completion habit that produces hallucination on backward-looking retrieval becomes genuine foresight: fine-tuned models out-predicted neuroscientists on which experimental results actually occurred (Can LLMs predict novel scientific results better than experts?). Forecasting, in other words, may be less a special skill and more a reframing of what these models already do.

The counterweight is worth knowing before you bet on "just wait for the next model." Scaling isn't a universal solvent: on genuine constrained-optimization tasks LLMs plateau at roughly 55–60% constraint satisfaction regardless of architecture, parameter count, or training regime, a ceiling rather than a gap (Do larger language models solve constrained optimization better?). Prompting and prompt optimization can only reorganize knowledge already in the training distribution; they can't inject what the model never learned (Can prompt optimization teach models knowledge they lack?). And a persistently undertrained dimension is calibration: small models trained to know when to abstain can match models 10× their size on conversation forecasting, which suggests standard generational upgrades don't automatically teach a model *when to shut up* (Can models learn to abstain when uncertain about predictions?).
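To make the calibration and abstention lever concrete, here is a small sketch of two common measurements: the Brier score for probabilistic forecasts, and coverage versus accuracy when a forecaster is allowed to abstain near 50%. It does not reproduce the training objective of the abstention work cited above; the function names, the `margin` threshold, and the sample numbers are all illustrative assumptions.

```python
from typing import Optional, Sequence, Tuple


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def predict_or_abstain(p: float, margin: float = 0.2) -> Optional[int]:
    """Answer 1 or 0 only when p is at least `margin` away from 0.5."""
    if abs(p - 0.5) < margin:
        return None  # abstain: the forecast is too close to a coin flip
    return 1 if p > 0.5 else 0


def selective_accuracy(probs: Sequence[float], outcomes: Sequence[int],
                       margin: float = 0.2) -> Tuple[float, Optional[float]]:
    """Return (coverage, accuracy on answered questions).

    Accuracy is None when the forecaster abstains on every question.
    """
    answered = [(predict_or_abstain(p, margin), o)
                for p, o in zip(probs, outcomes)]
    answered = [(a, o) for a, o in answered if a is not None]
    coverage = len(answered) / len(probs)
    if not answered:
        return coverage, None
    accuracy = sum(1 for a, o in answered if a == o) / len(answered)
    return coverage, accuracy


# Illustrative numbers only, not taken from any paper.
probs = [0.9, 0.6, 0.2, 0.55]
outcomes = [1, 0, 0, 1]
print(brier_score(probs, outcomes))
print(selective_accuracy(probs, outcomes))
```

Raising `margin` trades coverage for accuracy on the questions that do get answered, which is the trade a model trained to abstain is making implicitly.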

So the honest synthesis is layered: newer generations do improve forecasting for free where the signal is already latent in their training and the task is framed to surface it — but generational lift, task decomposition, and calibration are three separate levers, and the corpus repeatedly finds the second one (how you structure the workflow) doing more work than the first one (which model you loaded).


Sources (8 notes)

Can retrieval-augmented language models forecast like human experts?

A retrieval-augmented LM system achieved near-parity with competitive human forecasters on real forecasting questions published after model training cutoffs, sometimes surpassing human crowds. Newer model generations naturally improved forecasting without domain-specific tuning.

Can LLMs actually forecast time series better than we think?

LLMs have stronger intrinsic forecasting ability than recognized, but only when workflows separate numerical reasoning from contextual reasoning. Monolithic prompting obscures this capability; structured decomposition surfaces it.

Can decomposing forecasting into stages unlock numerical and contextual reasoning?

Nexus outperforms pure TSFM and LLM baselines on real-world datasets by decomposing forecasting into contextualization, dual-resolution macro/micro outlook, and synthesis stages. Separating numerical extrapolation from event-driven contextual reasoning avoids forcing one model to handle both simultaneously.

Can language models beat human venture capital experts?

VCBench shows several LLMs exceed human baselines in founder-success prediction, with DeepSeek-V3 achieving 6× market-index precision. In sparse-signal forecasting where experts only modestly beat chance, even raw LLM capability suffices to clear the human bar.

Can LLMs predict novel scientific results better than experts?

BrainBench benchmarks show fine-tuned LLMs outperform neuroscience experts at predicting which experimental results actually occurred. The same pattern-integration tendency that causes hallucination in retrieval tasks enables genuine prediction in forward-looking scenarios.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether newer LLM generations improve forecasting ability without task-specific training — a question the field treated as settled ~2024–2025. A curated library (spanning Nov 2023–May 2026) found these dated claims:

**What a curated library found — and when:**
- Retrieval-augmented forecasting systems reach near-human-expert parity on unseen events *after* training cutoff; newer generations lift accuracy without domain tuning (Feb 2024).
- LLMs' forecasting ability is far stronger than benchmarks reveal, but only when numerical and contextual reasoning are decomposed into separate steps — unified prompts hide the capability (Feb 2024).
- Task decomposition (Nexus: contextualize → macro/micro outlook → synthesize) outperforms both time-series models and plain LLMs, suggesting architecture unlocks more than model scaling (May 2026).
- In low-expert-baseline domains (VC founder prediction), untuned LLMs exceed human baselines, with one model reaching 6× market-index precision; in forward-looking science prediction, the pattern completion that shows up as hallucination in retrieval becomes genuine foresight (Feb 2024, Mar 2024).
- Calibration (knowing when to abstain) does NOT scale automatically: small calibrated models match 10× larger uncalibrated ones on conversation forecasting (Feb 2024).
- Genuine constraint-satisfaction tasks plateau at 55–60% regardless of scale or training regime — a hard ceiling, not a gap (Mar 2026).

**Anchor papers (verify; mind their dates):**
- arXiv:2402.18563 (Feb 2024): Approaching Human-Level Forecasting with Language Models
- arXiv:2605.14389 (May 2026): Nexus: An Agentic Framework for Time Series Forecasting
- arXiv:2603.23004 (Mar 2026): Can Large Language Models Reason and Optimize Under Constraints?
- arXiv:2402.03284 (Feb 2024): Deal, or no deal (or who knows)? Forecasting Uncertainty in Conversations

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For the claim that newer generations *automatically* improve forecasting: does recent work (last 6 months, especially post-o1, post-Gemini 2.0, or equivalents) show that scaling alone now relaxes the decomposition requirement? Can newer models unify numerical + contextual reasoning without pipeline restructuring? Separate the durable question (whether generational lift is *necessary* for forecasting) from the perishable limitation (whether decomposition is *sufficient* without it). Ground any relaxation in specific benchmarks or real-world tasks.

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. If recent papers claim end-to-end forecasting without decomposition, or show calibration *does* scale, cite them and explain the reconciliation.

(3) **Propose 2 research questions that ASSUME the regime may have moved:** One assuming decomposition is now *optional* for frontier models; one testing whether calibration has become an emergent property of sufficiently large or instruction-tuned systems.

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
