INQUIRING LINE

How much does domain expertise actually improve human forecasting under uncertainty?

This reads the question backwards through the corpus: the library doesn't measure human expertise directly, but it repeatedly tests where machines beat experts — and those results expose exactly how thin the human expert edge gets under genuine uncertainty.


This explores how much a human expert's edge holds up when the future is genuinely uncertain, and the collection answers it sideways, by showing where machines now match or beat those experts. The pattern is striking: the expert advantage shrinks fastest precisely in the domains where forecasting is hardest. In founder-success and venture prediction, where signal is sparse and experts only modestly beat chance, even an untuned model clears the human bar: on VCBench, DeepSeek-V3 reached six times the market-index precision (see "Can language models beat human venture capital experts?"). The lesson isn't that machines are brilliant; it's that human expertise under high uncertainty was never as decisive as its credentials imply.
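To make "six times market-index precision" concrete, here is a minimal sketch of how such a multiple can be computed. It assumes the baseline is the precision you would get by backing every founder in the pool (the base rate of success), which is an assumption about how the benchmark defines its market index. The function names and the toy data are hypothetical and are not taken from VCBench.

```python
def precision(predictions, outcomes):
    """Fraction of positive picks that actually succeeded."""
    picked = [o for p, o in zip(predictions, outcomes) if p]
    if not picked:
        raise ValueError("no positive predictions; precision is undefined")
    return sum(picked) / len(picked)


def precision_lift(predictions, outcomes):
    """Precision of the picks divided by the base rate of success.

    The base rate stands in for the 'market index' here (an assumption):
    it is the precision of a strategy that backs everyone.
    """
    base_rate = sum(outcomes) / len(outcomes)
    if base_rate == 0:
        raise ValueError("no successes in the pool; lift is undefined")
    return precision(predictions, outcomes) / base_rate


# Hypothetical toy pool: 20 founders, 2 of whom succeed (10% base rate).
outcomes = [1] + [0] * 9 + [1] + [0] * 9
# Hypothetical forecaster: backs two founders, one of whom succeeds.
predictions = [1, 1] + [0] * 18

lift = precision_lift(predictions, outcomes)
print(f"lift over base rate: {lift:.1f}x")
```

On this toy pool the picks are right half the time against a 10% base rate, a five-fold lift. The point of the ratio is that a forecaster can look impressive in relative terms while still being wrong on most individual picks, which is what a sparse-signal domain looks like.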

Where does that leave the expert? The corpus suggests the human edge is real but conditional: it lives in pattern integration, not raw recall. Fine-tuned models out-predict neuroscientists on which experimental results actually occurred, and the very tendency that makes them hallucinate on backward-looking lookups becomes genuine foresight on forward-looking ones (see "Can LLMs predict novel scientific results better than experts?"). Forecasting rewards a willingness to integrate scattered cues into a guess, and that's a different muscle from knowing the literature cold. A retrieval-augmented system came close to parity with competitive human forecasters on real questions published after its training cutoff, sometimes beating the crowd outright (see "Can retrieval-augmented language models forecast like human experts?"), suggesting much of what we call expert judgment is recoverable from good evidence-gathering plus calibrated aggregation.
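As an illustration of what "calibrated aggregation" can mean in practice, the sketch below pools several probability forecasts by averaging them in log-odds space. This is one common pooling choice, used here as an assumption; the notes do not say which aggregation method the cited system used, and the function names and example forecasts are hypothetical.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 to keep the log finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def pool_log_odds(probabilities):
    """Average forecasts in log-odds space and map back to a probability."""
    if not probabilities:
        raise ValueError("need at least one forecast to pool")
    mean_logit = sum(logit(p) for p in probabilities) / len(probabilities)
    return 1 / (1 + math.exp(-mean_logit))


# Hypothetical forecasts for the same yes/no question, e.g. one per
# retrieved evidence set or one per forecaster.
forecasts = [0.6, 0.7, 0.8]
pooled = pool_log_odds(forecasts)
print(f"pooled probability: {pooled:.3f}")
```

Pooling in log-odds space treats a move from 0.90 to 0.99 as a bigger shift in belief than a move from 0.50 to 0.59, which a plain arithmetic mean does not. Forecasts that disagree symmetrically, such as 0.2 and 0.8, cancel to 0.5.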

The more interesting finding is that *how* you reason under uncertainty matters more than how much you know. Forecasting performance jumps when you separate numerical extrapolation from event-driven contextual reasoning rather than forcing one judgment to do both at once (see "Can LLMs actually forecast time series better than we think?"), and decomposing the task into contextualization, a macro/micro outlook, and synthesis beats monolithic approaches (see "Can decomposing forecasting into stages unlock numerical and contextual reasoning?"). This maps onto a known human failure: experts often blur their domain knowledge into their probability estimate and get worse calibration for it. On conversation forecasting, a small model explicitly trained to know when to abstain can match a pre-trained model ten times its size (see "Can models learn to abstain when uncertain about predictions?"). Calibration, the ability to say 'I don't know,' turns out to be the undervalued skill that raw expertise rarely supplies on its own.
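A small sketch can show why abstention and calibration are measured separately from raw accuracy. It scores probability forecasts with the Brier score (mean squared error between forecast and outcome), then rescores only the forecasts that sit far enough from 0.5. The threshold rule is a hypothetical simplification for illustration; the cited work trains models with uncertainty-aware objectives rather than applying a fixed cutoff. All data below is made up.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def selective_brier(probs, outcomes, min_confidence):
    """Score only forecasts at least `min_confidence` away from 0.5.

    Returns (brier_on_answered, coverage). Coverage is the share of
    questions the forecaster did not abstain on.
    """
    kept = [(p, o) for p, o in zip(probs, outcomes)
            if abs(p - 0.5) >= min_confidence]
    coverage = len(kept) / len(probs)
    if not kept:
        return None, coverage  # abstained on everything
    score = sum((p - o) ** 2 for p, o in kept) / len(kept)
    return score, coverage


# Hypothetical forecasts and resolved outcomes for five questions.
probs = [0.9, 0.8, 0.55, 0.45, 0.1]
outcomes = [1, 1, 0, 1, 0]

full = brier_score(probs, outcomes)
answered, coverage = selective_brier(probs, outcomes, min_confidence=0.2)
print(f"Brier on all questions:      {full:.3f}")
print(f"Brier on answered questions: {answered:.3f} at coverage {coverage:.0%}")
```

On this toy data, abstaining on the two near-coin-flip forecasts lowers the Brier score on the answered questions but leaves only three of five questions covered. That trade-off between error and coverage, not the specific numbers, is what an abstention-trained forecaster is optimizing.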

There's a deeper warning here about where expertise comes from. Competence that's trained only on expert demonstrations is capped by what the curators could imagine: such systems can't learn from their own failures or generalize past the demonstrated cases (see "Can agents learn beyond what their training data shows?"). That's a mirror for human expertise too: deep domain training can lock you into the scenarios your field has already seen, which is the opposite of what uncertain forecasting demands. And even the act of acquiring domain knowledge carries hidden costs: adaptation methods that boost in-domain performance often quietly degrade reasoning faithfulness and flexibility (see "How do domain training techniques actually reshape model behavior?"). Expertise can buy depth at the price of the adaptability that forecasting the genuinely novel requires.

So the honest synthesis: domain expertise improves forecasting less than we assume, and the gap is widest exactly where it should matter most, under deep uncertainty with sparse signal. What actually moves the needle is structured reasoning, evidence retrieval, and calibrated humility about what you don't know. If you want to go deeper on the surprising flip where prediction and hallucination are the same mechanism, start with "Can LLMs predict novel scientific results better than experts?"; if you want the calibration angle, "Can models learn to abstain when uncertain about predictions?" is the doorway.


Sources (8 notes)

Can language models beat human venture capital experts?

VCBench shows several LLMs exceed human baselines in founder-success prediction, with DeepSeek-V3 achieving 6× market-index precision. In sparse-signal forecasting where experts only modestly beat chance, even raw LLM capability suffices to clear the human bar.

Can LLMs predict novel scientific results better than experts?

BrainBench benchmarks show fine-tuned LLMs outperform neuroscience experts at predicting which experimental results actually occurred. The same pattern-integration tendency that causes hallucination in retrieval tasks enables genuine prediction in forward-looking scenarios.

Can retrieval-augmented language models forecast like human experts?

A retrieval-augmented LM system achieved near-parity with competitive human forecasters on real forecasting questions published after model training cutoffs, sometimes surpassing human crowds. Newer model generations naturally improved forecasting without domain-specific tuning.

Can LLMs actually forecast time series better than we think?

LLMs have stronger intrinsic forecasting ability than recognized, but only when workflows separate numerical reasoning from contextual reasoning. Monolithic prompting obscures this capability; structured decomposition surfaces it.

Can decomposing forecasting into stages unlock numerical and contextual reasoning?

Nexus outperforms pure TSFM and LLM baselines on real-world datasets by decomposing forecasting into contextualization, dual-resolution macro/micro outlook, and synthesis stages. Separating numerical extrapolation from event-driven contextual reasoning avoids forcing one model to handle both simultaneously.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

How do domain training techniques actually reshape model behavior?

Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about whether domain expertise actually improves human forecasting under genuine uncertainty. The question remains: *where* does the expert edge hold, and *what* actually moves forecasting performance?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as perishable:
• In sparse-signal domains (founder success, VC deals), untuned LLMs already match or exceed human expert precision (~6× market index in one system, 2024–2025).
• Fine-tuned models outpredict neuroscientists on experimental outcomes; the mechanism underlying hallucination in retrieval tasks becomes genuine generalization in forward prediction (2024).
• Retrieval-augmented systems reach competitive human-crowd forecasting on post-training-cutoff questions; much expert judgment is recoverable via evidence aggregation + calibration (2024).
• Task decomposition (contextualization, dual-resolution macro/micro outlook, synthesis) beats monolithic forecasting; separation of numerical extrapolation from event reasoning improves performance (2024–2025).
• Calibrated abstention (knowing when to say 'I don't know') and smaller models with explicit uncertainty training match 10× larger models; calibration matters more than domain depth (2024).

Anchor papers (verify; mind their dates):
• arXiv:2403.03230 (2024-03): LLMs surpass neuroscientists in predicting experimental results.
• arXiv:2402.18563 (2024-02): Approaching human-level forecasting with language models.
• arXiv:2509.14448 (2025-09): VCBench — venture capital prediction benchmarks.
• arXiv:2605.14389 (2026-05): Nexus — agentic multi-task time-series forecasting.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (GPT-4o, o1, Claude 3.5+), methods (chain-of-thought variants, self-correction, scaffolding), tooling (real-time APIs, live data feeds), or multi-agent orchestration have since relaxed or overturned it. Separate the durable question — does domain knowledge matter for forecasting under irreducible uncertainty? — from perishable limitations (e.g., calibration failures in 2024 models). Cite what resolved it; flag what still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — any papers showing domain expertise *does* confer lasting advantage, or that expert-locked training generalizes better than the 2026 library suggests.
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., "Do o1-class models with real-time evidence feeds close the calibration gap that smaller 2024 models showed?" or "Does expert fine-tuning on small, curated datasets outperform retrieval-augmented generalists on time-series forecasting tasks post-2026?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
