INQUIRING LINE

Can language models beat human experts in domains with sparse historical signals?

This line of inquiry explores whether LLMs can outperform human experts specifically where the historical record is thin or under-represented. The corpus suggests the real bottleneck isn't expertise but how densely a domain is represented in training data.


The corpus reframes the question of whether LLMs can beat human experts where the historical signal is sparse: the deciding factor isn't human-versus-machine skill, it's how well the domain is represented in the training data. Where signal is dense, models can genuinely surpass experts. LLMs finetuned on decades of psychology experiments predict human decisions more accurately than theory-driven cognitive models built by specialists (see "Can language models learn to model human decision making?"). That's the optimistic pole: abundant, structured signal lets the model out-predict the people who study the phenomenon.

The sparse pole looks very different. On a Supreme Court overruling benchmark, models systematically degrade on older cases for one reason: the training corpus over-represents recent precedent, leaving shallow representations of historical material (see "Why do language models struggle with historical legal cases?"). This is the direct answer to the question: when the historical signal is thin, performance falls, not because the reasoning is harder but because the data was never there. There's a clean theory for why. Framing LLMs as autoregressive probability machines predicts that low-probability targets are systematically harder even when they're logically trivial (see "Can we predict where language models will fail?"). Rare history is, almost by definition, low-probability under a model trained on such a corpus.
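The probability argument above can be made concrete with a toy calculation. This is a minimal sketch, not the cited researchers' method: the helper name `sequence_logprob` and every per-token probability below are made-up assumptions chosen only to illustrate the chain rule, under which an autoregressive model scores a target as the sum of its per-token log-probabilities.

```python
import math


def sequence_logprob(step_probs):
    """Chain rule: log p(y) = sum over t of log p(y_t | y_<t).

    `step_probs` holds hypothetical conditional probabilities,
    one per token of the target sequence.
    """
    return sum(math.log(p) for p in step_probs)


# Illustrative values only (assumptions, not measurements).
# Both targets have the same length and differ in a single token.
common_target = [0.9, 0.8, 0.9]   # continuation well covered in training
rare_target = [0.9, 0.05, 0.9]    # one token the model rarely saw

lp_common = sequence_logprob(common_target)
lp_rare = sequence_logprob(rare_target)

# The gap depends only on the one differing token: log(0.8 / 0.05).
gap = lp_common - lp_rare

print(f"log p(common) = {lp_common:.3f}")
print(f"log p(rare)   = {lp_rare:.3f}")
print(f"gap           = {gap:.3f}")
```

In this toy case a single rare token makes the whole target 16 times less probable (0.8 / 0.05), even though the two targets are equally easy to state. That is the sense in which a logically trivial answer can still be hard for a model that rarely saw it.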

The tempting fix, just feeding the model the sparse evidence at inference, runs into two walls. Strong parametric priors override supplied context, so the model generates what training taught it rather than what you put in front of it (see "Why do language models ignore information in their context?"). And prompt engineering can only reorganize knowledge already in the model; it cannot inject knowledge that was absent from training in the first place (see "Can prompt optimization teach models knowledge they lack?"). Together these set a hard ceiling: in a genuinely under-represented domain, no clever prompting recovers what the corpus never contained.

So the honest answer is conditional. Models beat experts where signal is dense and they can be adapted to it. Even adaptation has a sweet spot, since domain-training techniques buy performance at the cost of hidden degradation in reasoning faithfulness and transfer (see "How do domain training techniques actually reshape model behavior?"). There's also a capacity floor: on harder representational tasks like classifying argument schemes, only the largest models clear a usable threshold while smaller ones plateau (see "Can large language models classify argument schemes reliably?"). The less obvious point is that 'sparse historical signal' isn't one problem but two stacked ceilings: the data was never sampled, and the model's own priors will quietly fill the gap with the present. Beating experts in those domains may require not a better model but a better corpus.


Sources (7 notes)

Can language models learn to model human decision making?

LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.

Why do language models struggle with historical legal cases?

A Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. The root cause is training-corpus over-representation of recent cases, which creates shallower representations of older precedent.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed the predictions on tasks such as backwards alphabet and letter counting.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

How do domain training techniques actually reshape model behavior?

Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.

Can large language models classify argument schemes reliably?

Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about whether LLMs can beat human experts in sparse-signal domains. The question remains open; treat the findings below as dated constraints to verify, not current truth.

What a curated library found — and when (findings span 2023–2026, note the dates):
• LLMs finetuned on dense, structured historical data (e.g., psychology experiments) *can* out-predict human specialists; but domain-training trades reasoning faithfulness and transfer capability for task performance (~2023–2025).
• On sparse historical signals (e.g., old Supreme Court cases), models systematically degrade because training corpora over-represent recent precedent, leaving shallow representations of rare history (~2024–2026).
• Prompting and in-context evidence cannot inject knowledge absent from training; strong parametric priors override supplied context (~2024).
• Only the largest models cross a usable threshold on hard representational tasks (e.g., argument-scheme classification); smaller models plateau (~2024).
• Out-of-distribution shifts expose sparse representations: the farther a domain drifts from training, the sparser its LLM encoding (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2305.18703 (2023-05): Domain Specialization as the Key to Make Large Language Models Disruptive
• arXiv:2510.20941 (2025-10): Do LLMs Truly Understand When a Precedent Is Overruled?
• arXiv:2603.03415 (2026-03): Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
• arXiv:2502.10708 (2025-02): Injecting Domain-Specific Knowledge into Large Language Models: A Comprehensive Survey

Your task:
(1) RE-TEST EACH CONSTRAINT. For dense-signal domains: have newer scaling laws, mixture-of-experts, or retrieval-augmented generation (RAG) since overturned the performance ceiling? For sparse signals: do adaptive routing, dynamic in-context learning, or post-training correction techniques now recover what the corpus omitted? Separate the durable question (can models beat experts?) from perishable limitations (current model capacity, training-data artifacts).
(2) Surface the strongest work from the last ~6 months that contradicts the "parametric priors override context" or "prompting cannot inject absent knowledge" claims. Does instruction-tuning, synthetic data augmentation, or continual learning now permit knowledge injection?
(3) Propose 2 research questions that assume the regime may have shifted: (a) If RAG + adaptive retrieval now recovers sparse signals, does the expert-vs-model gap collapse, and at what corpus-coverage threshold? (b) Do self-adaptive LLMs such as Transformer² (arXiv:2501.06252) dynamically reweight domain-specific parameters, permitting real-time specialization without static finetuning?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
