INQUIRING LINE

Numbers tell an AI it was wrong, but only words can explain why — and that gap turns out to matter a lot.

Why does combining natural language with numerical scores improve prediction accuracy?

This explores why pairing words with numbers — critiques alongside reward scores, narrative context alongside time-series values — tends to beat either signal alone, and the corpus traces it to a single idea: a number tells you *what*, but language tells you *why* and *in what context*.

The corpus keeps returning to one explanation: a scalar score is informationally thin. It can tell a model that it was wrong, but not why it was wrong or how to fix it. The clearest demonstration is in reinforcement learning, where models stall on plateaus that more numerical reward alone does not break through. When the same models are given chain-of-thought critiques, instead of or alongside the score, they start producing correct solutions, because the language carries the diagnostic content about the *mechanism* of failure that a single number compresses away (see "Can natural language feedback overcome numerical reward plateaus?").
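
To make the "informationally thin" point concrete, here is a minimal sketch, not the Critique-GRPO implementation. The names `Feedback`, `grade`, and `build_refinement_prompt`, and the binary reward, are assumptions made for illustration. It shows two different failures receiving the same scalar reward while their critiques remain distinguishable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    """Hypothetical feedback record pairing a scalar score with a critique."""
    reward: float   # assumed binary: 1.0 if the final answer is correct, else 0.0
    critique: str   # natural-language diagnosis; empty when the answer is correct


def grade(expected: int, answer: int, critique: str = "") -> Feedback:
    """Score an answer; keep the critique only when the answer is wrong."""
    reward = 1.0 if answer == expected else 0.0
    return Feedback(reward=reward, critique="" if reward == 1.0 else critique)


def build_refinement_prompt(question: str, attempt: str, fb: Feedback) -> str:
    """Assemble a retry prompt that carries both the score and the critique."""
    lines = [
        f"Question: {question}",
        f"Previous attempt: {attempt}",
        f"Score: {fb.reward}",
    ]
    if fb.critique:
        lines.append(f"Critique: {fb.critique}")
    lines.append("Revise the attempt.")
    return "\n".join(lines)


# Two different mistakes on the same question (17 + 25 = 42).
carry_error = grade(42, 40, "Dropped the carry in the tens column.")
digit_swap = grade(42, 24, "Reversed the digits when copying the result.")

# The scalar cannot tell these failures apart; the critiques can.
same_score = carry_error.reward == digit_swap.reward
different_diagnosis = carry_error.critique != digit_swap.critique
```

With the score alone, both attempts look identical to the learner. Only the critique text says which step to repair.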

The forecasting work reframes the same point as a division of labor. Numerical extrapolation and contextual reasoning are genuinely different cognitive jobs, and forcing one model to do both at once muddies both. When the architecture *separates* them (a stage for crunching the numbers, a stage for the event-driven narrative, then a synthesis), accuracy rises past both pure time-series models and pure language models (see "Can decomposing forecasting into stages unlock numerical and contextual reasoning?"). The striking follow-on is that this is not the model getting smarter: the forecasting ability was latent all along and only *surfaced* once the workflow stopped blending the two signals into one monolithic prompt (see "Can LLMs actually forecast time series better than we think?"). So combining the modalities helps partly because keeping them distinct lets each be done well before they are joined.
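
The division of labor can be sketched as three small functions. This is an illustrative toy under stated assumptions, not the Nexus pipeline: the numeric stage is a plain least-squares trend, the contextual stage is a hand-written table of per-step multipliers standing in for a language model's reading of events, and all function names are hypothetical.

```python
from statistics import mean


def numeric_outlook(history: list[float], horizon: int) -> list[float]:
    """Stage 1: extrapolate a least-squares linear trend from the values alone.

    Requires at least two history points so the slope is defined.
    """
    n = len(history)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(history)
    denom = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, history)) / denom
    intercept = y_bar - slope * x_bar
    return [intercept + slope * (n + h) for h in range(horizon)]


def contextual_outlook(event_effects: dict[int, float], horizon: int) -> list[float]:
    """Stage 2: per-step multiplicative adjustments derived from event context.

    `event_effects` maps a forecast step to an assumed multiplier; in a real
    system a language model would produce these from the narrative context.
    """
    return [event_effects.get(h, 1.0) for h in range(horizon)]


def synthesize(numeric: list[float], context: list[float]) -> list[float]:
    """Stage 3: join the two outlooks only after each was computed separately."""
    return [value * adjustment for value, adjustment in zip(numeric, context)]


history = [10.0, 12.0, 14.0, 16.0]
baseline = numeric_outlook(history, horizon=2)
# Assumption for the example: an announced promotion lifts step 1 by 50%.
adjustments = contextual_outlook({1: 1.5}, horizon=2)
forecast = synthesize(baseline, adjustments)
```

The design choice worth noting is that neither stage sees the other's input: the trend fit never reads the event notes, and the event adjustment never touches the raw series.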

There is a deeper reason language is such a potent partner to numbers here: it is the substrate where a model integrates patterns across everything it knows. The same pattern-completion tendency that produces hallucination on backward-looking retrieval becomes genuine predictive power on forward-looking tasks. Fine-tuned LLMs out-predict neuroscience experts on which experimental results actually occurred, because language lets them fuse scattered contextual cues that a bare number never encodes (see "Can LLMs predict novel scientific results better than experts?"). In the same spirit, models fine-tuned on psychology-experiment data beat theory-driven models at predicting human choices, because the learned representation captures individual differences that a clean numerical model throws away (see "Can language models learn to model human decision making?").

But the corpus also adds an important caution: the numbers are not just decoration on the words either. A well-calibrated confidence score can itself be a *teaching* signal. Using a model's own answer-span confidence to rank reasoning traces both sharpens reasoning and repairs the calibration that human-feedback training tends to degrade (see "Can model confidence work as a reward signal for reasoning?"). And a small model trained to attach an honest number to its own uncertainty, so that it knows when to abstain and when to retrieve, can match models ten times its size (see "Can models learn to abstain when uncertain about predictions?" and "Can simple uncertainty estimates beat complex adaptive retrieval?"). So the gain from combining is not that language is rich and numbers are crude; it is that the two carry *non-overlapping* information. Language carries causes and context, numbers carry magnitude and calibrated confidence, and prediction needs both.
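
A minimal sketch of the two uses of confidence described above, under loud assumptions: `answer_confidence` is taken to be some probability-like score in [0, 1] for the answer span, and `min_gap` and `threshold` are hypothetical knobs. This is not the RLSF algorithm or any specific paper's abstention rule, only the general shape of ranking traces into preference pairs and abstaining below a cutoff.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass(frozen=True)
class Trace:
    """One sampled reasoning trace with a confidence for its final answer."""
    reasoning: str
    answer: str
    answer_confidence: float  # assumed to lie in [0, 1]


def preference_pairs(
    traces: list[Trace], min_gap: float = 0.1
) -> list[tuple[Trace, Trace]]:
    """Build (preferred, rejected) pairs by ranking traces on confidence.

    Pairs whose confidence gap is below `min_gap` are skipped, on the
    assumption that near-ties carry little preference signal.
    """
    ranked = sorted(traces, key=lambda t: t.answer_confidence, reverse=True)
    return [
        (better, worse)
        for better, worse in combinations(ranked, 2)
        if better.answer_confidence - worse.answer_confidence >= min_gap
    ]


def answer_or_abstain(trace: Trace, threshold: float = 0.5) -> Optional[str]:
    """Return the answer if confidence clears the threshold, else abstain (None)."""
    return trace.answer if trace.answer_confidence >= threshold else None


samples = [
    Trace("checks units, then divides", "12", 0.90),
    Trace("divides without checking units", "12", 0.60),
    Trace("guesses from the first number", "7", 0.55),
]
pairs = preference_pairs(samples)       # synthetic preferences, no human labels
decision = answer_or_abstain(samples[2], threshold=0.7)  # low confidence: abstain
```

The resulting pairs could feed a preference-based training objective, and the same number gates whether the model answers at all. Both uses depend on the confidence being reasonably calibrated in the first place.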

It is worth knowing where this can go wrong: language is a leaky channel. Models often track the *surface frequency* of phrasings rather than their meaning (see "Do language models really understand meaning or just surface frequency?"), and strong training-time associations can override the very context they were given (see "Why do language models ignore information in their context?"). That is exactly why grounding language in a numerical signal (a reward, a calibrated confidence, a verifiable score) helps: the number anchors the words to something the model cannot fluently talk its way around.

Sources 10 notes

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can decomposing forecasting into stages unlock numerical and contextual reasoning?

Nexus outperforms pure TSFM and LLM baselines on real-world datasets by decomposing forecasting into contextualization, dual-resolution macro/micro outlook, and synthesis stages. Separating numerical extrapolation from event-driven contextual reasoning avoids forcing one model to handle both simultaneously.

Can LLMs actually forecast time series better than we think?

LLMs have stronger intrinsic forecasting ability than recognized, but only when workflows separate numerical reasoning from contextual reasoning. Monolithic prompting obscures this capability; structured decomposition surfaces it.

Can LLMs predict novel scientific results better than experts?

BrainBench benchmarks show fine-tuned LLMs outperform neuroscience experts at predicting which experimental results actually occurred. The same pattern-integration tendency that causes hallucination in retrieval tasks enables genuine prediction in forward-looking scenarios.

Can language models learn to model human decision making?

LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Do language models really understand meaning or just surface frequency?

LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (match 2.52)
Nexus: An Agentic Framework for Time Series Forecasting (match 1.77)
Post-Training Large Language Models via Reinforcement Learning from Self-Feedback (match 1.73)
Understanding and Mitigating Premature Confidence for Better LLM Reasoning (match 1.72)
Reported Confidence in LLMs Tracks Commitment More Than Correctness (match 1.70)
RM-R1: Reward Modeling as Reasoning (match 1.70)
Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse (match 1.69)
Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds (match 1.69)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about why natural language + numerical signals improve prediction. The question remains open: what is the *mechanism* — information complementarity, calibration correction, workflow decomposition, or something else — and does it still hold under current models and training regimes?

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–May 2026. Key constraints the library reports:
• Numerical scores alone hit performance plateaus in RL; chain-of-thought critiques break through by carrying diagnostic info about *why* a model failed, not just that it failed (2025-06, arXiv:2506.03106).
• Forecasting ability is latent and only surfaces when numerical and linguistic reasoning are *separated* into distinct stages rather than blended in one prompt; architecturally decomposing signals lifts accuracy past single-modality baselines (2024-02, arXiv:2402.03284).
• Language as pattern-completion can generalize on forward-looking tasks (e.g., neuroscience predictions) where the same mechanism would hallucinate on backward-looking retrieval (2024-03, arXiv:2403.03230).
• Models' own answer-span confidence used as intrinsic reward simultaneously sharpens reasoning *and* repairs calibration damage from human feedback (2025-06, arXiv:2506.03106).
• High-frequency textual phrasings systematically override contextual intent; numerical anchors prevent models from "talking around" grounding (2026-04, arXiv:2604.02176).

Anchor papers (verify; mind their dates):
• arXiv:2506.03106 (2025-06) — Critique-GRPO: numerical + NL feedback in RL.
• arXiv:2402.03284 (2024-02) — Forecasting decomposition; latent ability thesis.
• arXiv:2403.03230 (2024-03) — LLM > humans on neuroscience; language as context fusion.
• arXiv:2604.02176 (2026-04) — Adam's Law: surface frequency bias in LLMs.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the RL plateau claim: have recent models (o1, o3, post-training RL methods like arXiv:2507.21931) *eliminated* plateaus or merely shifted them higher? For the workflow decomposition finding: does end-to-end fine-tuning now recover the gains without explicit separation? For the calibration repair via confidence: do GRPO or newer techniques hold or break under distribution shift? Separate what is still genuinely an open bottleneck from what may be solved by scale or orchestration.
(2) Surface the *strongest disagreement*: does any recent work (last 6 mo.) argue that the modality split is false — that a unified model with richer numerical representations (or richer language) makes the decomposition claim obsolete?
(3) Propose 2 research questions assuming the regime has moved: (a) Under agentic/multi-turn orchestration with memory and tool-use, do language + numbers still help, or does the *interaction structure* now matter more? (b) If new tokenization or training methods let models represent uncertainty natively in hidden states, does the need for explicit linguistic + numerical signals diminish?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
