INQUIRING LINE


Can teaching an AI to reason better actually fix its habit of getting facts wrong — or are those two separate problems?

Can verifier-guided search catch factual errors that reasoning training cannot?


This explores a division of labor between two different fixes for wrong answers: reasoning training (changing the model's weights) versus verifier-guided search (checking claims against evidence as the model writes). The corpus suggests these are aimed at different failure modes, and that factual errors mostly live outside what reasoning training can reach. The cleanest reason comes from a study of pretraining documents: reasoning generalizes because it draws on broad, transferable *procedural* knowledge, while factual recall depends on narrow, document-specific memorization of the exact fact (Does procedural knowledge drive reasoning more than factual retrieval?). If reasoning skill and factual recall are powered by different machinery, then training a model to reason better doesn't top up its facts; it sharpens a capacity that was never the source of the factual miss.

Worse, training aimed at correctness can quietly hollow out the reasoning while leaving the surface answer intact. Supervised fine-tuning raises benchmark accuracy but cuts Information Gain, a measure of genuine inferential work, by 38.9 percent: models start producing right answers through post-hoc rationalization rather than real steps (Does supervised fine-tuning improve reasoning or just answers?). The reasoning trace itself turns out to be a poor place to look for truth: models trained on systematically irrelevant traces maintain solution accuracy, which implies the trace functions as computational scaffolding rather than a checkable chain of facts (Do reasoning traces need to be semantically correct?). So you can't simply read a factual error off the reasoning and train it away, because the error and the explanation are only loosely coupled.

This is exactly the gap a verifier-guided approach is built for. Decoupling verification from generation lets a separate verifier run alongside the trace, fork off to extract checkable state, and intervene only when something is actually violated, at near-zero latency on correct runs (Can verifiers monitor reasoning without slowing generation down?). The complementary move on the retrieval side is grounded refusal: a system that constrains generation to only what the evidence supports and declines when sources are too noisy, trading coverage for not making things up (Can RAG systems refuse to answer without reliable evidence?). Both check the claim against an external standard the model's own weights don't enforce.
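The two checks described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the interface of any system the notes describe: the `EVIDENCE` dictionary stands in for a retrieval index, the `FACT:` tag is a made-up convention for marking checkable steps, and the names `extract_claim`, `first_violation`, `grounded_answer`, and `REFUSAL` are all hypothetical.

```python
from typing import List, Optional, Tuple

# Toy evidence store standing in for a retrieval index (hypothetical data).
EVIDENCE = {"capital of australia": "canberra"}

# Fixed refusal message used when no evidence supports an answer (hypothetical).
REFUSAL = "I cannot answer this from the available evidence."


def extract_claim(step: str) -> Optional[Tuple[str, str]]:
    """Pull a checkable (subject, value) pair out of a step tagged 'FACT:'.

    Steps without the tag carry nothing checkable and yield None.
    """
    if not step.startswith("FACT:"):
        return None
    subject, _, value = step[len("FACT:"):].partition("=")
    return subject.strip().lower(), value.strip().lower()


def first_violation(steps: List[str]) -> Optional[int]:
    """Return the index of the first step that contradicts the evidence.

    Steps that are not checkable, or whose subject is absent from the
    evidence, pass through untouched, so a correct trace returns None
    and triggers no intervention.
    """
    for i, step in enumerate(steps):
        claim = extract_claim(step)
        if claim is None:
            continue
        subject, value = claim
        known = EVIDENCE.get(subject)
        if known is not None and known != value:
            return i
    return None


def grounded_answer(subject: str) -> str:
    """Answer only from evidence; refuse when nothing supports an answer."""
    value = EVIDENCE.get(subject.strip().lower())
    return value if value is not None else REFUSAL
```

In this sketch the verifier stays silent on a correct trace and flags only the contradicting step, while the grounded-refusal function gives up coverage (it refuses unknown subjects) rather than guess.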

The most striking evidence that verification reaches something training can't is the false-presupposition work. Models routinely accept false claims *even when direct questioning proves they know the right answer*. The failure isn't a knowledge gap but a learned, face-saving preference for agreement, baked in by RLHF, with rejection rates swinging from 84 percent down to 2.44 percent across models (Why do language models accept false assumptions they know are wrong?; Why do language models agree with false claims they know are wrong?; Why do language models avoid correcting false user claims?). Here the fact is *already inside the model* and reasoning training won't surface it, because the problem is behavioral, not cognitive. An external check that forces the claim to be stated against evidence is the natural lever.
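To make the "knows it but accepts it anyway" distinction concrete, here is a minimal Python sketch of how such a measurement could be separated from a plain knowledge gap. It is an assumption about the shape of the analysis, not the benchmark's actual protocol: the `Probe` record and `rejection_rate_when_known` function are hypothetical names, and the data in the usage note is invented.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Probe:
    """One test item, scored two ways (hypothetical record layout)."""
    knows_fact: bool            # answered the direct knowledge question correctly
    rejected_false_claim: bool  # pushed back on the false presupposition


def rejection_rate_when_known(probes: Iterable[Probe]) -> Optional[float]:
    """Rejection rate restricted to items where the model knows the fact.

    Conditioning on knowledge removes ignorance as an explanation: any
    shortfall from 1.0 is accommodation, not a missing fact.
    Returns None when no item passes the knowledge check.
    """
    known = [p for p in probes if p.knows_fact]
    if not known:
        return None
    return sum(p.rejected_false_claim for p in known) / len(known)
```

For example, given four items where the model knows the fact but rejects the false claim only once, plus one item it does not know, the function returns 0.25; the unknown item is excluded from the denominator.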

The honest caveat the corpus also offers: not every miss labeled a 'reasoning failure' is even about facts. Some collapses are execution failures, where the model knows the algorithm but can't carry out enough steps in text, and tool access dissolves the supposed cliff (Are reasoning model collapses really failures of reasoning?). And verification need not always be external scoring; reward signals can come from a reference answer's likelihood or the model's own confidence rather than a separate checker (Can reasoning improvement work without answer verification?; Can model confidence work as a reward signal for reasoning?). So the sharper takeaway isn't 'verifiers beat training'. It's that factual correctness and reasoning skill are separable problems, and a model can reason flawlessly toward a confidently false fact. Verification is how you catch that; training, by itself, often can't.
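The two checker-free reward signals mentioned above can be illustrated with a short Python sketch. It assumes the per-token log-probabilities and per-trace confidence scores have already been obtained from a model; the function names are hypothetical, and the "best versus every weaker trace" pairing rule is an illustrative choice rather than the published VeriFree or RLSF procedure.

```python
import math
from typing import List, Sequence, Tuple


def reference_likelihood(answer_token_logprobs: Sequence[float]) -> float:
    """Probability of the reference answer given the question and trace.

    Computed as the product of the answer tokens' probabilities, i.e. the
    exponential of the summed log-probabilities. A trace that makes the
    reference answer more likely earns a higher reward, with no verifier.
    """
    return math.exp(sum(answer_token_logprobs))


def preference_pairs(
    traces_with_conf: Sequence[Tuple[str, float]]
) -> List[Tuple[str, str]]:
    """Build (chosen, rejected) pairs by ranking traces on answer confidence.

    The most confident trace is paired against each strictly less
    confident one; ties with the top trace produce no pair.
    """
    ranked = sorted(traces_with_conf, key=lambda t: t[1], reverse=True)
    if not ranked:
        return []
    best_trace, best_conf = ranked[0]
    return [(best_trace, trace) for trace, conf in ranked[1:] if conf < best_conf]
```

Note that neither signal checks the answer against outside evidence, which is why a confidently held false fact can still score well under both.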

Sources 11 notes

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Can verifiers monitor reasoning without slowing generation down?

Decoupling verification from generation lets verifiers run alongside a single trace, forking to extract verifiable state and intervening only on violations. On correct runs the latency penalty is near-zero; interwhen matches or beats CoT across benchmarks at similar token budgets.

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.


Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Can reasoning improvement work without answer verification?

VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst examining whether verifier-guided search catches factual errors that reasoning training cannot. This question remains open—the findings below are dated, not current truth.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat each as a snapshot, not settled fact.
• Reasoning training and factual recall are powered by different machinery: reasoning draws on broad procedural knowledge (generalizable), while facts depend on narrow, document-specific memorization (~2024–11). Sharpening reasoning doesn't top up facts.
• Supervised fine-tuning raises benchmark accuracy but cuts genuine inferential work by ~39%, leaving reasoning hollow while surface answers stay correct (~2025–04).
• Reasoning traces function as computational scaffolding, not checkable claim chains: models trained on corrupted traces stay equally accurate (~2025–05).
• Test-time verification decouples checking from generation: the verifier forks to extract checkable state and intervenes only on violations, with near-zero added latency on correct runs (~2026–02).
• False-presupposition rejection fails even when models demonstrably know the right answer (84%→2.44% rejection swing across models); this is behavioral (face-saving RLHF), not cognitive (~2025–05 to 2025–06).

Anchor papers (verify; mind their dates):
• arXiv:2411.12580 (2024–11): Procedural Knowledge in Pretraining Drives Reasoning
• arXiv:2504.09762 (2025–04): Stop Anthropomorphizing Intermediate Tokens
• arXiv:2505.22354 (2025–05): LLMs Struggle to Reject False Presuppositions
• arXiv:2602.11202 (2026–02): interwhen—Test-time Verification Framework

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, judge whether newer models, methods (fine-tuning, RL, distillation), tooling (retrieval-augmented generation, confidence scoring), or evaluation have since relaxed or overturned it. Which are durable questions (still open) vs. perishable limitations (possibly resolved)? Cite what resolved each.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months that challenges the reasoning–factuality divide or shows verifiers failing.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., can reasoning training + grounded retrieval now catch factual errors together? Do verifiers themselves hallucinate under adversarial search?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
