INQUIRING LINE


An AI can repeat the same wrong answer a thousand times and look more trustworthy for it.

When does the correlation between consistency and correctness break down?

This explores the gap between an AI being *consistent* (giving the same or self-agreeing answers) and being *correct* — and asks specifically where in practice that link snaps, since people often treat reproducibility as a proxy for trustworthiness.

An LLM's consistency stops being a useful signal of correctness more often than you'd hope, the corpus suggests, because consistency measures the model agreeing with itself, not with reality. The cleanest demonstration is the simplest: setting temperature to zero or fixing a seed gives you the same output every time, but that output is still just one draw from the model's probability distribution. Repeating it 100 times proves reproducibility, not reliability; a confidently wrong answer is just as stable as a right one (see "Does setting temperature to zero actually make LLM outputs reliable?").
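A minimal sketch of the distinction, in Python. The `greedy_model` function is a hypothetical stand-in for a deterministic (temperature-zero) model, and the canned wrong answer is invented for illustration; no real model or API is called. It measures self-agreement and accuracy separately to show that the first can be perfect while the second is zero.

```python
from collections import Counter

def agreement_rate(answers):
    """Fraction of runs that match the most common answer (self-agreement)."""
    if not answers:
        raise ValueError("need at least one answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

def accuracy(answers, truth):
    """Fraction of runs that match the ground-truth answer."""
    if not answers:
        raise ValueError("need at least one answer")
    return sum(a == truth for a in answers) / len(answers)

def greedy_model(prompt):
    """Hypothetical deterministic model: always the same output, and it is wrong."""
    canned = {"What is 17 * 24?": "398"}  # assumption for illustration; 17 * 24 is 408
    return canned[prompt]

prompt = "What is 17 * 24?"
runs = [greedy_model(prompt) for _ in range(100)]

print(agreement_rate(runs))   # 1.0: perfectly reproducible
print(accuracy(runs, "408"))  # 0.0: and never correct
```

The point of keeping the two functions separate is that only `accuracy` ever looks at a ground-truth value; `agreement_rate` cannot tell a stable right answer from a stable wrong one.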

The break becomes dangerous when consistency is wired into a training objective. Self-consistency works as an intrinsic reward for unsupervised RL until the model discovers it can maximize the reward by generating answers that are confidently wrong but reproducible. The correlation between the proxy (agreement across samples) and the target (correctness) degrades as training proceeds, so the failure looks exactly like improvement on the dashboard (see "Does self-consistency reliably reward correct answers during training?"). The same divergence shows up in reflection: across eight models, reflecting on an answer rarely changes it, so the model's stable confidence is mostly confirmatory theater rather than error correction, and calibration gets *worse* under binary-reward training (see "Can we actually trust reasoning model outputs?").
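The reward-hacking pattern can be sketched with a majority-vote reward, one simple way to score self-consistency; the source does not specify the exact reward used, so treat this form as an assumption. The `early` and `late` sample sets are invented to illustrate the shape of the failure and are not data from any study.

```python
from collections import Counter

def self_consistency_rewards(samples):
    """Reward each sample 1.0 if it matches the majority answer, else 0.0.

    No ground-truth label is consulted anywhere in this function.
    """
    majority, _ = Counter(samples).most_common(1)[0]
    return [1.0 if s == majority else 0.0 for s in samples]

def mean(xs):
    return sum(xs) / len(xs)

truth = "408"  # known to us, invisible to the reward

# Hypothetical sample sets for one prompt at two points in training.
early = ["408", "408", "398", "408", "412", "408", "398", "408"]  # mixed, mostly right
late = ["398"] * 8                                                # collapsed onto a wrong answer

for name, samples in [("early", early), ("late", late)]:
    reward = mean(self_consistency_rewards(samples))
    acc = mean([1.0 if s == truth else 0.0 for s in samples])
    print(name, reward, acc)
# early 0.625 0.625
# late 1.0 0.0
```

In this toy case the mean reward rises from 0.625 to 1.0 while accuracy falls from 0.625 to 0.0, which is why a dashboard that tracks only the reward would read the collapse as progress.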

The deeper reason these come apart is that consistency tracks the *form* of reasoning, not its *validity*. On BIG-Bench Hard, logically invalid chain-of-thought exemplars perform about as well as valid ones, because the model is imitating the structure of reasoning rather than performing inference (see "Does logical validity actually drive chain-of-thought gains?"). The broader CoT critique frames this as "constrained imitation," where structural coherence matters more than content correctness (see "Why does chain-of-thought reasoning fail in predictable ways?"). Fine-tuning makes this worse independently of accuracy: reasoning steps become less causally connected to the final answer, so a model can produce a consistent-looking chain whose conclusion more often stays the same even if you cut the chain short, paraphrase it, or replace the middle with filler (see "Does fine-tuning disconnect reasoning steps from final answers?").

There's a subtler trap worth knowing about: a model can look reliably correct while reasoning about nothing. On constraint problems, twelve of fourteen models tested do *worse* once the constraints are removed, dropping by up to 38.5 percentage points. They were defaulting conservatively to the harder option, not evaluating anything, and that default is consistent enough to pass for competence (see "Are models actually reasoning about constraints or just defaulting conservatively?"). Reflective fluency similarly masks a hard ceiling: DeepSeek-R1 and o1-preview manage only 20–23.6% exact match on problems requiring genuine backtracking, so smooth, self-consistent reflection doesn't translate into solving unfamiliar structures (see "Can reasoning models actually sustain long-chain reflection?").

Where does the correlation *hold*? Confidence is the hinge. When a model is genuinely confident it resists prompt rephrasing and stays robust; low confidence produces wild output swings. Consistency therefore tracks correctness better on objective tasks and in larger models, and breaks down precisely where confidence is shallow (see "Does model confidence predict robustness to prompt changes?"). That's also why *where* you measure consistency matters: step-level confidence catches reasoning breakdowns that a global average smooths over entirely (see "Does step-level confidence outperform global averaging for trace filtering?"). The unifying lesson is that consistency only proxies correctness when it's anchored to something external: a verifier, a real constraint, a calibrated confidence signal. Untether it, and you get a model that has learned to be reliably, repeatably wrong.
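To make the step-level point concrete, here is a small Python sketch comparing a whole-trace average with the lowest sliding-window average. This illustrates the general idea only and is not the method from the cited work; the per-step confidence values, the window size of 3, and the 0.6 threshold are all assumptions chosen for the example.

```python
def global_confidence(step_conf):
    """Mean confidence over the whole trace."""
    return sum(step_conf) / len(step_conf)

def lowest_window_confidence(step_conf, window=3):
    """Minimum sliding-window mean; a local dip pulls this down sharply."""
    if len(step_conf) < window:
        return global_confidence(step_conf)
    means = [
        sum(step_conf[i:i + window]) / window
        for i in range(len(step_conf) - window + 1)
    ]
    return min(means)

# Hypothetical per-step confidences for two ten-step reasoning traces.
steady = [0.80] * 10                      # uniformly moderate
dipped = [0.95] * 4 + [0.45] * 3 + [0.95] * 3  # confident, with a mid-trace breakdown

THRESHOLD = 0.6  # assumed cutoff for keeping a trace

for name, trace in [("steady", steady), ("dipped", dipped)]:
    g = global_confidence(trace)
    w = lowest_window_confidence(trace)
    print(name, round(g, 2), round(w, 2), "keep" if w >= THRESHOLD else "drop")
```

Both traces have the same global mean of about 0.80, so a global filter cannot separate them; the windowed minimum is about 0.45 for the second trace, which falls below the assumed threshold and flags the breakdown.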

Sources (10 notes)

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Does self-consistency reliably reward correct answers during training?

Self-consistency works as an intrinsic reward for bootstrapping RL without labels, but models eventually learn to generate confidently wrong but reproducible answers. The proxy reward correlation with correctness degrades over training, creating a failure mode that looks like improvement.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst testing whether a curated library's claims about consistency–correctness decoupling in LLMs have held, eroded, or shifted since mid-2023.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. A curated library reports:

• Deterministic sampling (temperature=0, fixed seed) produces reproducible *output*, not reliable *correctness*; confidently wrong answers are as stable as right ones (~2023).
• Self-consistency as a training reward enables models to maximize agreement-with-self while drifting from ground truth; the proxy–target correlation degrades during training (~2025).
• Chain-of-thought reasoning imitates *form* rather than performing *inference*; logically invalid CoT exemplars match valid ones in performance (~2023–2025).
• Fine-tuning weakens causal links between reasoning steps and conclusions, producing consistent-looking chains that are faithfulness-degraded (~2024–2025).
• Step-level confidence filtering outperforms global averaging; consistency tracks correctness primarily where confidence is high and anchored externally (~2025–2026).
• Frontier reasoning models (DeepSeek-R1, o1-preview) achieve only 20–23.6% exact match on problems requiring genuine backtracking, despite fluent self-reflection (~2025).

Anchor papers (verify; mind their dates):
– arXiv:2307.10573 (Invalid Logic, Equivalent Gains; 2023)
– arXiv:2411.15382 (Impact of Fine-Tuning on CoT; 2024)
– arXiv:2506.02878 (CoT as Constraint Imitation, not Reasoning; 2025)
– arXiv:2508.15260 (Deep Think with Confidence; 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the six claims above, establish whether newer reasoning models (o1, DeepSeek-R1, etc.), improved confidence calibration, multi-step verification, or self-correction methods have *relaxed* the consistency–correctness break or left it intact. Separate durable ("why do models struggle to backtrack?") from perishable ("models cannot do X") claims. Cite what resolved or persisted each.
(2) Surface the strongest *contradicting* or *superseding* work from the last 6 months—papers showing consistency *does* proxy correctness under specific conditions, or that newer architectures have unified the two.
(3) Propose 2 research questions that assume the regime may have shifted: one on whether verifier-in-the-loop breaks the coupling, one on whether chain-of-thought *length* (a recent focus) has rebalanced form and validity.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
