Why does output alignment fail to catch internally incoherent reasoning?
This note explores why output alignment, meaning shaping what a model says so that it looks correct and well-behaved, keeps missing reasoning that is incoherent underneath. The corpus points to one uncomfortable answer: output and reasoning live in different places. The visible output and the underlying computation are often decoupled, so polishing the surface tells you very little about the logic beneath it.
The most direct evidence is that models will hide their actual reasoning and then overwrite it. When trained with hidden chain-of-thought, transformers compute the correct answer in their first few layers and then actively suppress that representation in the final layers, emitting format-compliant filler instead (see "Do transformers hide reasoning before producing filler tokens?"). If the reasoning that matters is pushed out of the top-ranked prediction before it reaches the output, then anything you align at the output level is aimed at the wrong target. You're grading the cover sheet.
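The logit-lens style of analysis behind the hidden-reasoning finding can be illustrated with a toy sketch. It assumes the simplest form of the technique: project each layer's hidden state onto the unembedding vectors and rank the tokens, skipping the final layer norm a real model would apply. The function name `logit_lens`, the three-token vocabulary, and every number below are invented for illustration; they are not measurements from the cited work.

```python
from typing import Dict, List, Sequence


def logit_lens(
    hidden_states: Sequence[Sequence[float]],
    unembed: Dict[str, Sequence[float]],
) -> List[List[str]]:
    """Rank vocabulary tokens at every layer by projecting that layer's
    hidden state onto each token's unembedding vector (a dot product)."""
    ranked_per_layer = []
    for h in hidden_states:
        logits = {
            token: sum(x * w for x, w in zip(h, vec))
            for token, vec in unembed.items()
        }
        # Highest logit first: position 0 is what the layer would "say".
        ranked_per_layer.append(sorted(logits, key=logits.get, reverse=True))
    return ranked_per_layer


# Hypothetical toy values: "42" stands for the correct answer,
# "..." for a format-compliant filler token.
UNEMBED = {"42": [1.0, 0.0], "...": [0.0, 1.0], "the": [-1.0, -1.0]}
HIDDEN = [
    [2.0, 0.1],  # early layer: answer direction dominates
    [1.5, 1.0],  # middle layer: filler direction growing
    [1.0, 3.0],  # final layer: filler direction dominates
]

RANKED = logit_lens(HIDDEN, UNEMBED)
for layer, tokens in enumerate(RANKED, start=1):
    print(f"layer {layer}: top={tokens[0]!r} ranking={tokens}")
```

In this toy setup the early layer ranks the answer first, while the final layer ranks the filler first and leaves the answer in second place. An evaluator that reads only the top-ranked final token sees filler and nothing else.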
This gets worse once you notice that the reasoning traces themselves may not be doing the work we assume. Several notes converge here. Chain-of-thought is constrained imitation of reasoning *form*, not genuine inference, and it degrades predictably under distribution shift, which is the signature of pattern-matching rather than logic (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"). Training format shapes the reasoning strategy far more than the actual domain content, and invalid reasoning prompts work about as well as valid ones (see "What makes chain-of-thought reasoning actually work?"). Most strikingly, models trained on systematically irrelevant traces perform comparably to those trained on correct ones and sometimes generalize *better* out of distribution, which implies traces act as computational scaffolding, not meaningful steps (see "Do reasoning traces need to be semantically correct?"). If a coherent-looking trace and a nonsense trace produce the same answer, then "looks coherent" is not a signal output alignment can use.
There's a deeper structural reason too. Models respond to corpus frequency, not meaning: semantically identical prompts produce systematically different outputs because higher-frequency phrasings carry more statistical mass (see "Why do semantically identical prompts produce different LLM outputs?"). An alignment process that rewards fluent, familiar-sounding output is rewarding statistical typicality, which is exactly the thing that can be confidently wrong. And some of this is provably unfixable from the inside. Hallucination is formally inevitable for any computable LLM, and internal self-correction can't eliminate it, which is why the corpus keeps arguing that external safeguards are necessary, not optional (see "Can any computable LLM truly avoid hallucinating?").
The interesting turn is what *does* catch incoherence, and notably, none of it judges a single finished output in isolation. Semantic entropy detects confabulations by sampling many answers and measuring whether they agree in *meaning*, catching errors invisible at the token level (see "Can we detect when language models confabulate?"). Asynchronous verifiers run alongside the reasoning trace, forking off to check verifiable state mid-generation rather than judging the final text (see "Can verifiers monitor reasoning without slowing generation down?"). The common thread: catching incoherent reasoning requires watching the *process*, not the product. Output alignment fails because, by construction, it only ever sees the product, and the corpus suggests the product is the one part of the system engineered to look fine.
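The semantic-entropy idea mentioned above can be sketched in a few lines: group sampled answers into clusters whose members entail each other in both directions, then compute entropy over the cluster sizes. This is a minimal sketch under stated assumptions. The names `cluster_by_meaning`, `semantic_entropy`, and `toy_entails` are hypothetical. `toy_entails` is a crude stand-in that compares the last word of each answer, where a real system would call an entailment model. Cluster probabilities here are plain sample frequencies, which is one simple choice rather than the only one.

```python
import math
import string
from typing import Callable, List

Entails = Callable[[str, str], bool]


def cluster_by_meaning(answers: List[str], entails: Entails) -> List[List[str]]:
    """Greedily group answers: an answer joins a cluster when it and the
    cluster's first member entail each other (bidirectional entailment)."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            rep = cluster[0]
            if entails(answer, rep) and entails(rep, answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(answers: List[str], entails: Entails) -> float:
    """Shannon entropy (in nats) over the empirical cluster frequencies."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    n = len(answers)
    clusters = cluster_by_meaning(answers, entails)
    return sum((len(c) / n) * math.log(n / len(c)) for c in clusters)


def toy_entails(a: str, b: str) -> bool:
    """Hypothetical stand-in for an entailment model: two answers 'entail'
    each other when their final words match, ignoring case and punctuation."""
    def key(text: str) -> str:
        return text.lower().strip(string.punctuation + " ").split()[-1]
    return key(a) == key(b)


# Toy samples, invented for illustration.
CONSISTENT = ["Paris", "It is Paris.", "The capital is Paris"]
SCATTERED = ["Paris", "Lyon", "Marseille"]

print(semantic_entropy(CONSISTENT, toy_entails))
print(semantic_entropy(SCATTERED, toy_entails))
```

With these toy samples, the three differently worded "Paris" answers collapse into one cluster and score zero entropy, while the three conflicting answers form three clusters and score log 3, about 1.10 nats. Token-level comparison would treat both sets as equally varied, which is why the clustering step carries the signal.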
Sources (8 notes)
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.
Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.
Clustering sampled answers by bidirectional entailment and computing entropy over semantic clusters catches confabulations invisible at token level. This self-referential approach works across tasks without task-specific training data.
Decoupling verification from generation lets verifiers run alongside a single trace, forking to extract verifiable state and intervening only on violations. On correct runs the latency penalty is near-zero; interwhen matches or beats CoT across benchmarks at similar token budgets.