Why does latent chain-of-thought fail so easily in training?
Explores why latent reasoning is fragile compared to textual chain-of-thought, focusing on how outcome-only supervision creates gradient starvation and representational drift in learned reasoning trajectories.
Why is latent chain-of-thought so hard to train robustly when textual CoT is comparatively easy? This paper gives an information-theoretic answer: latent CoT fails by a dual collapse. Outcome-only supervision (reward the final answer, ignore the trajectory) produces (1) gradient attenuation along the optimization path — the signal is too far from the latent steps to shape them — and (2) representational drift, where the latent trajectory wanders without a semantic tether. The fix decomposes into two complementary axes: Trajectory Supervision (inject dense stepwise signal) and Space Supervision (preserve the geometry of the latent manifold). The sharp, non-obvious finding is that how you do space supervision matters: rigid geometric compression collapses the high-dimensional reasoning manifold onto sparse static points, while generative reconstruction acts as a flexible semantic anchor that preserves intrinsic dimensionality.
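The gradient-attenuation half of the dual collapse can be illustrated with a toy example. The sketch below is not the paper's analysis, which is information-theoretic. It assumes a hypothetical scalar latent chain z_{t+1} = tanh(w * z_t) with a contractive weight, and compares how much gradient reaches each latent step when only the final state is supervised versus when every step gets its own direct term. The names `rollout`, `outcome_only_grads`, and `dense_grads` are invented for illustration.

```python
import math

def rollout(z0, w, steps):
    """Run a scalar latent chain z_{t+1} = tanh(w * z_t) and return all states."""
    states = [z0]
    for _ in range(steps):
        states.append(math.tanh(w * states[-1]))
    return states

def outcome_only_grads(states, w):
    """d z_T / d z_t for every t: the only signal an outcome-only loss provides."""
    T = len(states) - 1
    grads = [0.0] * (T + 1)
    grads[T] = 1.0
    for t in range(T - 1, -1, -1):
        # Chain rule through one step: d z_{t+1} / d z_t = w * (1 - z_{t+1}^2)
        local = w * (1.0 - states[t + 1] ** 2)
        grads[t] = grads[t + 1] * local
    return grads

def dense_grads(states, w):
    """Gradient of sum_t z_t w.r.t. each z_t: every step also gets a direct term."""
    T = len(states) - 1
    grads = [0.0] * (T + 1)
    grads[T] = 1.0
    for t in range(T - 1, -1, -1):
        local = w * (1.0 - states[t + 1] ** 2)
        grads[t] = 1.0 + grads[t + 1] * local  # direct term + backpropagated term
    return grads

# Assumed toy settings: 12 latent steps, contractive weight 0.8.
states = rollout(z0=0.5, w=0.8, steps=12)
sparse = outcome_only_grads(states, w=0.8)
dense = dense_grads(states, w=0.8)

for t in (0, 6, 12):
    print(f"step {t:2d}  outcome-only={sparse[t]:.4f}  dense={dense[t]:.4f}")
```

In this toy setting the outcome-only gradient shrinks geometrically toward the early steps, while the dense variant keeps a floor of 1 at every step. It says nothing about representational drift, which is the space-supervision half of the argument.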
This connects two threads that rarely meet. The trajectory-supervision half is the latent-space analogue of the process-vs-outcome reward debate the vault already holds. Per "Why do outcome-based reward models fail at intermediate step evaluation?", outcome supervision is known to underserve intermediate steps in token space; here the same pathology appears in latent space, as gradient attenuation. "Can trajectory structure replace hand-annotated process rewards?" echoes the move from sparse outcome signal to dense process signal without expensive annotation. The space-supervision half explains why latent reasoning is harder than verbal reasoning: the medium of the latent chain, which "Can continuous thoughts have tractable likelihoods for sampling and scoring?" tries to make scorable, has no built-in semantic floor, so it needs an explicit anchor that text gets for free from the vocabulary.
The unifying claim is an Information–Performance Binding, measured by a Unified Latent Probe (mutual information between latent trajectory and explicit reasoning steps): reasoning accuracy is strictly bounded by the information fidelity the latent chain retains. The strongest counterargument is that this re-tethers latent reasoning to explicit steps — if the latent chain only works when it preserves high MI with a verbal trace, the headline efficiency of going non-verbal is partly an illusion, and the gain is dense supervision rather than the latent medium itself. Either way, it reframes latent-CoT design from "pick an architecture" to "preserve information along the chain."
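The kind of quantity such a probe measures can be sketched with a plug-in mutual-information estimate over discrete samples. This note does not specify the paper's estimator, and real latent states are continuous and high-dimensional, so the discrete codes, the step labels, and the function `mutual_information_bits` below are all assumptions made for illustration.

```python
import math
from collections import Counter

def mutual_information_bits(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

# Hypothetical explicit reasoning steps, repeated over four problems.
steps = ["parse", "add", "carry", "emit"] * 4

faithful = [0, 1, 2, 3] * 4   # one distinct latent code per explicit step
partial = [0, 0, 1, 1] * 4    # latent codes merge pairs of steps
collapsed = [0, 0, 0, 0] * 4  # latent chain carries no step information

print(mutual_information_bits(faithful, steps))   # 2.0
print(mutual_information_bits(partial, steps))    # 1.0
print(mutual_information_bits(collapsed, steps))  # 0.0
```

Under the binding claim, the collapsed chain is the failure case: it retains zero bits about the explicit steps, so no decoder reading it can recover which step is being executed.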
Inquiring lines that use this note as a source (5)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How does latent state recursion differ mechanistically from chain-of-thought prompting?
- When does provable stability in latent dynamics fail to preserve fidelity?
- How much explicit verbal signal must latent chains retain to perform well?
- Why does textual chain-of-thought avoid the representational drift problem automatically?
- Why does explicit chain-of-thought work as a workaround for feedforward transformers?
Related concepts in this collection (3)
- Why do outcome-based reward models fail at intermediate step evaluation?
  Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.
  convergent-with: outcome-vs-process pathology in token space reappears as gradient attenuation in latent space
- Can trajectory structure replace hand-annotated process rewards?
  Recent methods extract step-level supervision directly from how agent trajectories are structured (trees, expert alignments, tool calls) rather than training separate reward models. Can this structural approach consistently avoid annotation costs?
  convergent-with: the move from sparse outcome to dense trajectory signal
- Can continuous thoughts have tractable likelihoods for sampling and scoring?
  Most latent-reasoning methods discard the likelihood and sampling properties that made textual chain-of-thought trainable. Can normalizing flows recover those affordances in continuous thought space while preserving efficiency?
  complements: NF-CoT supplies the scorable medium; this note explains what the supervision over that medium must preserve
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- What Makes Effective Supervision in Latent Chain-of-Thought? An Information-Theoretic Analysis
- Hierarchical Reasoning Model
- RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
- LLM Reasoning Is Latent, Not the Chain of Thought
- Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
- CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective
- Training Large Language Models to Reason in a Continuous Latent Space
- Thought Anchors: Which LLM Reasoning Steps Matter?
Original note title
latent chain-of-thought fails by dual collapse — outcome supervision starves gradients along the trajectory and lets the latent space drift, so reasoning accuracy is bounded by the mutual information the latent chain retains