SYNTHESIS NOTE

Why does latent chain-of-thought fail so easily in training?

Explores why latent reasoning is fragile compared to textual chain-of-thought, focusing on how outcome-only supervision creates gradient starvation and representational drift in learned reasoning trajectories.

Synthesis note · 2026-06-27 · sourced from Cognitive Models Latent

Why is latent chain-of-thought so hard to train robustly when textual CoT is comparatively easy? The source paper gives an information-theoretic answer: latent CoT fails by a dual collapse. Outcome-only supervision (reward the final answer, ignore the trajectory) produces (1) gradient attenuation along the optimization path, because the signal sits too far from the latent steps to shape them, and (2) representational drift, where the latent trajectory wanders without a semantic tether. The fix decomposes into two complementary axes: Trajectory Supervision (inject dense stepwise signal) and Space Supervision (preserve the geometry of the latent manifold). The sharp, non-obvious finding is that the method of space supervision matters: rigid geometric compression collapses the high-dimensional reasoning manifold onto sparse static points, while generative reconstruction acts as a flexible semantic anchor that preserves intrinsic dimensionality.
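The gradient-attenuation half of this claim can be made concrete with a toy model. The sketch below is an illustration under loud assumptions, not the paper's setup: the latent chain is a purely linear recurrence with a contractive weight matrix, the stepwise targets are random vectors standing in for any dense trajectory signal, and the function name `latent_grad_norms` and all of its parameters are hypothetical. It returns the norm of the loss gradient with respect to each latent step, with and without a per-step loss term.

```python
import numpy as np

def latent_grad_norms(T=8, dim=16, gain=0.8, dense=False, seed=0):
    """Gradient norms of the loss w.r.t. latent steps 1..T of a toy linear chain.

    Assumed toy model: h[t+1] = W @ h[t], with the spectral norm of W set to
    `gain` < 1. Outcome-only loss: 0.5 * ||h[T] - y[T]||^2. With dense=True,
    a term 0.5 * ||h[t] - y[t]||^2 is added for every intermediate step t.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((dim, dim))
    W *= gain / np.linalg.norm(W, 2)      # rescale so the spectral norm equals gain

    # Forward pass: roll the latent chain out for T steps.
    h = [rng.standard_normal(dim)]
    for _ in range(T):
        h.append(W @ h[-1])

    # Random stand-ins for stepwise supervision targets (an assumption).
    targets = [rng.standard_normal(dim) for _ in range(T + 1)]

    # Backward pass by hand: g holds dLoss/dh[t].
    g = h[T] - targets[T]                 # gradient of the outcome term at step T
    norms = np.zeros(T + 1)
    norms[T] = np.linalg.norm(g)
    for t in range(T - 1, 0, -1):
        g = W.T @ g                       # signal propagated back from later steps
        if dense:
            g = g + (h[t] - targets[t])   # local signal injected at step t
        norms[t] = np.linalg.norm(g)
    return norms[1:]                      # index 0 is step 1, index -1 is step T

outcome_only = latent_grad_norms(dense=False)
with_dense = latent_grad_norms(dense=True)
print("outcome-only:", np.round(outcome_only, 4))
print("dense signal:", np.round(with_dense, 4))
```

In this toy, the outcome-only gradient at step t can be no larger than `gain ** (T - t)` times the gradient at the final step, so early steps receive a geometrically shrinking signal, while the dense variant adds a fresh local term at every step. Real latent-CoT models are nonlinear, so this shows the mechanism only and says nothing about magnitudes in practice.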

This connects two threads that rarely meet. The trajectory-supervision half is the latent-space analogue of the process-vs-outcome reward debate the vault already holds. As "Why do outcome-based reward models fail at intermediate step evaluation?" records, outcome supervision is known to underserve intermediate steps in token space; here the same pathology appears in latent space, as gradient attenuation. "Can trajectory structure replace hand-annotated process rewards?" echoes the move from sparse outcome to dense process signal without expensive annotation. The space-supervision half explains why latent reasoning is harder than verbal reasoning: the medium of the latent chain, which "Can continuous thoughts have tractable likelihoods for sampling and scoring?" tries to make scorable, has no built-in semantic floor, so it needs an explicit anchor that text gets for free from the vocabulary.

The unifying claim is an Information–Performance Binding, measured by a Unified Latent Probe (mutual information between the latent trajectory and explicit reasoning steps): reasoning accuracy is strictly bounded by the information fidelity the latent chain retains. The strongest counterargument is that this re-tethers latent reasoning to explicit steps. If the latent chain only works when it preserves high MI with a verbal trace, the headline efficiency of going non-verbal is partly an illusion, and the gain comes from dense supervision rather than from the latent medium itself. Either way, it reframes latent-CoT design from "pick an architecture" to "preserve information along the chain."
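To see how retained information can cap accuracy, here is a minimal sketch. It makes several assumptions that do not come from the note: the latent state and the explicit reasoning step are both discretized into a joint count table, mutual information is computed with a simple plug-in estimator, and the cap is the standard weakened form of Fano's inequality for a target that is uniform over `num_classes` outcomes. The note does not specify the Unified Latent Probe's estimator or the paper's actual bound, so the function names `mutual_information_bits` and `fano_accuracy_cap` are hypothetical stand-ins.

```python
import numpy as np

def mutual_information_bits(joint_counts):
    """Plug-in estimate of I(Z; Y) in bits from a table of co-occurrence counts.

    Rows index a discretized latent code Z, columns index an explicit step label Y.
    """
    p = np.asarray(joint_counts, dtype=float)
    p = p / p.sum()                        # joint distribution p(z, y)
    pz = p.sum(axis=1, keepdims=True)      # marginal p(z), shape (K, 1)
    py = p.sum(axis=0, keepdims=True)      # marginal p(y), shape (1, M)
    mask = p > 0                           # skip empty cells (0 * log 0 = 0)
    return float((p[mask] * np.log2(p[mask] / (pz @ py)[mask])).sum())

def fano_accuracy_cap(mi_bits, num_classes):
    """Upper bound on accuracy from weakened Fano, assuming a uniform target.

    P_error >= (H(Y|Z) - 1) / log2(K) with H(Y) = log2(K) gives
    accuracy <= (I(Z; Y) + 1) / log2(K).
    """
    return float(min(1.0, (mi_bits + 1.0) / np.log2(num_classes)))

# A latent code that perfectly identifies one of 4 steps carries 2 bits.
perfect = np.eye(4)
# A latent code that is independent of the step carries none.
independent = np.ones((3, 3))

print(mutual_information_bits(perfect))
print(mutual_information_bits(independent))
print(fano_accuracy_cap(1.0, 16))
```

The weakened Fano bound is loose for small label sets (with 4 classes, even 1 bit of mutual information gives a cap of 1.0), so it is useful as a direction-of-effect illustration and not as a tight prediction of accuracy.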

Related concepts in this collection 3

Why do outcome-based reward models fail at intermediate step evaluation? Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.
convergent-with: outcome-vs-process pathology in token space reappears as gradient attenuation in latent space
Can trajectory structure replace hand-annotated process rewards? Recent methods extract step-level supervision directly from how agent trajectories are structured—trees, expert alignments, tool calls—rather than training separate reward models. Can this structural approach consistently avoid annotation costs?
convergent-with: the move from sparse outcome to dense trajectory signal
Can continuous thoughts have tractable likelihoods for sampling and scoring? Most latent-reasoning methods discard the likelihood and sampling properties that made textual chain-of-thought trainable. Can normalizing flows recover those affordances in continuous thought space while preserving efficiency?
complements: NF-CoT supplies the scorable medium; this note explains what the supervision over that medium must preserve

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

What Makes Effective Supervision in Latent Chain-of-Thought? An Information-Theoretic Analysis (0.87 match)
Hierarchical Reasoning Model (0.84 match)
RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs (0.84 match)
LLM Reasoning Is Latent, Not the Chain of Thought (0.83 match)
Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts? (0.82 match)
CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective (0.82 match)
Training Large Language Models to Reason in a Continuous Latent Space (0.82 match)
Thought Anchors: Which LLM Reasoning Steps Matter? (0.82 match)

Original note title

latent chain-of-thought fails by dual collapse — outcome supervision starves gradients along the trajectory and lets the latent space drift, so reasoning accuracy is bounded by the mutual information the latent chain retains
