Does failed-step fraction predict reasoning quality better than trace length or review ratio?
Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.
Across 10 large reasoning models on math and scientific reasoning tasks, a single structural graph metric — Failed-Step Fraction (FSF) — consistently outperforms CoT length and review ratio as a predictor of correctness.
FSF is defined as the fraction of steps belonging to failed exploratory branches in the reasoning graph. A failed branch is a set of reasoning steps that were explored and then abandoned before reaching the final answer. High FSF means the model spent significant effort on dead ends; low FSF means reasoning was mostly direct.
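As a rough illustration of the definition, the sketch below computes FSF for a trace whose steps have already been labeled. The `Step` schema and the `failed` flag are assumptions made for this example; the note does not say how the reasoning graph is extracted or how branches are marked as failed.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One node of a reasoning graph (hypothetical schema, not from the source)."""
    text: str
    failed: bool  # True if the step sits on a branch that was later abandoned


def failed_step_fraction(steps: list[Step]) -> float:
    """Fraction of steps that belong to failed exploratory branches."""
    if not steps:
        return 0.0  # assumption: an empty trace counts as having no failed effort
    return sum(s.failed for s in steps) / len(steps)


trace = [
    Step("Set up the equation", failed=False),
    Step("Try substitution u = x^2", failed=True),
    Step("Substitution leads nowhere, backtrack", failed=True),
    Step("Factor directly", failed=False),
    Step("State the answer", failed=False),
]
print(failed_step_fraction(trace))  # 0.4
```

Here two of the five steps sit on an abandoned branch, so FSF is 0.4. Whether the backtracking step itself counts as failed is a labeling choice this sketch simply assumes.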
Three converging lines of evidence:
- Correlation analysis (conditional on question): shorter reasoning traces and lower review ratios are each associated with higher accuracy; FSF is the strongest and most stable predictor across difficulty strata and all 10 models.
- Test-time selection: sampling 64 generations per problem and reranking by each metric shows that FSF-based selection yields the largest pass@1 gains (up to 10% on AIME), outperforming length-based and review-based selection.
- Causal intervention: directly editing CoT traces to remove failed branches substantially improves accuracy on previously incorrect traces.
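One possible reading of "conditional on question" is to remove each question's own average before correlating a metric with correctness, so that hard questions (long traces, low accuracy) do not dominate the result. The sketch below takes that reading; the record layout, the function names, and the toy numbers are all assumptions for illustration, not the analysis actually used in the source.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either side has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def within_question_correlation(records):
    """records: iterable of (question_id, metric_value, correct) tuples.

    Centers metric and correctness on each question's mean, then correlates
    the pooled residuals (an assumed implementation of 'conditional on question').
    """
    by_question = defaultdict(list)
    for qid, metric, correct in records:
        by_question[qid].append((metric, float(correct)))

    xs, ys = [], []
    for rows in by_question.values():
        metric_mean = sum(m for m, _ in rows) / len(rows)
        correct_mean = sum(c for _, c in rows) / len(rows)
        for m, c in rows:
            xs.append(m - metric_mean)
            ys.append(c - correct_mean)
    return pearson(xs, ys)


# Toy data (made up): within each question, the lower-FSF sample is the correct one.
records = [
    ("q1", 0.10, True), ("q1", 0.50, False),
    ("q2", 0.60, True), ("q2", 0.90, False),
]
r = within_question_correlation(records)
print(r < 0)  # True
```

The same function can be run with trace length or review ratio in place of FSF to compare the metrics on equal footing.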
The causal mechanism: failed branches do not disappear from the model's context when backtracking occurs. Current models do not fully "unsee" earlier mistakes when exploring new paths. The failed branches bias subsequent exploration, pulling reasoning toward already-rejected directions and compounding errors.
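In its simplest form, the intervention from the evidence list (removing failed branches so they no longer sit in context for later steps) might look like the sketch below. It assumes steps are already tagged as failed or not; the tuple layout and the function name are hypothetical, and the source does not describe its editing procedure at this level of detail.

```python
def prune_failed_branches(steps):
    """steps: list of (text, failed) pairs in trace order.

    Returns only the text of steps that are not on a failed branch,
    preserving order, so the edited trace contains no abandoned exploration.
    """
    return [text for text, failed in steps if not failed]


trace = [
    ("Set up the equation", False),
    ("Try substitution u = x^2", True),
    ("Substitution leads nowhere, backtrack", True),
    ("Factor directly", False),
    ("State the answer", False),
]
print(prune_failed_branches(trace))
# ['Set up the equation', 'Factor directly', 'State the answer']
```

A real edit would also need to check that the surviving steps still read coherently once the abandoned ones are gone, which this sketch does not attempt.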
This connects to the note "Which sentences actually steer a reasoning trace?": thought anchors are the positive pivots where reasoning changes direction successfully, while FSF is the corresponding negative measure of how much failed exploration is poisoning the context.
The practical implication is concrete: structure-aware test-time scaling (select for low FSF) outperforms indiscriminate scaling (add more tokens, encourage more review). Length and review are proxies for FSF — but noisy ones. The graph structure of reasoning is the real signal.
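Selecting for low FSF among sampled generations can be sketched as a simple rerank. The `(answer, fsf)` pair layout and the function name are assumptions for this example, and the FSF values are made up.

```python
def select_by_lowest_fsf(candidates):
    """candidates: list of (answer, fsf) pairs sampled for one problem.

    Returns the answer from the generation with the lowest FSF.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(candidates, key=lambda c: c[1])[0]


# Four hypothetical generations for the same problem.
samples = [("42", 0.35), ("41", 0.60), ("42", 0.05), ("40", 0.50)]
print(select_by_lowest_fsf(samples))  # 42
```

Swapping the sort key for token count or review ratio gives the length-based and review-based baselines that the note compares against.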
Inquiring lines that use this note as a source (14)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- What makes diverse failure modes more informative than single failure examples?
- Why do shorter correct reasoning traces contain fewer failed branches?
- How do failed branches remain in context and contaminate subsequent reasoning?
- Can removing failed branches from edited traces improve previous mistakes?
- Does parallel sampling avoid failed-branch contamination more than sequential thinking?
- Can we predict when a specific prompt will fail on a given question?
- What reasoning token threshold marks the accuracy degradation point?
- Why does revision often make reasoning accuracy worse in frontier models?
- Why does failed step fraction predict reasoning quality better than trace length?
- What debugging behaviors signal that a user has abandoned the coding loop?
- What failure modes does the negative-space checklist generation method actually catch?
- Why do wrong numbers cost less accuracy than shuffled reasoning steps?
- Why does enlarging the evaluation unit reintroduce comparability problems?
- What makes financial reasoning particularly vulnerable to general PRM failures?
Related concepts in this collection (5)
- Which sentences actually steer a reasoning trace?
  Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.
  FSF is the negative measure of the same phenomenon: failed branches vs. successful pivots.
- Why do correct reasoning traces contain fewer tokens?
  In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
  FSF explains why: shorter correct traces contain fewer failed branches.
- Does self-revision actually improve reasoning in language models?
  When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.
  Self-revision is one mechanism that creates failed branches; FSF captures the accumulated damage.
- Why does parallel reasoning outperform single chain thinking?
  Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
  Parallel sampling avoids failed-branch contamination by exploring independent paths; FSF explains the advantage.
- Do models fail worse when their own errors fill the context?
  As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?
  Self-conditioning is the mechanism that makes high FSF toxic: failed branches remain in context and passively contaminate subsequent reasoning. FSF quantifies the degree of contamination; self-conditioning explains why it degrades performance.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
- Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
- Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?
- Reasoning Language Models: A Blueprint
- Break the Chain: Large Language Models Can be Shortcut Reasoners
- Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis
- Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
- When More is Less: Understanding Chain-of-Thought Length in LLMs
Original note title
failed-step fraction is a stronger predictor of reasoning quality than trace length or review ratio