SYNTHESIS NOTE

Does failed-step fraction predict reasoning quality better?

Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.

Synthesis note · 2026-02-22 · sourced from Reasoning Critiques

Across 10 large reasoning models on math and scientific reasoning tasks, a single structural graph metric — Failed-Step Fraction (FSF) — consistently outperforms CoT length and review ratio as a predictor of correctness.

FSF is defined as the fraction of steps belonging to failed exploratory branches in the reasoning graph. A failed branch is a set of reasoning steps that were explored and then abandoned before reaching the final answer. High FSF means the model spent significant effort on dead ends; low FSF means reasoning was mostly direct.
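
The definition above can be made concrete with a small sketch. It assumes, purely for illustration, that the reasoning graph is a tree of steps with parent pointers and that any step not on the path from the root to the final-answer step belongs to a failed branch. The names `Step` and `failed_step_fraction` are hypothetical, and the source does not specify how the graph is actually extracted from a trace.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    step_id: int
    parent_id: Optional[int]  # None marks the root step


def failed_step_fraction(steps: list[Step], final_id: int) -> float:
    """Fraction of steps that are NOT on the root-to-final-answer path.

    Assumption: every step off that path counts as part of a failed
    (explored, then abandoned) branch.
    """
    if not steps:
        return 0.0
    parent = {s.step_id: s.parent_id for s in steps}
    on_path: set[int] = set()
    node: Optional[int] = final_id
    # Walk from the final step back to the root, collecting the successful path.
    while node is not None and node not in on_path:
        on_path.add(node)
        node = parent[node]
    return (len(steps) - len(on_path)) / len(steps)


# Toy trace: steps 1 and 2 form a dead end; steps 0 -> 3 -> 4 reach the answer.
trace = [Step(0, None), Step(1, 0), Step(2, 1), Step(3, 0), Step(4, 3)]
fsf = failed_step_fraction(trace, final_id=4)
```

In this toy trace, two of the five steps sit on the abandoned branch, so the sketch yields an FSF of 2/5.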

Three converging lines of evidence:

Correlation analysis (conditional on question): shorter reasoning traces and lower review ratios are each associated with higher accuracy, but FSF is the strongest and most stable predictor across difficulty strata and all 10 models.
Test-time selection: sampling 64 generations per problem and reranking by each metric shows that FSF-based selection yields the largest pass@1 gains (up to 10% on AIME), outperforming length-based and review-based selection.
Causal intervention: directly editing CoT traces to remove failed branches substantially improves accuracy on previously incorrect traces.
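
The test-time selection step can be sketched as a simple reranker. This is a minimal illustration, assuming each sampled generation already has an FSF score and a token count attached. The `Candidate` type, the function names, and the numbers in the example are all hypothetical and are not results from the source.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    answer: str
    fsf: float        # failed-step fraction of this generation (assumed precomputed)
    num_tokens: int   # length of the reasoning trace


def select_by_fsf(candidates: list[Candidate]) -> Candidate:
    """Pick the generation with the lowest FSF (ties go to the earliest sample)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(candidates, key=lambda c: c.fsf)


def select_by_length(candidates: list[Candidate]) -> Candidate:
    """Baseline: pick the shortest generation."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(candidates, key=lambda c: c.num_tokens)


# Made-up samples for one problem, only to show that the two rules can disagree.
samples = [
    Candidate(answer="42", fsf=0.50, num_tokens=900),
    Candidate(answer="17", fsf=0.10, num_tokens=1400),
    Candidate(answer="42", fsf=0.35, num_tokens=700),
]
picked_by_fsf = select_by_fsf(samples)
picked_by_length = select_by_length(samples)
```

Here the lowest-FSF sample is also the longest one, so FSF-based and length-based selection return different answers. That kind of disagreement is what lets the two metrics be compared as selectors.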

The causal mechanism: failed branches do not disappear from the model's context when backtracking occurs. Current models do not fully "unsee" earlier mistakes when exploring new paths. The failed branches bias subsequent exploration, pulling reasoning toward already-rejected directions and compounding errors.
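
One way to picture the intervention that follows from this mechanism is to drop abandoned steps from the trace before the model continues. The sketch below assumes the same hypothetical tree representation (step id, parent id, text) and keeps only the steps on the path to the final step. It is not the editing procedure used in the source, which is not described in this note.

```python
from __future__ import annotations

from typing import Optional

# Each step is (step_id, parent_id, text); parent_id is None for the root.
StepTuple = tuple[int, Optional[int], str]


def prune_failed_branches(steps: list[StepTuple], final_id: int) -> list[str]:
    """Return step texts on the root-to-final path, in their original order."""
    parent = {sid: pid for sid, pid, _ in steps}
    keep: set[int] = set()
    node: Optional[int] = final_id
    # Walk back from the final step to the root to find the surviving path.
    while node is not None and node not in keep:
        keep.add(node)
        node = parent[node]
    return [text for sid, _, text in steps if sid in keep]


steps = [
    (0, None, "Set up the equation."),
    (1, 0, "Try the substitution u = x^2."),
    (2, 1, "This leads to a contradiction; backtrack."),
    (3, 0, "Factor the left-hand side instead."),
    (4, 3, "Solve each factor: x = 2."),
]
pruned = prune_failed_branches(steps, final_id=4)
```

After pruning, the abandoned substitution attempt no longer appears in the context that later steps would condition on.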

This connects to the note "Which sentences actually steer a reasoning trace?": thought anchors are the positive pivots where reasoning changes direction successfully, and FSF is the corresponding negative measure of how much failed exploration is poisoning the context.

The practical implication is concrete: structure-aware test-time scaling (select for low FSF) outperforms indiscriminate scaling (add more tokens, encourage more review). Length and review are proxies for FSF — but noisy ones. The graph structure of reasoning is the real signal.

Inquiring lines that read this note 15

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How can AI systems learn from failures without cascading errors?

Why do correct reasoning traces tend to be shorter than incorrect ones?

Do corrupted reasoning traces serve as effective supervision signals?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

Does parallel sampling avoid failed-branch contamination more than sequential thinking?

Can prompting strategies overcome LLM biases without model fine-tuning?

Can we predict when a specific prompt will fail on a given question?

When do additional thinking tokens stop improving reasoning performance?

What reasoning token threshold marks the accuracy degradation point?

Why does self-revision increase model confidence while degrading accuracy?

Why does revision often make reasoning accuracy worse in frontier models?

How does AI assistance affect human cognitive development and reasoning autonomy?

What debugging behaviors signal that a user has abandoned the coding loop?

How do prompt structure and constraints affect model instruction reliability?

What failure modes does the negative-space checklist generation method actually catch?

Can ensemble evaluation methods reduce bias more than single judges?

Why does enlarging the evaluation unit reintroduce comparability problems?

Can single-axis benchmarks accurately predict agent deployment success?

What capability dimensions does a single aggregate pass rate hide?

Related concepts in this collection 5

Which sentences actually steer a reasoning trace? Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.
FSF is the negative measure of the same phenomenon: failed branches vs. successful pivots
Why do correct reasoning traces contain fewer tokens? In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
FSF explains why: shorter correct traces contain fewer failed branches
Does self-revision actually improve reasoning in language models? When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.
self-revision is one mechanism that creates failed branches; FSF captures the accumulated damage
Why does parallel reasoning outperform single chain thinking? Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
parallel sampling avoids failed-branch contamination by exploring independent paths; FSF explains the advantage
Do models fail worse when their own errors fill the context? As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?
self-conditioning is the mechanism that makes high FSF toxic: failed branches remain in context and passively contaminate subsequent reasoning — FSF quantifies the degree of contamination, self-conditioning explains why it degrades performance
