INQUIRING LINE

What are the six types of reasoning steps that appear in chain-of-thought?

This explores a specific claim in the corpus — that reasoning inside chain-of-thought can be sorted into six distinct kinds of steps — and what that taxonomy is for.


This question points at one paper in particular: a framework called PI (prompt intervention) that breaks the reasoning a model does mid-chain into six categories of step, then watches which ones actually matter (Can reasoning steps be dynamically pruned without losing accuracy?). The headline result is more interesting than the list itself: using the model's own attention maps, the researchers found that some step types (notably verification and backtracking, the 'let me double-check' and 'wait, that's wrong' moves) receive very little attention from later tokens. Keeping only the high-attention steps let them strip out roughly 75% of the reasoning while holding accuracy steady. So the six-way split isn't just bookkeeping; it's a way of asking which reasoning moves are load-bearing and which are theater.
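The selection step described above can be sketched in a few lines. This is a minimal illustration, not the PI authors' implementation: the `Step` record, the `attention` field (assumed here to be a precomputed mean attention each step receives from later tokens), the `prune_steps` helper, and the `keep_fraction` parameter are all hypothetical names. Only "verification" and "backtracking" are step labels taken from the text; "other" is a placeholder for the remaining categories.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str         # step label; only "verification"/"backtracking" come from the text
    text: str         # the verbalized reasoning step
    attention: float  # ASSUMPTION: mean attention received from later tokens


def prune_steps(steps, keep_fraction=0.25):
    """Keep the highest-attention steps, preserving their original order."""
    if not steps:
        return []
    # Always keep at least one step, even for very small chains.
    n_keep = max(1, round(len(steps) * keep_fraction))
    # Rank step indices by attention, highest first (stable for ties).
    ranked = sorted(range(len(steps)), key=lambda i: steps[i].attention, reverse=True)
    # Restore chain order among the survivors.
    kept = sorted(ranked[:n_keep])
    return [steps[i] for i in kept]


# Toy chain with made-up attention scores, for illustration only.
chain = [
    Step("other", "Set up the equation.", 0.90),
    Step("verification", "Let me double-check that.", 0.05),
    Step("other", "Solve for x.", 0.70),
    Step("backtracking", "Wait, that's wrong.", 0.02),
]
pruned = prune_steps(chain, keep_fraction=0.5)
print([s.kind for s in pruned])
```

A default of `keep_fraction=0.25` mirrors the roughly 75% reduction quoted above, but whether accuracy holds at that setting is an empirical question this sketch does not answer.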

That 'theater' worry is exactly where the rest of the corpus gets pointed. Several notes argue that chain-of-thought steps often don't cause the answer they appear to justify: chains can fail both causal sufficiency (the steps don't always matter) and causal necessity (spurious steps creep in) (Do language models actually use their reasoning steps?), and in multi-step agent pipelines the apparent quality of a reasoning chain is only weakly correlated with whether the output is right (Does chain-of-thought reasoning actually explain AI decisions?). A categorization that flags low-attention step types is, in effect, a tool for finding the parts of the chain that are decorative.

The more surprising thing is that PI's six categories are only one of several competing 'periodic tables' of reasoning the corpus holds. One note classifies whole reasoning topologies (chain, tree, graph) as formal graph types, where the structure isn't a metaphor but determines what the computation can express (Can reasoning topologies be formally classified as graph types?). Another models long chains as having a 'molecular bond' structure with three interaction types (deep reasoning, self-reflection, self-exploration) and finds that mixing these from different teacher models destabilizes training (Does long chain of thought reasoning follow molecular bond patterns?). A third decomposes CoT performance not by step type at all but by three hidden forces: raw output probability, memorization, and genuinely error-accumulating reasoning (What three separate factors drive chain-of-thought performance?).
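The topology taxonomy in particular lends itself to a concrete check. The sketch below classifies a set of directed thought-to-thought edges as a chain, tree, or graph using only node degrees. It assumes the input is connected and acyclic; the function name `classify_topology` and the edge-list representation are illustrative choices, not something the cited note specifies.

```python
from collections import defaultdict


def classify_topology(edges):
    """Classify directed (parent, child) edges as 'chain', 'tree', or 'graph'.

    ASSUMPTION: the edges form a connected, acyclic structure.
    """
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        out_deg[parent] += 1
        in_deg[child] += 1
        nodes.update((parent, child))

    # A thought with more than one parent merges branches: not expressible as a tree.
    if any(in_deg[n] > 1 for n in nodes):
        return "graph"
    # Branching without merging is a tree.
    if any(out_deg[n] > 1 for n in nodes):
        return "tree"
    # Neither branching nor merging: a simple path.
    return "chain"


print(classify_topology([("a", "b"), ("b", "c")]))
print(classify_topology([("a", "b"), ("a", "c")]))
print(classify_topology([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]))
```

The third example is the smallest case of the merge pattern: two branches from `a` are recombined at `d`, which gives `d` an in-degree of 2 and takes the structure out of the tree class.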

What you didn't ask but might want: the number of categories is a research choice, not a fact about the model. Six step types, three topologies, three bond types, three performance factors: each taxonomy is built to answer a different question (what to prune, what a structure can express, what to train on, what drives the score). And a quieter line in the corpus suggests the whole exercise of categorizing visible steps may be optional: latent-reasoning models solve hard puzzles entirely in hidden computation, with no verbalized steps to categorize at all (Can models reason without generating visible thinking steps?). The six types are best read as a map of which spoken reasoning moves earn their keep, not as the anatomy of how the model actually thinks.


Sources: 7 notes

Can reasoning steps be dynamically pruned without losing accuracy?

The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.

Do language models actually use their reasoning steps?

LLM reasoning chains fail both causal sufficiency (steps don't always matter) and causal necessity (spurious steps are common). Research shows most CoT evaluation measures output quality, not whether reasoning actually caused the answer.

Does chain-of-thought reasoning actually explain AI decisions?

Research shows that CoT reasoning quality is weakly correlated with output correctness in agentic pipelines. Chains generate analyzable material that appears coherent but doesn't causally produce outputs, creating false confidence in explainability.

Can reasoning topologies be formally classified as graph types?

CoT, ToT, and GoT map precisely to path graphs, trees, and arbitrary directed graphs respectively. The topology is not metaphorical but defines actual computational structure—GoT's in-degree > 1 enables divide-and-conquer synthesis that trees cannot express.

Does long chain of thought reasoning follow molecular bond patterns?

Deep-Reasoning (covalent), Self-Reflection (hydrogen bonds), and Self-Exploration (van der Waals forces) form stable distributions in effective Long CoT. Mixing these stable structures from different teachers destabilizes learning despite matched performance metrics.

What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.

Can models reason without generating visible thinking steps?

Depth-recurrent and compressed-token architectures solve reasoning tasks through hidden computation rather than output tokens. A 27M-parameter model solved Sudoku-Extreme and 30×30 mazes perfectly while CoT methods scored zero.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-audit analyst. The question remains: which reasoning steps in chain-of-thought are *causally load-bearing* versus decorative?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. A curated library identified:
- Six step-type categories (verification, backtracking, etc.) via attention-based intervention; ~75% of steps could be pruned without accuracy loss (2025-08, arXiv:2508.02511).
- CoT steps often fail causal sufficiency (steps don't always matter) and necessity (spurious steps appear); intermediate reasoning quality weakly predicts output correctness in agentic pipelines (2025-06, arXiv:2506.09853; 2025-05, arXiv:2505.00875).
- Reasoning topology (chain, tree, graph as formal structures) and "molecular bond" interaction types (deep reasoning, self-reflection, self-exploration) are competing taxonomies; taxonomy choice depends on downstream task, not on model anatomy (2024-01, arXiv:2401.14295; 2026-01, arXiv:2601.06002).
- Hidden-reasoning models solve hard problems with zero verbalized steps, undermining the assumption that visible reasoning is where cognition happens (2025-05, arXiv:2505.00875).

Anchor papers (verify; mind their dates):
- arXiv:2508.02511 (Test-time Prompt Intervention, 2025-08)
- arXiv:2506.09853 (Causal Sufficiency and Necessity, 2025-06)
- arXiv:2601.06002 (Molecular Structure of Thought, 2026-01)
- arXiv:2505.00875 (Thoughts without Thinking, 2025-05)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 75% pruning claim, the causal-sufficiency gap, and the weak correlation in agentic pipelines: check whether scaling, inference-time orchestration (multi-pass critique, tool use, memory caches), or newer evals have since *relaxed* these limits. Separately: do hidden-reasoning models (o1, o3, internals of Llama thinking) now dominate, or do autoregressive CoT chains still carry the load? Cite what tightened or loosened each finding.
(2) Surface the strongest *contradicting or superseding* work from the last 6 months — papers arguing CoT *is* faithful, or that the step categories are *not* decorative, or that verbalized reasoning is necessary after all.
(3) Propose 2 research questions that assume the regime may have shifted: (a) If hidden reasoning dominates, how do we audit or steer it?; (b) Under what training regimes do visible steps become load-bearing again?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
