Can reasoning systems forget history without losing coherence?
Does treating each reasoning step as independent—rather than accumulating historical context—actually preserve problem-solving quality while reducing computational waste? This explores whether Markov-style memoryless reasoning can scale effectively.
Existing test-time scaling methods all carry history along. Chain-based methods preserve the entire reasoning trace to generate each next step. Tree-based methods track ancestor and sibling relationships across branches. Graph-based methods compound this with arbitrary node dependencies. As reasoning scales, the accumulated historical dependencies waste compute and — worse — interfere with the model's ability to reason effectively on the current state.
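The compute cost of carrying history can be made concrete with a toy cost model. This is a rough sketch under stated assumptions, not a measurement from the paper: every step emits the same number of tokens, a history-carrying chain re-reads all earlier steps at each step, and a memoryless method reads only a fixed-size current state. The function names and the numbers are hypothetical.

```python
def history_carrying_tokens(steps: int, step_tokens: int) -> int:
    """Tokens processed when step k reads the k-1 earlier steps plus its own output."""
    return sum(k * step_tokens for k in range(1, steps + 1))


def memoryless_tokens(steps: int, step_tokens: int, state_tokens: int) -> int:
    """Tokens processed when each step reads only a bounded current state."""
    return steps * (state_tokens + step_tokens)


# Hypothetical sizes, chosen only to show the growth pattern.
chain_total = history_carrying_tokens(steps=10, step_tokens=100)
markov_total = memoryless_tokens(steps=10, step_tokens=100, state_tokens=100)
print(chain_total, markov_total)
```

Under these assumptions the history-carrying total grows quadratically in the number of steps, while the memoryless total grows linearly. The model says nothing about the interference effect, which is a quality problem rather than a token count.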
Atom of Thoughts (2502.12018) makes a different bet: each reasoning state should be a simplified problem equivalent to the original, with partial reasoning steps either transformed into known conditions or excluded as incorrect explorations. The state transition mechanism has two phases. First, decompose the current question into a dependency-based directed acyclic graph (DAG) capturing structural information. Second, contract the subquestions into a new independent question. The decomposition-contraction cycle repeats until the question becomes atomic, that is, directly solvable.
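The two-phase transition can be sketched on a toy arithmetic problem. This is a minimal illustration and not the paper's implementation: in AoT an LLM performs both decomposition and contraction, whereas here the dependency DAG is given up front and subquestions are plain Python functions. `Sub`, `State`, `transition`, and `solve` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Sub:
    """A subquestion: the names it depends on and how to answer it from them."""
    deps: Tuple[str, ...]
    fn: Callable[..., float]


@dataclass(frozen=True)
class State:
    """A self-contained question: known conditions, open subquestions, target."""
    known: Dict[str, float]
    pending: Dict[str, Sub]
    target: str


def transition(state: State) -> State:
    # Phase 1 (decompose): subquestions whose dependencies are all known
    # are independent and can be answered directly.
    independent = {n: s for n, s in state.pending.items()
                   if all(d in state.known for d in s.deps)}
    if not independent:
        raise ValueError("no directly solvable subquestion in this state")
    solved = {n: s.fn(*(state.known[d] for d in s.deps))
              for n, s in independent.items()}
    # Phase 2 (contract): fold the answers in as known conditions and keep
    # only the conditions that the remaining subquestions still need.
    remaining = {n: s for n, s in state.pending.items() if n not in independent}
    needed = {state.target} | {d for s in remaining.values() for d in s.deps}
    merged = {**state.known, **solved}
    known = {k: v for k, v in merged.items() if k in needed}
    return State(known, remaining, state.target)


def solve(state: State) -> Tuple[float, int]:
    steps = 0
    while state.target not in state.known:
        state = transition(state)  # depends only on the current state
        steps += 1
    return state.known[state.target], steps


# Toy problem: a car covers 120 km in 2 hours; how long does 300 km take?
start = State(
    known={"distance": 120.0, "hours": 2.0},
    pending={
        "speed": Sub(("distance", "hours"), lambda d, h: d / h),
        "extra_hours": Sub(("speed",), lambda s: 300.0 / s),
    },
    target="extra_hours",
)
mid = transition(start)
answer, steps = solve(start)
print(mid.known, answer, steps)
```

After the first transition the state holds only the speed as a known condition. The original distance and duration are gone, yet the contracted question still has the same answer as the original one.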
The Markov property is the load-bearing claim. Each transition depends only on the current state, never on the path that produced it. This is not a heuristic; it is a structural property that holds whenever contraction preserves answer equivalence. If the contracted question yields the same answer as the original, no historical context is required to continue.
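One way to write the claim down, using notation that is an assumption of this note rather than taken from the paper: let $Q_0$ be the original question, $Q_i$ the question after $i$ decompose-contract transitions, and $A(Q)$ the answer to $Q$.

```latex
% Answer equivalence: each contraction leaves the answer unchanged,
% so the final atomic question answers the original one.
\[
  A(Q_{i+1}) = A(Q_i) \ \text{for all } i
  \quad\Longrightarrow\quad
  A(Q_n) = A(Q_0)
\]

% Markov property: the next state depends only on the current state.
\[
  P(Q_{i+1} \mid Q_0, Q_1, \dots, Q_i) = P(Q_{i+1} \mid Q_i)
\]
```

The first line is what licenses the second: because $Q_i$ alone determines the answer, conditioning on $Q_0, \dots, Q_{i-1}$ adds nothing that the next transition needs.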
The cognitive science motivation is direct. Humans solve complex problems by identifying and resolving self-evident subquestions, then reformulating a simplified problem state — not by maintaining detailed reasoning processes for resolved components. The reformulation IS the memory management.
Two architectural advantages emerge. First, AoT eliminates the need to maintain and compute over historical information when scaling test-time compute. Second, atomic questions can be integrated into existing test-time scaling (TTS) frameworks as a plug-in enhancement. Compared with TIMRUN (see "Can recursive subtask trees overcome context window limits?"), AoT is the language-level version of the same insight: TIMRUN prunes the KV cache to free positional embeddings; AoT contracts subproblems to free conceptual context. Both reject the assumption that more history equals better reasoning.
Inquiring lines that use this note as a source 96
This note is a source for these synthesized inquiries.
- Why do one-shot transparency studies miss the temporal reversal entirely?
- How can we measure whether assistance preserved the user's reasoning state?
- How do archive systems handle knowledge that changes with each generation?
- What design changes could make constraint inference more reliable without explicit cuing?
- Why does step-by-step reasoning fail when tool outputs get very large?
- Does irrelevant content degrade reasoning even when it fits the context window?
- Can context compression preserve what matters without introducing bias?
- What makes a background condition relevant to a specific reasoning task?
- Does psychological continuity require uninterrupted consciousness or restored context?
- Why does selective context retrieval outperform including all historical information?
- How does scene-switching prevent cross-problem interference in multi-agent reasoning?
- Does inference-time compute scaling require explicit reasoning traces or verifiable rewards?
- Can parallel thinking outperform sequential thinking under the same token budget?
- Can adaptive compute distribution across prompts replace the need for sophisticated reasoning frameworks?
- Does parallel sampling avoid failed-branch contamination more than sequential thinking?
- When does explicit reasoning actually degrade performance on a task?
- How do hierarchical architectures separate planning from retrieval differently than flat ones?
- Does irrelevant context degrade reasoning even within model context limits?
- How should iterative research tasks limit context per reasoning turn?
- Can parallel independent reasoning outperform sequential iterative refinement?
- How does compressing memory between iterations prevent overthinking?
- Can minimal adversarial triggers disrupt reasoning across multiple unrelated queries?
- Can layer-wise KV caches enable truly lossless information transfer?
- How would you redesign context integration to prevent prior associations from dominating?
- Can external summarization solve exploration problems in complex real-world environments?
- Does logical trace coherence guarantee valid mathematical reasoning?
- What makes memory trajectories topologically stable under persistent reuse?
- Does reflection destabilize reasoning in dynamic environments?
- Can parallel reasoning chains outperform longer sequential chains with the same compute?
- Why does parallel thinking outperform sequential thinking under the same token budget?
- Can any architecture fundamentally solve problems that require inherently sequential computation?
- How do insert, forget, and merge operations maintain thought coherence over time?
- Why does the same recalled information lead to different reasoning conclusions?
- Can post-thinking compute on memory reduce query-time reasoning costs?
- Can hierarchical vector routing reduce context overhead while maintaining tool coverage?
- How does the functional separation of knowledge and reasoning affect adaptation methods?
- When should a system decide to retrieve versus reason alone?
- Why do reasoning chains degenerate into undirected exploration at scale?
- How does separating decomposition from execution improve multi-step reasoning?
- What makes parallel thinking more efficient than sequential chains?
- Do reasoning systems reuse cognitive structures across unrelated topics?
- Can explicit optimal algorithms prevent reasoning model collapse at high complexity?
- Can recursive sub-calls decompose reasoning across multiple context chunks?
- What is the cost difference between filtering context versus attending to everything?
- Why do linear research pipelines lose global context across planning and generation steps?
- Why do long-horizon reasoning tasks need per-turn step limits rather than just compute budgets?
- How does trace coherence differ from valid mathematical proof in practice?
- What makes a problem fundamentally sequential versus parallelizable?
- What persistent memory architectures best support storing precomputed inferences across sessions?
- How does precomputing context reasoning reduce latency in stateful applications?
- What details do high-level trajectory abstractions lose that state-grounded recall preserves?
- Why does partial observability require interaction instead of better reasoning?
- Does unrestricted reasoning per search step degrade iterative quality over time?
- Can selective history filtering address topic drift that generation-time topic following cannot prevent?
- Can historical and batch exploration be implemented with the same algorithmic mechanism?
- How should token budgets be set to prevent runaway oscillation during inference?
- What computational cost does trajectory-bursty inference impose on per-query context requirements?
- Can instance-adaptive reasoning happen without sequential token dependencies?
- Are some problems fundamentally unsolvable by parallel inference methods?
- How can prompt intervention reduce redundant reasoning steps dynamically?
- Why does concise reasoning maintain accuracy with far fewer tokens?
- How does decoupling reasoning from tool observations improve parallel execution?
- Does algorithmic decomposition prevent planning-execution interference in reasoning?
- What distinguishes formation, evolution, and retrieval as separate memory dynamics?
- How does context budget create tradeoffs between memory and skills?
- Why does the hot-path cold-path split map onto formation and evolution?
- What makes structured memory schemas more stable than freeform text summaries?
- What role does consensus merging play in dynamic task decomposition?
- Why does a replay mechanism prevent reasoner skills from over-specializing?
- Does compressing all past memories into one representation lose irretrievable details?
- How does separating local and global context dependencies affect long-context performance?
- Can episodic raw memory outperform consolidated summaries in practice?
- Does decoupling reasoning reduce inference cost more than sequential scaling?
- Can memory and test-time compute scale together as a single axis?
- Why does per-step deliberation lose global perspective compared to dynamic discovery?
- Why does decoupling planning from execution improve over sequential interleaving?
- Can test-time scaling work through retrieval rather than reasoning?
- What gets lost when we describe memory as retrieval?
- Why does parallel sampling become more efficient when reasoning branches are memoryless?
- How do KV cache pruning and subproblem contraction both free reasoning capacity?
- What makes naive memory consolidation regress below having no memory at all?
- Can bounded workspaces prevent overthinking better than summarization alone?
- What makes answer equivalence sufficient to discard a reasoning path?
- How does decomposing tasks prevent interference between planning and execution?
- Can stateless multi-step retrieval capture evidence integration as well as dynamic memory?
- What inference-time scaling benefits emerge from reasoning before each prediction?
- How do prior errors in context history amplify future failures over time?
- Why do aggregation tasks degrade faster than multi-hop reasoning under sparsity?
- Can memory workspaces resolve contradictory evidence that stateless systems miss?
- What makes timestamped knowledge repositories better than static memory?
- Can models consolidate context into weights during idle offline phases?
- When should architects prioritize consolidation compute over larger context windows?
- Does including full context always degrade memory retrieval quality in practice?
- How do memory hierarchies and compression reduce context management demands?
- What role do cyclic fixed points play in stable reasoning?
- How does structured environment state compare to transcript replay for multi-turn reasoning?
Related concepts in this collection 3
- Can recursive subtask trees overcome context window limits?
  Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.
  TIMRUN does the same at the KV-cache layer; AoT does it at the conceptual layer; both reject history-accumulating reasoning.
- Why does parallel reasoning outperform single chain thinking?
  Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
  AoT can compose with parallel sampling once each branch is memoryless.
- Do iterative refinement methods suffer from overthinking?
  Iterative refinement approaches like Self-Refine structurally resemble token-level overthinking in o1-like models. Does revision across multiple inference calls reproduce the same accuracy degradation seen within single inferences?
  AoT is the structural fix that PDR's bounded workspace also targets: bounded state via contraction rather than via summarization.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Atom of Thoughts for Markov LLM Test-Time Scaling
- Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models
- Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- A Decomposition Perspective to Long-context Reasoning for LLMs
- A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
- Rethinking Thinking Tokens: LLMs as Improvement Operators
- Test-time Prompt Intervention
Original note title
markov-style memoryless reasoning replaces accumulated-history test-time scaling with iterative decompose-then-contract