Can extreme task decomposition enable reliable execution at million-step scale?
Can breaking tasks into maximally atomic subtasks, combined with voting-based error correction, solve the fundamental reliability problem in long-horizon tasks? The question is whether better models or better decomposition is the path to high-reliability AI systems.
A system with a 1% per-step error rate is expected to make its first error after about 100 steps of a million-step task. This makes traditional approaches to long-horizon tasks infeasible: even improving per-step accuracy from 99% to 99.99% only pushes the expected first error out to about 10,000 steps, far short of a million dependent steps. MAKER, a system built on massively decomposed agentic processes (MDAPs), takes a different approach: instead of improving per-step accuracy, decompose until each step is trivially reliable, then apply error correction.
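The compounding arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes every step fails independently with the same probability; the function names are illustrative and do not come from the MAKER work.

```python
def success_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that all `steps` steps succeed, assuming independent steps."""
    return per_step_accuracy ** steps


def expected_steps_to_first_error(per_step_accuracy: float) -> float:
    """Mean of the geometric distribution over the position of the first error."""
    return 1.0 / (1.0 - per_step_accuracy)


if __name__ == "__main__":
    for accuracy in (0.99, 0.9999):
        first_error = expected_steps_to_first_error(accuracy)
        full_run = success_probability(accuracy, 1_000_000)
        print(f"accuracy={accuracy}: first error expected near step "
              f"{first_error:,.0f}; P(million-step success)={full_run:.3e}")
```

Under this independence assumption, both accuracy levels leave the probability of an error-free million-step run vanishingly small, which is the motivation for correcting errors at each step rather than relying on raw accuracy.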
Three core components:
- Decomposition into minimal subtasks: Each agent handles a single, tiny "micro-role" rather than anthropomorphized human-level roles. By avoiding complex role assignments and instead exploiting the machine-like nature of LLMs, each subtask becomes solvable with high reliability.
- Error correction via subtask-level voting: Multiple agents independently solve the same subtask; voting identifies the correct answer. This is error correction at the finest possible granularity.
- Red-flagging to reduce correlated errors: Detects situations where voting might fail because errors are correlated across agents, and applies additional verification.
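The second and third components can be sketched together for a single subtask. This is a hedged illustration, not MAKER's implementation: `solve_with_voting`, `is_red_flagged`, the length threshold, and the plain plurality vote are all assumptions made for the example, and red-flagging is modeled here simply as discarding samples that look malformed before the vote.

```python
from collections import Counter
from typing import Callable, Optional


def is_red_flagged(answer: Optional[str], max_len: int = 20) -> bool:
    """Hypothetical filter: treat missing or suspiciously long answers as unreliable."""
    return answer is None or len(answer) > max_len


def solve_with_voting(
    subtask: str,
    agent: Callable[[str], Optional[str]],
    n_samples: int = 5,
) -> Optional[str]:
    """Sample the same subtask several times and return the plurality answer.

    Red-flagged samples are dropped before counting. Returns None when
    every sample was dropped.
    """
    answers = [agent(subtask) for _ in range(n_samples)]
    kept = [a for a in answers if not is_red_flagged(a)]
    if not kept:
        return None
    winner, _count = Counter(kept).most_common(1)[0]
    return winner


if __name__ == "__main__":
    # Stub agent standing in for independent LLM calls (assumption for the demo).
    canned = iter(["4", "5", "4", "4" * 30, "4"])

    def stub_agent(subtask: str) -> Optional[str]:
        return next(canned)

    print(solve_with_voting("2 + 2", stub_agent, n_samples=5))
```

In the demo the overly long sample is discarded and the remaining four samples are counted, so a single dissenting answer does not change the outcome. A real system would replace the stub with independent model calls and a more careful flagging rule.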
The scaling laws are formalized: probability of success and expected cost change predictably with total steps and decomposition level. Under extreme decomposition, effective scaling is feasible; without it, infeasible.
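To make the shape of such a scaling relationship concrete, the sketch below works through a simplified model. It is not the formalization from the MAKER work: it assumes independent samples, a plain majority vote over an odd number of samples per step, and the worst case in which all wrong answers agree. All function names are illustrative.

```python
import math
from typing import Optional


def majority_error(step_error: float, n_votes: int) -> float:
    """P(wrong answers form a strict majority) for an odd number of votes.

    Assumes independent samples and that all wrong answers coincide.
    """
    need = n_votes // 2 + 1
    return sum(
        math.comb(n_votes, k) * step_error**k * (1 - step_error) ** (n_votes - k)
        for k in range(need, n_votes + 1)
    )


def task_success(step_error: float, n_votes: int, steps: int) -> float:
    """P(every voted step is correct) over `steps` independent steps."""
    # log1p keeps precision when the per-step error is tiny.
    return math.exp(steps * math.log1p(-majority_error(step_error, n_votes)))


def votes_needed(
    step_error: float, steps: int, target: float = 0.95, max_votes: int = 101
) -> Optional[int]:
    """Smallest odd vote count whose whole-task success meets `target`."""
    for n in range(1, max_votes + 1, 2):
        if task_success(step_error, n, steps) >= target:
            return n
    return None


if __name__ == "__main__":
    steps = 1_000_000
    n = votes_needed(step_error=0.01, steps=steps)
    print(f"votes per step: {n}, total model calls: {n * steps:,}")
```

In this simplified model the total cost is the number of steps times the votes per step, and the votes needed grow slowly as the task gets longer, because each extra pair of votes cuts the per-step failure probability by a large factor. Correlated errors break the independence assumption, which is where red-flagging comes in.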
The most counterintuitive finding: state-of-the-art reasoning models are not required. Relatively small non-reasoning models suffice when the decomposition is extreme enough. This inverts the standard approach to hard problems — instead of smarter models, use dumber models on smaller problems.
This extends "Does separating planning from execution improve reasoning accuracy?" to an extreme: not just separating two functions, but decomposing the entire task into maximally atomic units. It also extends "Why does majority voting outperform more complex inference methods?" from answer-level voting to subtask-level voting with formalized scaling properties.
The implication for AI deployment: for tasks requiring very high reliability over many steps (organizational processes, scientific experiments, production pipelines), the path may run through decomposition and redundancy rather than through better models.
Inquiring lines that use this note as a source 43
This note is a source for these synthesized inquiries.
- Do integrated and decoupled architectures trade off intervention accuracy for efficiency differently?
- How do larger models maintain more parallel tasks than smaller models?
- What makes some tasks bounded enough for reliable RL?
- What design principles prevent error cascades in multi-step evaluation systems?
- How do autonomous pipelines identify and fix silent bugs in data pipelines?
- Why do error avalanches accelerate in self-training loops without verification?
- How does error propagation limit transformer performance on complex tasks?
- Can adaptive prompt-difficulty allocation compound with architectural efficiency improvements?
- Why do monolithic systems resist autonomous optimization attempts?
- How does error avalanching differ from entropy collapse as a failure mode?
- Can evolutionary approaches avoid the overthinking failure mode of iterative refinement?
- Does population-based evolution transcend the parallel versus sequential compute tradeoff?
- How do correlated errors across agents threaten voting-based error correction systems?
- Can subtask-level voting replace sequential revision for improving long-horizon task accuracy?
- What decomposition level minimizes both error rate and computational cost in practice?
- Does parallel sampling avoid failed-branch contamination more than sequential thinking?
- How do surface statistical regularities enable correct outputs while degrading robustness?
- What three independent failure points bottleneck traditional function calling systems?
- What architectural changes would accelerate the cleanup phase?
- How should compute budgets be allocated across multi-stage RAG architectures?
- Can structured output formats reduce instruction following degradation?
- How should monitoring intensity change based on task criticality?
- Does test-time compute scaling work for agentic deep research tasks?
- Can task decomposition into microagents with voting scale to million-step problems?
- How does task decomposition prevent bias from spreading across therapeutic AI pipelines?
- Can voting work at every level of task decomposition, not just whole problems?
- Can any architecture fundamentally solve problems that require inherently sequential computation?
- How does task structure determine optimal test-time compute allocation?
- Could deploying GPT-4 for everyone require 100 million specialized chips?
- How do traditional quality assurance methods fail for mutable AI outputs?
- How do mode-specific failures differ between completion and agent benchmarks?
- Can model training address failures that really originate in harness gaps?
- What role does consensus merging play in dynamic task decomposition?
- Why does decoupling planning from execution improve over sequential interleaving?
- Can fixed pipelines eliminate planning-time attacks by sacrificing adaptive coordination?
- How does decomposing tasks prevent interference between planning and execution?
- Can we predict which tasks will decompose into modular subnetworks?
- How does error accumulation in workflows scale across multiple model calls?
- What degradation patterns emerge as relay length increases in delegated tasks?
- Can test-time compute fully replace scaling model parameters on hard problems?
- What four domain properties make self-healing failure loops actually work?
- Can intentional data-mixture design replace model scaling for rare task learning?
- Why does externalized state beat parameter scaling for agent reliability?
Related concepts in this collection 7
- Does separating planning from execution improve reasoning accuracy?
  Can modular LM architectures that split problem decomposition from solution execution outperform monolithic models? This explores whether decoupling these cognitive operations reduces interference and boosts performance.
  Relation: MAKER takes this principle to its extreme: maximally atomic decomposition.
- Why does majority voting outperform more complex inference methods?
  Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models?
  Relation: MAKER applies voting at the subtask level with formalized scaling laws.
- Do models fail worse when their own errors fill the context?
  As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?
  Relation: MAKER addresses this by isolating each step, so no error context propagates.
- Are reasoning model collapses really failures of reasoning?
  Explores whether language models hit a fundamental reasoning ceiling or whether text-only evaluation masks execution limitations. Examines how tool access might reveal hidden reasoning capabilities.
  Relation: consistent with this note: execution can be fixed by decomposition without improving reasoning.
- Can recursive subtask trees overcome context window limits?
  Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.
  Relation: MAKER decomposes externally via multiple agents; TIM decomposes internally via recursive subtask trees within a single model, eliminating the coordination overhead while preserving the decomposition principle.
- When does adding more agents actually help systems?
  Multi-agent systems often fail in practice, but the reasons remain unclear. This research investigates whether coordination overhead, task properties, or system architecture determine when agents improve or degrade performance.
  Relation: quantifies when MAKER's extreme decomposition helps vs. hurts: token budget fragmentation under multi-agent coordination trades off against tool complexity, and centralized coordination contains error amplification to 4.4x vs. 17.2x for independent agents.
- Can multi-agent teams automatically remove their weakest members?
  Explores whether agents can score each other's contributions during problem-solving and use those scores to deactivate underperforming teammates in real time, improving overall team efficiency.
  Relation: a contrasting approach. MAKER uses static decomposition with redundancy-based error correction; DyLAN uses dynamic pruning with contribution-based scoring. MAKER optimizes at design time (decomposition level), DyLAN at runtime (agent selection).
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Solving a Million-Step LLM Task with Zero Errors
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
- How Many Instructions Can LLMs Follow at Once?
- OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
- Reasoning Can Hurt the Inductive Abilities of Large Language Models
- Rethinking Thinking Tokens: LLMs as Improvement Operators
- Faith and Fate: Limits of Transformers on Compositionality
- LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Original note title
extreme task decomposition into microagents with voting enables error-free execution at million-step scale