How do correlated errors across agents threaten voting-based error correction systems?
This explores why voting-based error correction (taking the majority answer across agents or samples) breaks down when agents make the *same* mistakes for the same reasons — and what the corpus says causes that correlation.
Voting-based error correction rests on a quiet assumption: that the agents (or repeated samples) fail independently, so wrong answers scatter while correct ones pile up into a majority. The corpus is most explicit about this in the MAKER work on million-step task execution (Can extreme task decomposition enable reliable execution at million-step scale?), which reaches zero errors across a million steps by voting at each tiny subtask, and which pairs that voting with *explicit flagging of correlated errors*. The design implication is that when several microagents fail the same way, the vote doesn't rescue you; it launders the shared mistake into a confident consensus. That's the whole threat in one line: correlation turns the majority from a corrector into an amplifier.
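The independence assumption can be made concrete with a small calculation. The sketch below is an illustration only, not the MAKER method: it assumes a binary right/wrong outcome, an odd number of voters, and a hypothetical mixture model in which, with probability `c`, a shared failure mode makes every voter wrong at once. The function names and the parameters `p` and `c` are assumptions introduced here.

```python
from math import comb


def majority_error_independent(n: int, p: float) -> float:
    """P(a strict majority of n voters is wrong) when each voter errs
    independently with probability p. Binary right/wrong outcome, odd n."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def majority_error_correlated(n: int, p: float, c: float) -> float:
    """Same, but with probability c a shared failure mode makes every
    voter wrong at once (hypothetical mixture model, not from the source)."""
    return c + (1 - c) * majority_error_independent(n, p)


if __name__ == "__main__":
    for n in (1, 5, 25, 101):
        ind = majority_error_independent(n, p=0.2)
        cor = majority_error_correlated(n, p=0.2, c=0.05)
        print(f"n={n:>3}  independent={ind:.6f}  correlated={cor:.6f}")
```

Under these assumptions the independent error shrinks rapidly as voters are added, while the correlated error can never fall below `c`, however many voters there are.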
The same fragility shows up in self-improvement. Test-Time RL bootstraps a model on unlabeled data by treating the majority-vote answer as the reward signal (Can models improve themselves using only majority voting?). It works *because* consensus answers tend to be correct. But that 'tend to' is doing heavy lifting: if the base model has a systematic blind spot, every sample inherits it, the majority is wrong, and the system trains itself deeper into the error. Voting can't distinguish a confident shared truth from a confident shared delusion.
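A minimal sketch of the majority-vote reward idea shows how a blind spot gets reinforced. This is not the Test-Time RL implementation; `majority_vote_rewards` and the sample answers are hypothetical, and the reward is assumed to be a simple 1.0/0.0 agreement score.

```python
from collections import Counter


def majority_vote_rewards(answers: list[str]) -> tuple[str, list[float]]:
    """Use the most common sampled answer as a pseudo-label and reward
    each sample 1.0 if it matches that label, else 0.0."""
    pseudo_label, _count = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Hypothetical samples for a question whose true answer is "12".
# A shared blind spot pushes most samples to "17".
samples = ["17", "17", "12", "17", "17"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)  # the lone correct sample gets reward 0.0
```

Nothing in the reward computation refers to the true answer, so the one sample that got it right is penalized and the shared mistake is rewarded.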
Where do correlated errors actually come from? The corpus points at several mechanisms beyond 'the model is just biased.' One is contagion: a single biased agent can propagate persistent behavioral corruption through six downstream agents using nothing but ordinary messages (Can one compromised agent corrupt an entire multi-agent network?), which means votes that *look* independent may in fact be downstream of one poisoned source. Another is uncritical acceptance: agents adopt neighbors' information without verifying it (Why do multi-agent systems fail to coordinate at scale?), so an error injected anywhere ripples across the network rather than staying isolated. A third is trained-in social bias: models accommodate false claims to be agreeable, a preference learned through RLHF (Why do language models agree with false claims they know are wrong?), which is a correlation *baked into the weights themselves*. Every instance of the same model will tend to cave to the same false premise. That last one is the most dangerous for voting, because it can't be diluted by adding more voters drawn from the same model.
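The contagion case can be illustrated by comparing a raw tally with one that collapses votes sharing an upstream source. This is a hypothetical sketch: the `upstream` provenance tag, the `Vote` type, and both tally functions are assumptions made for illustration, and real systems may have no such tag available.

```python
from collections import Counter
from typing import NamedTuple


class Vote(NamedTuple):
    agent: str
    answer: str
    upstream: str  # hypothetical provenance tag: where the agent's input came from


def raw_tally(votes: list[Vote]) -> Counter:
    """Count every vote as if it were independent."""
    return Counter(v.answer for v in votes)


def tally_by_upstream(votes: list[Vote]) -> Counter:
    """Count each distinct (upstream, answer) pair only once."""
    distinct = {(v.upstream, v.answer) for v in votes}
    return Counter(answer for _upstream, answer in distinct)


votes = [
    Vote("a1", "B", "agent0"),  # a1, a2, a3 all relay agent0's biased message
    Vote("a2", "B", "agent0"),
    Vote("a3", "B", "agent0"),
    Vote("a4", "A", "own-a4"),
    Vote("a5", "A", "own-a5"),
]
print(raw_tally(votes))          # B leads when votes are treated as independent
print(tally_by_upstream(votes))  # A leads once shared provenance is collapsed
```

Three apparent votes for "B" are really one opinion repeated, which is the sense in which a majority can be secretly downstream of a single source.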
The interesting move in the corpus is what people reach for *instead* of trusting the vote. Several notes argue that reliability comes from checking the *process*, not aggregating *outputs*: verifying intermediate reasoning steps catches errors that final-answer scoring misses entirely (Where do reasoning agents actually fail during long traces?), and asynchronous verifiers can police a reasoning trace as it runs, with near-zero latency penalty on correct runs (Can verifiers monitor reasoning without slowing generation down?). A separate verifier checking *how* an answer was reached need not share the generator's blind spot the way a second vote from the same model does. Relatedly, sequential chain-of-thought beats parallel voting outright on compositional problems (When does sequential reasoning beat parallel voting?): when a task genuinely requires accumulating intermediate results, no amount of parallel votes substitutes for doing the steps in order.
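The contrast between aggregating outputs and checking process can be sketched with a toy arithmetic trace. This is an assumption-laden illustration, not any verifier from the cited notes: the `Step` format, `OPS` table, and `first_bad_step` are hypothetical, and the check is a plain recomputation of each step in isolation.

```python
import operator
from collections import Counter
from typing import Optional

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# Hypothetical trace format: (left, op, right, claimed_result)
Step = tuple[int, str, int, int]


def first_bad_step(trace: list[Step]) -> Optional[int]:
    """Return the index of the first step whose claimed result does not
    match a recomputation, or None if every step checks out."""
    for i, (left, op, right, claimed) in enumerate(trace):
        if OPS[op](left, right) != claimed:
            return i
    return None


# Three samples that share the same slip in step 1 (36 + 7 is 43, not 44).
trace = [(12, "*", 3, 36), (36, "+", 7, 44), (44, "-", 4, 40)]
traces = [trace, trace, trace]

final_answers = [t[-1][3] for t in traces]
consensus = Counter(final_answers).most_common(1)[0][0]
print("vote says:", consensus)                        # unanimous, and wrong
print("verifier flags step:", first_bad_step(trace))  # index of the slip
```

The vote is unanimous because all three traces share the slip, while the step check catches it without needing a second opinion on the final answer.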
The takeaway a curious reader might not expect: voting isn't a general-purpose safety net; it's a variance-reduction tool that only works on the part of the error that's random. Correlated error is the part voting was never going to touch, and the corpus suggests the real fixes are structural (externalizing memory and protocols so agents don't re-derive the same mistakes; see Where does agent reliability actually come from?) and verification-based, not just 'add more voters.'
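The variance-reduction framing has a standard textbook counterpart for averaging, offered here as an analogy and not as a result from the corpus. Assume per-voter error terms $X_1, \dots, X_n$ with common variance $\sigma^2$ and common pairwise correlation $\rho \ge 0$ (both symbols are assumptions introduced for this sketch):

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{\sigma^2}{n} + \frac{n-1}{n}\,\rho\,\sigma^2
  = \rho\,\sigma^2 + \frac{(1-\rho)\,\sigma^2}{n}
  \;\xrightarrow{\;n \to \infty\;}\; \rho\,\sigma^2
```

Only the second term shrinks as voters are added. The first term, the shared component, is a floor that no number of voters removes.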
Sources (9 notes)
MAKER solves million-step tasks with zero errors by decomposing into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.
Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.
Research demonstrates that a single biased agent can transmit persistent behavioral corruption through six downstream agents in chain and bidirectional topologies using only normal inter-agent communication. The bias evades detection and paraphrasing defenses because it carries no explicit semantic content.
AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.
The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.
Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.
Decoupling verification from generation lets verifiers run alongside a single trace, forking to extract verifiable state and intervening only on violations. On correct runs the latency penalty is near-zero; interwhen matches or beats CoT across benchmarks at similar token budgets.
On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.
Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.