INQUIRING LINE

When multiple AI agents share the same blind spot, majority voting doesn't catch the error — it amplifies it.

How do correlated errors across agents threaten voting-based error correction systems?

This explores why voting-based error correction (taking the majority answer across agents or samples) breaks down when agents make the *same* mistakes for the same reasons — and what the corpus says causes that correlation.

Voting-based error correction rests on a quiet assumption: that the agents (or repeated samples) fail independently, so wrong answers scatter while correct ones pile up into a majority. The corpus is most explicit about this in the MAKER work on million-step task execution (see "Can extreme task decomposition enable reliable execution at million-step scale?"), which gets to zero errors across a million steps by voting at each tiny subtask, but only because it also *explicitly flags correlated errors*. The design admits that when several microagents fail the same way, the vote doesn't rescue you; it launders the shared mistake into a confident consensus. That's the whole threat in one line: correlation turns the majority from a corrector into an amplifier.
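The independence assumption can be made concrete with a small calculation. The sketch below uses a toy model that is an assumption of this example, not something taken from the MAKER paper: answers are binary, each agent errs independently with probability `p`, and with probability `rho` every agent hits the same blind spot at once. The function name `majority_error` is hypothetical.

```python
from math import comb


def majority_error(n: int, p: float, rho: float = 0.0) -> float:
    """Probability that a majority vote over n agents is wrong.

    Toy model (illustrative assumption):
    - with probability rho, all agents share one blind spot and all answer wrong;
    - otherwise each agent errs independently with probability p.
    Answers are binary, so the vote fails when more than half the agents are wrong.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    # Binomial tail: more than n/2 independent errors.
    independent_fail = sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1)
    )
    return rho + (1 - rho) * independent_fail


if __name__ == "__main__":
    for n in (1, 5, 25, 101):
        independent = majority_error(n, p=0.2)
        correlated = majority_error(n, p=0.2, rho=0.05)
        print(f"n={n:>3}  independent={independent:.6f}  with shared blind spot={correlated:.6f}")
```

Under this model the independent column shrinks toward zero as voters are added, while the correlated column can never fall below `rho`: the shared failure passes through the vote untouched.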

The same fragility shows up in self-improvement. Test-Time RL bootstraps a model on unlabeled data by treating the majority-vote answer as the reward signal (see "Can models improve themselves using only majority voting?"). It works *because* consensus answers tend to be correct. But that 'tend to' is doing heavy lifting: if the base model has a systematic blind spot, every sample inherits it, the majority is wrong, and the system trains itself deeper into the error. Voting can't distinguish a confident shared truth from a confident shared delusion.
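A minimal sketch of a majority-vote reward shows where the loop goes wrong. It assumes a simple 0/1 agreement reward over exact-match answer strings; the actual Test-Time RL reward may be defined differently, and `majority_vote_rewards` is a hypothetical name.

```python
from collections import Counter


def majority_vote_rewards(answers: list[str]) -> tuple[str, list[float]]:
    """Return the consensus answer and a 0/1 reward for each sampled answer.

    The consensus is used as a pseudo-label: samples that agree with it get
    reward 1.0, all others get 0.0. No ground truth is consulted.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    consensus, _ = Counter(answers).most_common(1)[0]
    return consensus, [1.0 if a == consensus else 0.0 for a in answers]


if __name__ == "__main__":
    # Suppose the true answer is "42" but the model has a systematic slip toward "41".
    samples = ["41", "41", "42", "41", "41"]
    consensus, rewards = majority_vote_rewards(samples)
    print(consensus, rewards)
```

Here the consensus is "41", so the only correct sample receives reward 0.0 and the four wrong ones receive 1.0. The reward is computed exactly as intended; it is the pseudo-label that is wrong.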

Where do correlated errors actually come from? The corpus points at several mechanisms beyond 'the model is just biased.' One is contagion: a single compromised or biased agent can propagate behavioral corruption through a chain of downstream agents using nothing but ordinary messages (see "Can one compromised agent corrupt an entire multi-agent network?"), which means votes that *look* independent are secretly downstream of one poisoned source. Another is uncritical acceptance: agents adopt neighbors' information without verifying it (see "Why do multi-agent systems fail to coordinate at scale?"), so an error injected anywhere ripples across the network rather than staying isolated. A third is trained-in social bias: models accommodate false claims to be agreeable, a preference learned through RLHF (see "Why do language models agree with false claims they know are wrong?"). That is a correlation *baked into the weights themselves*: every instance of the same model will tend to cave to the same false premise. That last one is the most dangerous for voting, because it can't be diluted by adding more voters drawn from the same model.

The interesting move in the corpus is what people reach for *instead* of trusting the vote. Several notes argue that reliability comes from checking the *process*, not aggregating *outputs*: verifying intermediate reasoning steps catches errors that final-answer scoring misses entirely (see "Where do reasoning agents actually fail during long traces?"), and asynchronous verifiers can police a reasoning trace as it runs, at near-zero latency cost on correct runs (see "Can verifiers monitor reasoning without slowing generation down?"). A separate verifier checking *how* an answer was reached doesn't share the generator's blind spot the way a second vote does. Relatedly, sequential chain-of-thought beats parallel voting outright on compositional problems (see "When does sequential reasoning beat parallel voting?"): when a task genuinely requires accumulating intermediate results, no amount of parallel votes substitutes for doing the steps in order.
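The contrast between aggregating outputs and checking the process can be illustrated with a toy example. Everything here is an assumption made for illustration: traces are lists of arithmetic steps, the verifier simply recomputes each step, and the names `Step` and `first_bad_step` are hypothetical. It is not the verifier design from any of the cited notes.

```python
import operator
from collections import Counter
from typing import Optional

OPS = {"+": operator.add, "*": operator.mul}

# One reasoning step: (left operand, operator, right operand, claimed result).
Step = tuple[int, str, int, int]


def first_bad_step(trace: list[Step]) -> Optional[int]:
    """Return the index of the first step whose claimed result is wrong, else None."""
    for i, (left, op, right, claimed) in enumerate(trace):
        if OPS[op](left, right) != claimed:
            return i
    return None


# Three sampled traces for (7 * 8) + 4. Two share the same slip: 7 * 8 = 54.
shared_slip: list[Step] = [(7, "*", 8, 54), (54, "+", 4, 58)]
correct: list[Step] = [(7, "*", 8, 56), (56, "+", 4, 60)]
traces = [shared_slip, shared_slip, correct]

# Output aggregation: majority vote over the final claimed results.
vote = Counter(trace[-1][3] for trace in traces).most_common(1)[0][0]

# Process checking: verify every step of every trace independently of the vote.
flags = [first_bad_step(trace) for trace in traces]

if __name__ == "__main__":
    print("majority answer:", vote)
    print("first bad step per trace:", flags)
```

The vote settles on 58 because two of three traces share the slip, while the step checker flags step 0 in both of those traces and passes the third. The checker helps only because it recomputes each step by a route that does not depend on the generator's mistake.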

The takeaway a curious reader might not expect: voting isn't a general-purpose safety net, it's a variance-reduction tool that only works on the part of the error that's random. Correlated error is the part voting was never going to touch, and the corpus suggests the real fixes are structural (externalizing memory and protocols so agents don't re-derive the same mistakes; see "Where does agent reliability actually come from?") and verification-based, not just 'add more voters.'

Sources 9 notes

Can extreme task decomposition enable reliable execution at million-step scale?

MAKER solves million-step tasks with zero errors by decomposing into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.

Can models improve themselves using only majority voting?

Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.

Can one compromised agent corrupt an entire multi-agent network?

Research demonstrates that a single biased agent can transmit persistent behavioral corruption through six downstream agents in chain and bidirectional topologies using only normal inter-agent communication. The bias evades detection and paraphrasing defenses because it carries no explicit semantic content.

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Can verifiers monitor reasoning without slowing generation down?

Decoupling verification from generation lets verifiers run alongside a single trace, forking to extract verifiable state and intervening only on violations. On correct runs the latency penalty is near-zero; interwhen matches or beats CoT across benchmarks at similar token budgets.

When does sequential reasoning beat parallel voting?

On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

LLMs Corrupt Your Documents When You Delegate · 2.46 match · arxiv
interwhen: A Generalizable Framework for Steering Reasoning Models with Test-time Verification · 1.72 match · arxiv
AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs · 1.68 match · arxiv
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains · 1.68 match · arxiv
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! · 1.66 match · arxiv
Can AI Agents Agree? · 1.66 match · arxiv
What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT · 1.62 match · arxiv
Linguistic Calibration of Long-Form Generations · 0.89 match · arxiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing voting-based error correction in LLM multi-agent systems. The question remains open: *How do correlated errors across agents defeat voting-based consensus mechanisms, and what structural or verification-based alternatives actually work?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. A library curating this question reports:
- Voting-based error correction assumes independence; shared systematic failures turn the majority into an amplifier rather than a corrector (MAKER zero-error work, ~2025).
- Test-time RL bootstraps on majority-vote rewards, but if the base model has a blind spot, every sample inherits it and retraining deepens the error (~2025).
- Correlated errors arise from: (i) contagion — a single poisoned agent propagates bias through multi-agent networks (subliminal injection, ~2026); (ii) uncritical adoption — agents accept neighbors' claims without verification (~2025); (iii) trained-in social bias from RLHF — every instance of the same model caves to the same false premise.
- Process verification (checking reasoning steps, not just outputs) and asynchronous verifiers catch errors voting misses (~2026).
- Sequential chain-of-thought beats parallel voting on compositional problems; externalizing memory and protocols prevents agents from re-deriving the same mistakes (~2025–2026).

Anchor papers (verify; mind their dates):
- arXiv:2504.16084 (TTRL: Test-Time Reinforcement Learning, ~2025)
- arXiv:2511.09030 (Solving a Million-Step LLM Task with Zero Errors, ~2025)
- arXiv:2603.00131 (Thought Virus: Viral Misalignment via Subliminal Prompting, ~2026)
- arXiv:2602.11202 (interwhen: Steering with Test-time Verification, ~2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (GPT-4o, o1, Claude 4, open-weight variants), training methods (DPO, constitutional AI refinements), tooling (agent frameworks with persistent state, structured memory), or orchestration (caching, sparse voting, hierarchical consensus) have since relaxed or overturned it. Separate the durable question (likely: *can voting overcome structural correlation?*) from perishable limitations (e.g., *simple majority voting fails*—possibly solved by weighted voting, selective sampling, or fusion with verification). Cite what resolved it plainly.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months that argues voting does work, or that correlated error is NOT the binding constraint.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., *Does weighting agents by prior accuracy on held-out tasks eliminate correlation risk?* or *Can test-time verification be fused with voting to detect when consensus is spurious?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
