What role does search capacity play in making debate more accurate?
This explores whether the ability to search for and verify external evidence is what actually makes multi-agent debate more accurate — as opposed to debate improving accuracy on its own.
This reads the question as asking what role external search and evidence-gathering play in debate accuracy, and the corpus has a sharp answer: search capacity isn't a nice-to-have add-on, it's the load-bearing ingredient. Multi-agent debate reliably improves reasoning only on tasks where claims can be checked against external evidence; in contested domains without that checking, debate reverses into a false-consensus generator where the most persuasive framing wins over the most correct one (see: When does debate actually improve reasoning accuracy?). So the question almost inverts: it's less that search makes debate accurate, and more that without search, debate has no anchor and amplifies errors.
Why does debate fail without that anchor? Because LLM debates settle disagreements differently than people do. Human debates resolve through argument quality, reputation, track record, and social authority; AI debates resolve through chain-of-thought probability ranking, with no access to the social world where expertise is built (see: How do LLM debates differ from human expert consensus?). A model processing only text can't tell an expert's argument from a confidently stated common assumption (see: Can language models distinguish expert arguments from common assumptions?). Search capacity substitutes an external ground truth for the social grounding the model lacks: it's the verification channel that decides which of two fluent arguments is actually right.
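To make the contrast concrete, here is a minimal Python sketch of adjudicating between two fluent arguments with and without a verification channel. Everything in it is hypothetical: the `Argument` fields, the `persuasiveness` score, and the `evidence` dictionary standing in for search results are illustrative assumptions, not a description of any real debate system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Argument:
    claim: str             # the position being argued
    persuasiveness: float  # hypothetical fluency/confidence score in [0, 1]


def judge_without_search(arguments: List[Argument]) -> Argument:
    """With no external check, the most persuasive framing wins."""
    return max(arguments, key=lambda a: a.persuasiveness)


def judge_with_search(
    arguments: List[Argument], evidence: Dict[str, bool]
) -> Optional[Argument]:
    """Prefer claims the stand-in evidence supports; persuasiveness only breaks ties.

    `evidence` maps a claim to True/False and is a toy substitute for a
    search-and-verify step. Returns None if no claim is supported.
    """
    supported = [a for a in arguments if evidence.get(a.claim, False)]
    if not supported:
        return None
    return max(supported, key=lambda a: a.persuasiveness)


# Toy example: the wrong claim is argued more fluently than the right one.
debate = [
    Argument(claim="common assumption", persuasiveness=0.9),
    Argument(claim="expert finding", persuasiveness=0.6),
]
evidence = {"common assumption": False, "expert finding": True}

unanchored = judge_without_search(debate)      # picks the more persuasive claim
anchored = judge_with_search(debate, evidence)  # picks the supported claim
```

In this toy setup the unanchored judge selects the confidently stated common assumption, while the anchored judge selects the claim the evidence supports. The sketch also returns `None` when nothing is supported, which reflects one possible design choice: declining to declare a winner rather than falling back on fluency.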
Interestingly, the corpus suggests search behaves like a tunable resource, not a binary. Agentic deep research shows a test-time scaling law: answer quality climbs with search iterations along a monotonic but diminishing-returns curve, matching the shape of reasoning-token scaling, which means models can trade reasoning budget against search budget to optimize accuracy (see: Does search budget scale like reasoning tokens for answer quality?). But more search isn't free: long-horizon search degrades when each turn over-reasons and burns the context needed to absorb the next round of evidence, so capping per-turn reasoning (not just total time) preserves search quality across iterations (see: Does limiting reasoning per turn improve multi-turn search quality?).
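The interaction between a per-turn reasoning cap and the number of usable search turns can be illustrated with a small Python sketch. The token figures, the `turns_that_fit` helper, and the logarithmic `toy_quality` curve are all assumptions chosen for illustration; the source notes describe the shape of the curve (monotonic, diminishing returns) but not its functional form.

```python
import math
from typing import Optional


def turns_that_fit(
    context_window: int,
    evidence_per_turn: int,
    reasoning_per_turn: int,
    per_turn_cap: Optional[int] = None,
) -> int:
    """Count search turns that fit in the context window (all sizes in tokens)."""
    if per_turn_cap is not None:
        # The cap limits reasoning within each turn, not total run time.
        reasoning_per_turn = min(reasoning_per_turn, per_turn_cap)
    cost = reasoning_per_turn + evidence_per_turn
    if cost <= 0:
        raise ValueError("each turn must consume at least one token")
    return context_window // cost


def toy_quality(search_turns: int) -> float:
    """Assumed concave curve: rises with every turn, but by less each time."""
    return math.log1p(search_turns)


# Hypothetical numbers: a 32k-token window, 2k tokens of evidence per turn,
# and an agent that would spend 6k tokens reasoning per turn if unrestricted.
uncapped = turns_that_fit(32_000, evidence_per_turn=2_000, reasoning_per_turn=6_000)
capped = turns_that_fit(
    32_000, evidence_per_turn=2_000, reasoning_per_turn=6_000, per_turn_cap=2_000
)
```

Under these made-up numbers, capping reasoning at 2k tokens per turn doubles the number of search rounds that fit in the window. The toy model deliberately ignores any value the extra per-turn reasoning might add, so it shows only the context-consumption side of the trade-off.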
There's also a cautionary thread about confusing the appearance of search with its substance. When humans judge AI answers, citation count works as a decoupled trust heuristic: in an analysis of 24,000 Search Arena interactions, irrelevant citations boost user preference (β=0.273) nearly as much as relevant ones (β=0.285) (see: Do users trust citations more when there are simply more of them?). That's the failure mode lurking behind "search makes debate accurate": searching and citing more can manufacture the felt confidence of verification without delivering the substance. The accuracy gain comes specifically from evidence that actually adjudicates the disagreement, not from the volume of retrieval.
The quiet payoff: if you can't give debaters real search, you can sometimes engineer verification structurally instead. In a leader-follower protocol where one agent proposes interpretations and two followers challenge them, with roles rotating, even a small model (Mistral-7B) reaches 76.7% accuracy on ambiguity detection. Role rotation and forced consensus create stronger internal verification than pairwise debate, partly closing the gap that external search would otherwise have to fill (see: Can structured debate roles help small models detect ambiguity?).
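A rough Python sketch of how a rotating leader-follower round might be structured is given below. The source notes do not spell out the protocol's details, so the `Agent` type, the `make_agent` stub, the rule that an interpretation survives only if every follower accepts it, and the "more than one surviving interpretation means ambiguous" criterion are all assumptions made for illustration. Real agents would be model calls rather than fixed sets.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Set


@dataclass
class Agent:
    name: str
    propose: Callable[[str], Set[str]]              # question -> interpretations
    challenge: Callable[[str, Set[str]], Set[str]]  # returns the ones it accepts


def make_agent(name: str, proposals: Set[str], rejects: Set[str]) -> Agent:
    """Hypothetical rule-based stub standing in for an LLM-backed agent."""
    fixed: FrozenSet[str] = frozenset(proposals)
    return Agent(
        name=name,
        propose=lambda question: set(fixed),
        challenge=lambda question, candidates: candidates - rejects,
    )


def run_protocol(question: str, agents: List[Agent]) -> Set[str]:
    """Rotate the leader role; keep interpretations all followers accept."""
    agreed: Set[str] = set()
    for i, leader in enumerate(agents):
        followers = agents[:i] + agents[i + 1:]
        surviving = leader.propose(question)
        for follower in followers:
            # Forced consensus: an interpretation must pass every follower.
            surviving = surviving & follower.challenge(question, surviving)
        agreed |= surviving
    return agreed


def is_ambiguous(question: str, agents: List[Agent]) -> bool:
    """Assumed criterion: ambiguous if more than one interpretation survives."""
    return len(run_protocol(question, agents)) > 1


agents = [
    make_agent("A", {"bird", "crouch"}, rejects={"dance move"}),
    make_agent("B", {"bird"}, rejects={"dance move"}),
    make_agent("C", {"crouch", "dance move"}, rejects=set()),
]
readings = run_protocol("I saw her duck", agents)
```

In this toy run, two readings survive challenge from every follower, so the question is flagged as ambiguous, while the reading proposed by only one agent and rejected by the others is filtered out. Because each agent takes the leader role once, no single agent's framing goes unchallenged.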
Sources (7 notes)
Multi-agent debate boosts accuracy on verifiable tasks like math and logic, but reverses in contested domains without external evidence checking. Without verification, persuasive framing wins over correctness, making debate a false-consensus generator rather than an accuracy amplifier.
Multi-agent LLM debates operate through chain-of-thought probability ranking, which is fundamentally different from human debates, which are settled by argument quality, social authority, cultural context, and interpersonal trust. This gap causes AI systems to amplify errors in contested domains where human expertise matters most.
LLMs lose the social context that gives expert claims their force—reputation, track record, and standing—because they process only text, not the social world where expertise is built and evaluated.
Agentic deep research shows monotonic but diminishing-returns curves for search iterations, matching reasoning-token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.
Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.
Analysis of 24,000 Search Arena interactions shows irrelevant citations boost user preference (β=0.273) nearly as much as relevant citations (β=0.285), indicating citation count functions as a decoupled trust heuristic.
Mistral-7B achieved 76.7% accuracy in ambiguity detection through a protocol where a leader proposes interpretations and two followers challenge them with rotating roles. Role rotation and consensus forcing prevent persuasive framing failures and create stronger verification than pairwise debate.