INQUIRING LINE

What makes consensus games work without retraining the base model?

This explores how methods that lean on agreement — majority voting across a model's own samples, or convergence among multiple agents — can improve behavior at inference time without ever updating the base model's weights, and what conditions make that trick hold.


This explores how 'consensus games' (setups where agreement among many samples or agents stands in for a real reward signal) manage to steer a model without retraining it, and where that substitution quietly breaks. The clearest version is Test-Time RL: instead of ground-truth labels or a trained reward model, you sample an answer many times, treat the majority answer as if it were correct, and reinforce toward it (see "Can models improve themselves using only majority voting?"). This works because, for many tasks, the consensus answer really is the right one, so the model bootstraps off its own distribution and test-time compute becomes a free reward channel. No new supervision enters the system; the model just amplifies what it already half-knew.
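The majority-vote pseudo-reward described above can be sketched in a few lines. This is a minimal illustration, not the Test-Time RL implementation: the function name `majority_vote_rewards` and the exact-match comparison of answer strings are assumptions made for the example, and the reinforcement step that would consume these rewards is omitted.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (consensus, rewards) for a list of sampled final answers.

    Hypothetical helper: a sample gets reward 1.0 if it matches the most
    common answer and 0.0 otherwise. No ground-truth label is consulted.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) returns the answer with the highest count;
    # ties go to the answer that was seen first.
    consensus, _count = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == consensus else 0.0 for a in answers]
    return consensus, rewards


# Five samples for one prompt; three of them agree.
samples = ["42", "41", "42", "42", "7"]
consensus, rewards = majority_vote_rewards(samples)
```

Note that the reward never checks whether "42" is actually right; it only checks agreement, which is the substitution the rest of this piece examines.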

But the mechanism only works inside a narrow regime, and naming that regime is the real answer to the question. Consensus is a proxy for truth only when the model is already more right than wrong: above roughly 50% prior accuracy the majority pulls toward the correct answer, but below it the same loop confidently amplifies the wrong one (see "When does majority-vote reward actually help test-time learning?"). So what 'makes it work' isn't the consensus machinery itself; it's a favorable starting accuracy that the consensus then sharpens. This is the same structural lesson as the self-improvement literature: pure self-improvement is circular, and every method that actually works smuggles in an external anchor, such as a past model version, a judge, a user correction, or a tool result (see "Can models reliably improve themselves without external feedback?"). Majority vote's hidden anchor is the model's own pre-existing competence; when that anchor is absent, the game collapses.
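A simplified calculation shows why the threshold sits near one half. The sketch assumes that samples are independent and that each answer is simply right or wrong, which real model samples are not; `majority_correct_prob` is a hypothetical helper name. Under those assumptions the vote sharpens whichever side of one half the per-sample accuracy starts on.

```python
from math import comb


def majority_correct_prob(p, n):
    """Probability that a strict majority of n samples is correct.

    Assumes independent samples, each correct with probability p, and a
    two-way (right/wrong) outcome. n must be odd so ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    # Sum the binomial tail: more than half of the n samples are correct.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Per-sample accuracy just below, at, and just above one half.
for p in (0.4, 0.5, 0.6):
    print(p, round(majority_correct_prob(p, 15), 3))
```

With many possible answers the wrong votes may split across alternatives, so the one-half line is best read as a rule of thumb for the binary case rather than an exact boundary.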

There's a whole family of weight-frozen improvement methods that win by the same logic: moving the learning out of the parameters and into something external. Reflexion stores verbal self-critiques in episodic memory after each try, learning across attempts with the weights untouched; crucially, it relies on an *unambiguous* binary success/failure signal to keep the reflections honest rather than rationalized (see "Can agents learn from failure without updating their weights?"). AgentFly pushes this further, formalizing the entire learning loop as memory operations, with case, subtask, and tool memory carrying the credit assignment that gradients normally would (see "Can agents learn continuously from experience without updating weights?"). The common thread with consensus games: improvement without retraining always requires a trustworthy signal from *outside* the generation process, whether a verifier, an environment, or a population of samples reliable enough to vote.
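The memory-instead-of-gradients pattern can be sketched as a retry loop. This is a rough illustration in the spirit of the description above, not Reflexion's or AgentFly's actual code: `run_with_reflections`, `attempt`, `check`, and `reflect` are hypothetical names, and the toy callables stand in for a frozen model and an environment.

```python
from typing import Callable, List, Tuple


def run_with_reflections(
    attempt: Callable[[str, List[str]], str],
    check: Callable[[str], bool],
    reflect: Callable[[str, str], str],
    task: str,
    max_trials: int = 3,
) -> Tuple[bool, List[str]]:
    """Retry a task, carrying lessons in a list instead of in weights."""
    memory: List[str] = []  # episodic memory of verbal self-critiques
    for _ in range(max_trials):
        output = attempt(task, memory)  # frozen model conditioned on memory
        if check(output):  # unambiguous external success/failure signal
            return True, memory
        memory.append(reflect(task, output))  # store the critique, keep going
    return False, memory


# Toy stand-ins (assumptions for the demo, not a real model or environment).
def toy_attempt(task: str, memory: List[str]) -> str:
    return "4" if memory else "5"  # only succeeds once a reflection exists


def toy_check(output: str) -> bool:
    return output == "4"


def toy_reflect(task: str, output: str) -> str:
    return f"answer {output!r} failed the check for {task!r}"


solved, notes = run_with_reflections(toy_attempt, toy_check, toy_reflect, "2 + 2")
```

The `check` callable is the external anchor: if it were replaced by the model's own opinion of its output, nothing in the loop would stop the stored reflections from rationalizing failures.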

That dependence on a clean external signal is exactly where multi-agent consensus turns fragile. When you scale from one model's samples to many talking agents, agreement degrades: groups fail not through subtle value corruption but through liveness loss (timeouts, stalled convergence), and the failures get worse as the group grows (see "Can LLM agent groups reliably reach consensus together?"). AgentsNet shows the failure is structural to coordination at scale: agents agree too late, adopt strategies without telling neighbors, and accept incoming information without verification, so errors propagate through the network (see "Why do multi-agent systems fail to coordinate at scale?"). Uncritical acceptance is the multi-agent analogue of voting below the accuracy threshold: consensus forms, but around the wrong thing.

The quietly interesting payoff: consensus games and binary-reward training share a calibration hazard. Treating the majority answer as ground truth, like rewarding only binary correctness, rewards confident agreement and never penalizes confident-but-wrong convergence, which is why adding a proper scoring rule such as a Brier term can restore the calibration that pure correctness rewards destroy (see "Does binary reward training hurt model calibration?"). And there's a richer target most consensus methods skip entirely: genuine reconciliation, where parties adjust toward a compatible-but-not-identical position, is a distinct dialogue type that current systems collapse into either false agreement or one side simply winning (see "Can disagreement be resolved without either party fully yielding?"). Consensus games optimize for the appearance of agreement; the harder, less-explored goal is making that agreement mean something.
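The calibration point can be made concrete with a small comparison. This sketch assumes one plausible form of the combined reward, correctness minus the squared gap between stated confidence and correctness; the function names and that exact combination are assumptions for illustration, not a formula quoted from the cited work.

```python
def binary_reward(correct: bool, confidence: float) -> float:
    """Correctness-only reward: the stated confidence is ignored."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty on the stated confidence.

    Assumed form: y - (confidence - y)^2, with y = 1 if correct else 0.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


# A wrong answer stated with full confidence versus with none.
for conf in (1.0, 0.0):
    print(conf, binary_reward(False, conf), brier_augmented_reward(False, conf))
```

Under the binary reward a confident wrong answer costs nothing beyond being wrong, while the Brier term makes overconfidence strictly worse than admitting uncertainty.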


Sources (9 notes)

Can models improve themselves using only majority voting?

Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.

When does majority-vote reward actually help test-time learning?

Test-time RL via consensus succeeds when prior accuracy exceeds ~50%, but below that threshold it silently amplifies wrong answers. Safe deployment requires gated probing per prompt class to confirm the favorable regime before training.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can LLM agent groups reliably reach consensus together?

Across hundreds of simulations, LLM-agent groups frequently fail to reach valid agreement due to timeouts and stalled convergence rather than subtle value corruption. Agreement degrades with group size even without Byzantine agents present.

Why do multi-agent systems fail to coordinate at scale?

The AgentsNet benchmark shows that agents fail to coordinate strategies, either by agreeing too late or by adopting strategies without informing neighbors. Agents accept neighbor information without verification, which lets errors propagate, although they remain capable of detecting direct conflicts.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can disagreement be resolved without either party fully yielding?

Research identifies a distinct dialogue type where both parties modify their positions through exchange until compatible but not identical. Current AI systems collapse this into false agreement or AI-wins persuasion.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst revisiting an open question about weight-frozen model steering: *What makes consensus games work without retraining the base model?* A curated library of LLM papers (2023–present) offered dated constraints; your job is to test whether newer work has dissolved them.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as a testable hypothesis, not settled fact.
- Consensus (majority vote) works only above ~50% prior accuracy; below that threshold it amplifies errors, collapsing into confident-but-wrong agreement (2025).
- Reflexion and memory-based agents (AgentFly) avoid retraining by externalizing credit assignment to episodic storage, but only if the success/failure signal is unambiguous (2024–2025).
- Multi-agent consensus fails structurally at scale through liveness loss (timeouts, stalled convergence) and uncritical acceptance of neighbor updates, not value corruption (2025–2026).
- Binary reward RL (like majority-vote reward) degrades calibration; adding proper scoring rules (e.g., Brier score) restores it (2024–2025).
- Genuine reconciliation (parties adjusting toward compatible-not-identical positions) is a distinct dialogue type current systems collapse into false agreement or winner-take-all (2023).

Anchor papers (verify; mind their dates):
- arXiv:2504.16084 (2025-04) TTRL: Test-Time Reinforcement Learning
- arXiv:2507.08616 (2025-07) AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs
- arXiv:2605.12978 (2026-05) Useful Memories Become Faulty When Continuously Updated by LLMs
- arXiv:2603.01213 (2026-03) Can AI Agents Agree?

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, ask: have newer training methods (e.g., test-time scaling, mixture-of-experts routing, in-context adaptation), tooling (multi-agent harnesses, formal verification), or evaluation benchmarks since relaxed or overturned the 50% threshold, liveness-loss bottleneck, or calibration hazard? Separate the durable question (likely still open: *what external anchor does consensus require?*) from perishable limits (e.g., *can you avoid that anchor with sufficiently large ensembles?*). Cite what resolved each, plainly.  
(2) Surface the strongest contradicting or superseding work from the last ~6 months (post-2026-03). Does any paper show consensus *without* an external anchor, or show that memory corruption propagates differently than the library claims?  
(3) Propose two research questions that *assume* the regime may have moved: (a) *If calibration hazards are now solvable, does that let consensus games work below 50% prior accuracy?* (b) *Can in-context co-player inference or strategic reasoning (per 2025-02) replace external signals in multi-agent consensus?*

Cite arXiv IDs for everything. Flag anything you cannot ground in a real paper.
