Can generative and discriminative models reach agreement?
Generative and discriminative decoding often produce conflicting answers. Can a game-theoretic framework force these two complementary procedures to reconcile their predictions into a single, more reliable output?
Language models offer two fundamentally different ways to answer a question. Generatively: produce the answer the model considers most probable. Discriminatively: score each candidate answer and pick the best. The two procedures often disagree. Generative decoding fails when probability mass is spread across contradictory answers; discriminative decoding fails through miscalibration or sensitivity to question wording. Both are noisy, and their noise is not correlated.
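A minimal sketch of how the two procedures can disagree on the same candidate set. The candidate strings and all scores are invented for illustration, and the helper names `generative_pick` and `discriminative_pick` are hypothetical.

```python
# Toy scores, invented purely for illustration (not taken from any benchmark).
candidates = ["Paris", "Lyon", "Marseille"]

# Generative view: P(answer | question), normalized over the candidate set.
p_generative = {"Paris": 0.40, "Lyon": 0.45, "Marseille": 0.15}

# Discriminative view: P(correct | question, answer), scored per candidate.
p_discriminative = {"Paris": 0.80, "Lyon": 0.30, "Marseille": 0.20}


def generative_pick(cands):
    """Return the candidate with the highest generative probability."""
    return max(cands, key=lambda c: p_generative[c])


def discriminative_pick(cands):
    """Return the candidate the discriminator rates most likely correct."""
    return max(cands, key=lambda c: p_discriminative[c])


print(generative_pick(candidates))      # Lyon
print(discriminative_pick(candidates))  # Paris
```

Here the generative distribution narrowly prefers one answer while the discriminative scores clearly prefer another, which is the kind of conflict the note describes.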
The Consensus Game formalizes this as a regularized imperfect-information sequential signaling game. A Generator agent must communicate an abstract correct/incorrect value to a Discriminator agent, but can only do so using natural language strings from a candidate set. An effective joint policy is one where both agents agree on which strings map to "correct." The resulting decoding algorithm — Equilibrium-Ranking — finds approximate equilibria of this game.
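The equilibrium search can be sketched as an iterative procedure in which each agent repeatedly responds to the other's average past policy while being pulled back toward its own initial distribution. The code below is a simplified illustration under stated assumptions, not the source's exact algorithm: the function name `equilibrium_rank`, the parameters `steps`, `lam`, and `eta`, the specific update rule, and all toy probabilities are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def equilibrium_rank(gen_init, disc_init, steps=500, lam=0.1, eta=0.1):
    """Hypothetical sketch of a regularized consensus search.

    gen_init[v][i]: initial generator probability of candidate i given
                    v in {"correct", "incorrect"} (each row sums to 1).
    disc_init[i]:   initial discriminator probability that candidate i
                    is correct.
    All probabilities must lie strictly between 0 and 1 (logs are taken).
    """
    values = ("correct", "incorrect")
    n = len(disc_init)
    disc0 = [{"correct": p, "incorrect": 1.0 - p} for p in disc_init]
    gen = {v: list(gen_init[v]) for v in values}
    disc = [dict(d) for d in disc0]
    # Running sums of past policies; each player responds to the other's average.
    gen_sum = {v: [0.0] * n for v in values}
    disc_sum = {v: [0.0] * n for v in values}

    for t in range(1, steps + 1):
        for v in values:
            for i in range(n):
                gen_sum[v][i] += gen[v][i]
                disc_sum[v][i] += disc[i][v]
        # Assumed schedule: the step-size term shrinks with t, while lam
        # keeps both policies anchored to their initial distributions.
        scale = 1.0 / (eta * t) + lam
        new_gen = {}
        for v in values:
            logits = [
                (disc_sum[v][i] / (2 * t) + lam * math.log(gen_init[v][i])) / scale
                for i in range(n)
            ]
            new_gen[v] = softmax(logits)
        new_disc = []
        for i in range(n):
            logits = [
                (gen_sum[v][i] / (2 * t) + lam * math.log(disc0[i][v])) / scale
                for v in values
            ]
            new_disc.append(dict(zip(values, softmax(logits))))
        gen, disc = new_gen, new_disc
    return gen, disc


# Toy inputs, invented for illustration.
candidates = ["Paris", "Lyon", "Marseille"]
gen_init = {"correct": [0.40, 0.45, 0.15], "incorrect": [0.20, 0.30, 0.50]}
disc_init = [0.80, 0.30, 0.20]

gen, disc = equilibrium_rank(gen_init, disc_init)
by_generator = max(range(len(candidates)), key=lambda i: gen["correct"][i])
by_discriminator = max(range(len(candidates)), key=lambda i: disc[i]["correct"])
print(candidates[by_generator], candidates[by_discriminator])
```

With these toy inputs the initial generator slightly prefers "Lyon" and the initial discriminator prefers "Paris"; after the iterations both final policies rank "Paris" first. Candidates can then be ranked by either final policy.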
The results are striking: LLaMA-7B with Equilibrium-Ranking outperforms LLaMA-65B and PaLM-540B on multiple benchmarks spanning reading comprehension, commonsense reasoning, mathematical problem-solving, and dialogue. A 7B model outperforming a 540B model corresponds to a roughly 77x difference in parameter count (540 / 7 ≈ 77).
The insight is that generative and discriminative procedures contain complementary information. Neither alone captures the model's "best guess at the truth." The game-theoretic framework extracts a consensus signal that is more reliable than either procedure individually. This is analogous to how ensemble methods combine weak learners, except that here the two "learners" are two modes of a single model.
This is a training-free method: no fine-tuning is required. The computational overhead comes from finding the equilibrium at inference time, which makes it a form of test-time compute scaling. With respect to "Can inference compute replace scaling up model size?", Equilibrium-Ranking provides a concrete mechanism: the test-time compute goes into reconciling the model's own internal disagreements rather than into generating longer reasoning chains.
The connection to multi-agent debate is suggestive. With respect to "Why do multi-agent LLM systems converge without genuine deliberation?", the Consensus Game forces genuine deliberation between two perspectives (generative and discriminative) within a single model: the equilibrium constraint prevents premature convergence because both agents must independently arrive at consistent signals. With respect to "When does debate actually improve reasoning accuracy?", the Consensus Game sidesteps the evidence-verification problem that plagues inter-model debate. Both "agents" operate within the same model's knowledge, so there is no risk of one agent persuading the other with rhetorically superior but factually wrong arguments; the equilibrium constraint forces agreement on what the model actually knows rather than on what it can argue most convincingly.
Inquiring lines that use this note as a source 5
This note is a source for these synthesized inquiries.
- Could superposed decoding algorithms maintain multi-task representation during generation?
- How can stochastic beam search operationalize step-level confidence into a decoding algorithm?
- Why do generative and discriminative language model procedures disagree?
- Why can generative verifiers scale verification compute more effectively than fixed-output discriminative models?
- What is the relationship between prefix sharing and speculative decoding?
Related concepts in this collection 6
- Can inference compute replace scaling up model size?
  Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
  Equilibrium-Ranking is a specific mechanism: test-time compute spent reconciling internal disagreements
- Why do multi-agent LLM systems converge without genuine deliberation?
  Multi-agent reasoning systems are designed to improve answers through debate, but often agents simply agree with early confident claims rather than genuinely disagreeing. What drives this pattern and how common is it?
  game-theoretic equilibrium prevents premature convergence
- Why does parallel reasoning outperform single chain thinking?
  Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
  Consensus Game implicitly parallelizes by running both generative and discriminative procedures
- Can disagreement be resolved without either party fully yielding?
  Explores whether dialogue can move past winner-take-all debate or forced consensus to genuine mutual adjustment. Matters for AI systems that need to work through real disagreement with users.
  Consensus Game is mechanistic dialectical reconciliation within a single model
- Can models trained on many imperfect experts outperform each one?
  Do generative models trained on diverse, imperfect human experts develop an implicit consensus that surpasses any individual contributor? This explores whether aggregating diverse perspectives at training time, rather than inference time, can denoise human biases.
  training-time analog: transcendence extracts consensus from diverse human experts encoded in weights, Consensus Game extracts consensus between a single model's generative and discriminative modes; both demonstrate that aggregation over diverse perspectives outperforms any single perspective
- When does debate actually improve reasoning accuracy?
  Multi-agent debate shows promise for reasoning tasks, but under what conditions does it help versus hurt? The research explores whether debate amplifies errors when evidence verification is missing.
  Consensus Game sidesteps debate's evidence-verification problem: both "agents" share the same knowledge, so equilibrium forces agreement on actual knowledge rather than rhetorical persuasion
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The Consensus Game: Language Model Generation via Equilibrium Search
- Game-theoretic LLM: Agent Workflow for Negotiation Games
- Everything Everywhere All At Once: LLMs Can In-context Learn Multiple Tasks In Superposition
- LLM Strategic Reasoning: Agentic Study through Behavioral Game Theory
- Transcendence: Generative Models Can Outperform The Experts That Train Them
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
- The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind
- Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess
Original note title
game-theoretic equilibrium between generative and discriminative LM decoding reconciles their inconsistent predictions — small models with consensus match models 100x larger