Can multiple agents stay diverse during training together?
Does training separate specialist agents on different data maintain the reasoning diversity that single-agent finetuning destroys? This matters because diversity correlates with accuracy and prevents models from becoming trapped in narrow response patterns.
Single-agent self-improvement through iterative finetuning hits a wall fast. After one round of finetuning on its own generated outputs, performance saturates and begins to drop: the model becomes fixated on a narrow range of responses, which limits diversity and degrades accuracy. This is the training-time analog of the inference-time problem described in "Does a model improve by arguing with itself?": a single model trapped in its own distribution.
The multiagent finetuning framework (Du et al., 2025) proposes a structural fix: instead of training one model iteratively, train a society of models, each starting from the same base but independently specialized through distinct training data generated via multi-agent interactions. Generation agents produce initial responses; critic agents evaluate and refine them through debate. Each model sees different data because the interactions are role-dependent.
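The role-dependent data split can be illustrated with a small sketch. It is not the paper's pipeline: the `Turn` record, the role labels, and the majority-vote filter are all assumptions made for illustration. It shows only how a single set of interactions can yield a different finetuning set for each model copy.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One response in a multi-agent interaction (hypothetical record)."""
    agent_id: int   # which model copy produced the response
    role: str       # "generator" or "critic" (illustrative labels)
    question: str
    answer: str     # final answer extracted from the response


def majority_answer(turns):
    """Most common answer among the turns; ties go to the first seen."""
    return Counter(t.answer for t in turns).most_common(1)[0][0]


def build_datasets(turns):
    """Build one finetuning set per (role, agent_id).

    Assumption: a turn is kept only if it agrees with the majority
    answer for its question. The real selection rule may differ.
    """
    by_question = {}
    for t in turns:
        by_question.setdefault(t.question, []).append(t)

    datasets = {}
    for question, q_turns in by_question.items():
        target = majority_answer(q_turns)
        for t in q_turns:
            if t.answer == target:
                key = (t.role, t.agent_id)
                datasets.setdefault(key, []).append((question, t.answer))
    return datasets
```

Because the key is the pair of role and agent, no two model copies share a dataset, even though all of them start from the same base model.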
The mechanism works because role specialization prevents convergence to a single mode. When one model is trained to generate and another to critique, their training distributions diverge, maintaining the diversity that single-agent training destroys. The summarization step between debate rounds further helps by eliminating redundant information and retaining critical points — removing summarization hurts performance. Removing critics also degrades output quality, confirming that the evaluative role is load-bearing, not decorative.
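A toy version of the debate loop shows where the summarization step sits. Every name here is hypothetical: the agents are plain callables standing in for models, and `summarize` only removes repeated responses, whereas a real system would presumably use a model to condense them.

```python
def summarize(responses):
    """Toy summarizer: drop repeated responses, keep first-seen order."""
    seen = set()
    kept = []
    for response in responses:
        text = response.strip()
        key = text.lower()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return " | ".join(kept)


def debate(question, agents, rounds=2):
    """Run a multi-round debate.

    agents: list of callables (question, context) -> response.
    In every round after the first, each agent receives a summary of
    the other agents' previous responses as its context.
    """
    responses = [agent(question, "") for agent in agents]
    for _ in range(rounds - 1):
        updated = []
        for i, agent in enumerate(agents):
            others = responses[:i] + responses[i + 1:]
            updated.append(agent(question, summarize(others)))
        responses = updated
    return responses
```

Passing `others` directly instead of `summarize(others)` would correspond to the no-summarization ablation: each agent would then see every peer response, including the redundant ones.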
This connects directly to "Does policy entropy collapse limit reasoning performance in RL?": the entropy collapse that limits RL training is mitigated when multiple agents maintain distinct policy distributions. And because LLM-generated ideas tend to cluster in homogeneous sets (see "Why do LLMs generate novel ideas from narrow ranges?"), preserving diversity at training time through multi-agent specialization could address the output-time diversity problem upstream.
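One crude way to make the entropy argument concrete is to measure the Shannon entropy of sampled final answers. This is an assumed proxy chosen for illustration, not the diversity metric used in the paper, and the answer lists in the usage note are invented toy data.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy, in bits, of the empirical answer distribution."""
    counts = Counter(answers)
    total = len(answers)
    # p * log2(1 / p), summed over the distinct answers
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

Under this proxy, a collapsed agent that always returns the same answer scores zero: `answer_entropy(["4", "4", "4", "4"])` is 0.0. Pooling samples from agents that have kept distinct distributions scores higher: `answer_entropy(["4", "4", "5", "6"])` is 1.5 bits.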
The cost is real — multiple model copies for training and inference. But the finding that single-agent FT collapses after one iteration means the choice is not "cheap single-agent" vs "expensive multi-agent" but "one iteration of productive training" vs "sustained improvement across many rounds."
Inquiring lines that use this note as a source 10
- What role does environment diversity play in preventing agents from overfitting to curator imagination?
- Why does diversity without expertise produce worse results than a single capable agent?
- Why does island model genetic evolution maintain diversity better than single populations?
- Can diverse expert demonstrations exceed the knowledge of any single expert?
- What conditions make training diversity better than individual expert quality?
- How does mutual shaping through diverse training compare to population-level diversity effects?
- How does role specialization preserve reasoning diversity in multi-agent teams?
- Can cognitive diversity overcome expertise gaps in agent teams?
- Can cognitive diversity compensate for lack of expertise in agent teams?
- How much does diversity training cost in single-shot pass@1 performance?
Related concepts in this collection 5
- Does a model improve by arguing with itself?
  When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
  inference-time analog; this is the training-time version
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  the entropy dynamics this approach counteracts
- Why do LLMs generate novel ideas from narrow ranges?
  LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation.
  the output-time diversity problem this could address upstream
- Why do multi-agent LLM systems converge without genuine deliberation?
  Multi-agent reasoning systems are designed to improve answers through debate, but often agents simply agree with early confident claims rather than genuinely disagreeing. What drives this pattern and how common is it?
  the multi-agent convergence failure that critic roles help prevent
- Does training on AI-generated content permanently degrade model quality?
  When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.
  single-agent FT collapse is a specific instance
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
- ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs
- SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
- Artifacts as Memory Beyond the Agent Boundary
- What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity
- Vector Policy Optimization: Training for Diversity Improves Test-Time Search
- Towards a Science of Scaling Agent Systems
- LIMI: Less is More for Agency
Original note title
multi-agent finetuning preserves reasoning diversity by training agents on distinct data and roles — single-agent self-improvement saturates after one iteration