SYNTHESIS NOTE

Can multiple agents stay diverse during training together?

Does training several specialist agents, each on different data, preserve the reasoning diversity that single-agent finetuning destroys? This matters because diversity correlates with accuracy and keeps models from becoming trapped in narrow response patterns.

Synthesis note · 2026-02-23 · sourced from Agents Multi

Single-agent self-improvement through iterative finetuning hits a wall fast. After one round of finetuning on its own generated outputs, performance saturates and begins to drop: the model fixates on a narrow range of responses, which limits diversity and degrades accuracy. This is the training-time analog of the inference-time failure examined in "Does a model improve by arguing with itself?": a single model trapped in its own distribution.

The multiagent finetuning framework (Du et al., 2025) proposes a structural fix: instead of training one model iteratively, train a society of models, each starting from the same base but independently specialized through distinct training data generated via multi-agent interactions. Generation agents produce initial responses; critic agents evaluate and refine them through debate. Each model sees different data because the interactions are role-dependent.
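The role-dependent data split described above can be sketched as follows. This is a minimal illustration, not the framework's actual pipeline: the `Turn` record, the agent identifiers, and the majority-vote filter over critic answers are all assumptions introduced here, since this note does not say how accepted responses are selected.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One agent response inside a debate (hypothetical record layout)."""
    agent_id: str  # hypothetical identifier, e.g. "gen_0" or "critic_1"
    role: str      # "generator" or "critic"
    question: str
    response: str
    answer: str    # final answer extracted from the response


def build_role_datasets(turns):
    """Route each agent's own accepted turns into its private finetuning set."""
    # Assumption: the accepted answer per question is the majority vote
    # over critic turns. The note does not specify the selection rule.
    votes = defaultdict(Counter)
    for t in turns:
        if t.role == "critic":
            votes[t.question][t.answer] += 1
    consensus = {q: c.most_common(1)[0][0] for q, c in votes.items()}

    # Each agent keeps only its own turns that agree with the consensus,
    # so generators and critics end up with different training data.
    datasets = defaultdict(list)
    for t in turns:
        if consensus.get(t.question) == t.answer:
            datasets[t.agent_id].append((t.question, t.response))
    return dict(datasets)


turns = [
    Turn("gen_0", "generator", "2+2?", "It is 4.", "4"),
    Turn("gen_1", "generator", "2+2?", "It is 5.", "5"),
    Turn("critic_0", "critic", "2+2?", "gen_1 is wrong; the answer is 4.", "4"),
    Turn("critic_1", "critic", "2+2?", "Agreed, 4.", "4"),
]
datasets = build_role_datasets(turns)
```

Because every agent is finetuned only on its own entries in `datasets`, two models that start from the same base receive different examples, which is the property the note credits with keeping them from converging.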

The mechanism works because role specialization prevents convergence to a single mode. When one model is trained to generate and another to critique, their training distributions diverge, maintaining the diversity that single-agent training destroys. The summarization step between debate rounds further helps by eliminating redundant information and retaining critical points — removing summarization hurts performance. Removing critics also degrades output quality, confirming that the evaluative role is load-bearing, not decorative.
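A debate loop with a summarization step between rounds might look like the sketch below. It is a simplification under stated assumptions: agents and the summarizer are plain callables (hypothetical stand-ins for model calls), every agent receives the same summary, and the prompt template is invented for illustration.

```python
from typing import Callable, List

Agent = Callable[[str], str]             # hypothetical: prompt -> response
Summarizer = Callable[[List[str]], str]  # hypothetical: responses -> summary


def debate(question: str, agents: List[Agent], summarize: Summarizer,
           rounds: int = 2) -> List[List[str]]:
    """Run a multi-round debate; each later round sees a summary of the previous one."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    responses = [agent(question) for agent in agents]
    history = [responses]
    for _ in range(rounds - 1):
        # Summarization step: condense the previous round before re-prompting.
        summary = summarize(responses)
        prompt = f"{question}\n\nSummary of previous round:\n{summary}"
        responses = [agent(prompt) for agent in agents]
        history.append(responses)
    return history


def echo_agent(name: str) -> Agent:
    """Stub agent that only reports how much context it received."""
    return lambda prompt: f"{name} saw {len(prompt)} chars"


def dedupe_summary(responses: List[str]) -> str:
    """Stub summarizer: drop exact duplicates, keep first-seen order."""
    return "\n".join(dict.fromkeys(responses))


history = debate("Is 17 prime?", [echo_agent("a"), echo_agent("b")],
                 dedupe_summary, rounds=3)
```

Swapping `dedupe_summary` for a function that returns the raw concatenation of all responses gives a rough stand-in for the no-summarization ablation mentioned above.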

This connects directly to "Does policy entropy collapse limit reasoning performance in RL?": the entropy collapse that limits RL training is mitigated when multiple agents maintain distinct policy distributions. And because LLM-generated ideas tend to cluster in narrow ranges (see "Why do LLMs generate novel ideas from narrow ranges?"), preserving diversity at training time through multi-agent specialization could address that output-time diversity problem upstream.
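One rough way to make the diversity comparison concrete is the Shannon entropy of the final answers a system produces for a single question. The sketch below is only a proxy chosen for illustration: it measures entropy over sampled answers, not the token-level policy entropy discussed in the RL setting, and the sample lists are made up.

```python
import math
from collections import Counter


def empirical_entropy(samples):
    """Shannon entropy, in bits, of the empirical distribution over samples."""
    if not samples:
        return 0.0
    counts = Counter(samples)
    total = len(samples)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Illustrative, invented samples: one collapsed model versus answers
# pooled across several specialized agents.
collapsed = ["A", "A", "A", "A"]
pooled = ["A", "B", "A", "B"]

collapsed_bits = empirical_entropy(collapsed)
pooled_bits = empirical_entropy(pooled)
```

A model that always returns the same answer scores zero bits, while answers split evenly between two options score one bit, so tracking this number across finetuning rounds would show whether a system is narrowing.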

The cost is real — multiple model copies for training and inference. But the finding that single-agent FT collapses after one iteration means the choice is not "cheap single-agent" vs "expensive multi-agent" but "one iteration of productive training" vs "sustained improvement across many rounds."

Inquiring lines that read this note 11

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

What role does environment diversity play in preventing agents from overfitting to curator imagination?

How do multi-agent systems achieve genuine cooperation and reasoning?

When does optimizing for quality undermine the value of diversity?

Can AI-generated outputs constitute genuine knowledge or valid claims?

Can diverse expert demonstrations exceed the knowledge of any single expert?

Related concepts in this collection 5

Does a model improve by arguing with itself?
When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
inference-time analog; this is the training-time version

Does policy entropy collapse limit reasoning performance in RL?
As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
the entropy dynamics this approach counteracts

Why do LLMs generate novel ideas from narrow ranges?
LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation.
the output-time diversity problem this could address upstream

Why do multi-agent LLM systems converge without genuine deliberation?
Multi-agent reasoning systems are designed to improve answers through debate, but often agents simply agree with early confident claims rather than genuinely disagreeing. What drives this pattern and how common is it?
the multi-agent convergence failure that critic roles help prevent

Does training on AI-generated content permanently degrade model quality?
When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.
single-agent FT collapse is a specific instance
