Multi-Agent Systems

Can agents evaluate AI outputs more reliably than language models?

Does active evidence collection through tool use reduce judge inconsistency compared to passive reading-based evaluation? This matters for benchmarking AI systems where evaluation reliability directly affects research validity.

Can personas extracted from documents generalize across evaluation tasks?

This explores whether automating persona creation from domain documents—rather than hand-crafting roles—enables multi-agent evaluators to transfer across different tasks without redesign. The question matters because manual personas fail to generalize across domains.

Why do autonomous LLM agents fail in predictable ways?

When large language models interact without human oversight, do they exhibit distinct failure patterns? Understanding these breakdowns matters for building reliable multi-agent systems.

Does cognitive diversity alone improve multi-agent ideation quality?

This explores whether diverse perspectives in group AI systems automatically produce better ideas, or if something else—like expertise—is equally critical for collaborative ideation to outperform solo agents.

Can tailoring queries per document improve debatable summarization?

When summarizing documents with opposing perspectives on a topic, does adapting the query to each document's unique content retrieve more balanced viewpoints than using a single uniform query?

Does structured artifact sharing outperform conversational coordination?

Explores whether agents coordinating through standardized documents rather than natural language messages achieve better collaboration outcomes. Matters because it challenges the default conversational paradigm in multi-agent system design.

Can multiple agents stay diverse during training together?

Does training separate specialist agents on different data maintain the reasoning diversity that single-agent finetuning destroys? This matters because diversity correlates with accuracy and prevents models from becoming trapped in narrow response patterns.

Can branching prompts replicate what multi-agent systems do?

Explores whether non-linear prompting structures (tree-of-thought, debate prompting) can functionally replace multi-agent architectures, and whether a single LLM simulating multiple personas achieves the same cognitive benefits as multiple models collaborating.

Can AI systems design unique multi-agent workflows per individual query?

Explores whether meta-agents trained with reinforcement learning can automatically generate personalized multi-agent system architectures tailored to individual user queries, rather than applying fixed task-level templates uniformly.

Why do standard alignment methods ignore partner interventions?

Standard RLHF and DPO optimize for token-level quality but may structurally prevent agents from meaningfully incorporating partner input. This explores whether the training objective itself blocks collaborative reasoning.