INQUIRING LINE

How should moderator LLMs decide which speakers to query per topic?

This explores the design problem behind a 'moderator' LLM in a multi-speaker setting — how it should pick *who* to call on for a given topic — and what the corpus offers on querying decisions, even though it never uses the word 'moderator' directly.


This reads the question as a routing-and-querying decision: given a topic and several possible speakers, how does a moderator LLM choose whom to query, and when? The corpus doesn't have a paper labeled 'moderator,' but it has surprisingly direct material if you treat the problem as three sub-decisions — who has signal on this topic, when to ask versus infer, and how to attribute what comes back without breaking trust.

The most useful reframe comes from work on *when* an agent should ask at all. Conversation-analysis research formalizes 'insert-expansions', the clarifying side-questions humans use to scope intent before acting, as a principled trigger for when a tool-enabled model should probe a person rather than silently proceed (When should AI agents ask users instead of just searching?). For a moderator, this is the core gate: query a speaker when their input would change the answer, not on a fixed schedule. The companion finding is that the *model itself* is often a better judge of what to fetch than a passive retriever sitting in front of it: letting the model emit structured, iterative requests for tools beats single-round semantic matching (Can models decide better than retrievers which tools to use?). Swap 'tools' for 'speakers' and you get a design principle: let the moderator reason its way to who to call on, refining across turns, rather than precomputing a similarity score between topic and speaker.
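The gate described above can be pictured as a small decision rule. What follows is a minimal sketch, not something taken from the cited notes: `Candidate`, `should_query`, `pick_speakers`, and the `flip_probability` and `ask_cost` fields are all hypothetical names, and how a moderator would actually estimate whether a reply changes the answer is left open.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    speaker: str
    # Hypothetical estimate in [0, 1]: chance that this speaker's reply
    # would change the moderator's current answer.
    flip_probability: float
    # Hypothetical cost of interrupting this speaker, on the same scale.
    ask_cost: float


def should_query(c: Candidate, min_gain: float = 0.0) -> bool:
    """Ask only when the expected change outweighs the cost of asking."""
    return c.flip_probability - c.ask_cost > min_gain


def pick_speakers(candidates, max_queries=2):
    """Return the speakers worth querying this turn, best first."""
    worth_asking = [c for c in candidates if should_query(c)]
    worth_asking.sort(key=lambda c: c.flip_probability - c.ask_cost, reverse=True)
    return [c.speaker for c in worth_asking[:max_queries]]


candidates = [
    Candidate("ana", flip_probability=0.8, ask_cost=0.1),
    Candidate("bo", flip_probability=0.2, ask_cost=0.3),
    Candidate("cy", flip_probability=0.5, ask_cost=0.1),
]
selected = pick_speakers(candidates)
```

Under these made-up numbers the moderator would call on "ana" and "cy" and skip "bo", whose reply is unlikely to change anything. In a multi-turn setting the estimates would be recomputed after each reply, which is where the iterative refinement comes in.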

But *which* speaker has relevant signal is a retrieval problem in disguise, and the corpus warns there's no single right strategy. Large-corpus recommenders draw on four distinct retrieval patterns (dual-encoder embedding, direct LLM search, concept-based, and search-API lookup), each with different latency and accuracy tradeoffs, and hybrids likely work best (How should LLM-based recommenders retrieve from massive item corpora?). A moderator choosing speakers faces the same fork: matching a topic to a speaker by embedding similarity is cheap but shallow, while reasoning over each speaker's history is richer but slower. And what you match *on* matters: personalization research finds that profiles built from people's past *outputs* (what they said and how) match or exceed full profiles, while profiles built only from their past *inputs* or queries degrade performance (Do user outputs outperform inputs for LLM personalization?). So a moderator should profile speakers by their prior contributions' style and stance, not by the questions they asked.
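A hybrid along these lines might use a cheap similarity pass to shortlist speakers and reserve slower reasoning for the shortlist. The sketch below is illustrative only: it uses bag-of-words cosine similarity as a stand-in for a real embedding model, it matches on each speaker's past outputs rather than their questions, and `shortlist`, `select`, and the `rerank` callback are hypothetical names (in practice `rerank` would be an LLM call, which is not shown).

```python
import math
from collections import Counter


def bag(text):
    """Lowercased word counts; a crude stand-in for an embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def shortlist(topic, past_outputs, k=2):
    """Cheap first stage: rank speakers by similarity of their past outputs."""
    topic_vec = bag(topic)
    scores = {
        speaker: cosine(topic_vec, bag(" ".join(outputs)))
        for speaker, outputs in past_outputs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def select(topic, past_outputs, rerank, k=2):
    """Slower second stage: hand only the shortlist to a reasoning step."""
    return rerank(topic, shortlist(topic, past_outputs, k))


past_outputs = {
    "ana": ["we should cache the retrieval index", "latency matters for retrieval"],
    "bo": ["the budget is due friday"],
    "cy": ["retrieval quality depends on the index"],
}
candidates = shortlist("retrieval index latency", past_outputs)
```

The first stage keeps latency low when there are many speakers; the second stage is where richer judgments about style and stance would live.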

Two failure modes lurk here, and this is the part a reader might not expect to care about. First, attribution: the moment a moderator routes and summarizes across speakers, it inherits the exact failure that makes LLM meeting summaries untrustworthy. Mis-attributing who said what damages group accountability, and 'globally important' is not the same as 'relevant to this person' (Why do LLM meeting summaries fail to help individuals?). Querying the right speaker is wasted if the moderator then mis-credits the reply. Second, topic discipline: models reliably drift toward conversational distractors because they're trained on what-to-do but not what-to-ignore, a gap the cited work narrows with fine-tuning on just 1,080 synthetic dialogues (Why do language models engage with conversational distractors?). A moderator that can't hold a topic will query speakers about the wrong thing.
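One way to guard against the attribution failure is to keep the speaker attached to every reply and check each summary claim against the reply it cites. This is a minimal sketch under assumed data shapes: `Reply`, `SummaryClaim`, and `misattributed` are hypothetical, and it only checks the credited speaker, not whether the claim faithfully reflects what was said.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reply:
    reply_id: str
    speaker: str
    text: str


@dataclass(frozen=True)
class SummaryClaim:
    text: str
    credited_to: str
    source_reply_id: str


def misattributed(claims, replies):
    """Return claims whose cited reply is missing or came from someone else."""
    by_id = {r.reply_id: r for r in replies}
    flagged = []
    for claim in claims:
        source = by_id.get(claim.source_reply_id)
        if source is None or source.speaker != claim.credited_to:
            flagged.append(claim)
    return flagged


replies = [
    Reply("r1", "ana", "Ship the cache change first."),
    Reply("r2", "bo", "Budget review is on Friday."),
]
claims = [
    SummaryClaim("Ship the cache change first.", "ana", "r1"),
    SummaryClaim("Budget review is on Friday.", "ana", "r2"),  # wrong speaker
    SummaryClaim("Legal has signed off.", "cy", "r9"),  # no such reply
]
flagged = misattributed(claims, replies)
```

Flagged claims could be dropped or sent back for correction before the summary reaches the group.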

The quietly important takeaway: don't trust the moderator to *hold a position* about who matters. Models conform to the shape of whatever framing is in front of them rather than defending a stable stance (Do LLMs actually hold stable positions or just mirror user arguments?), so a moderator's judgment of 'who's relevant here' will bend to how the topic was phrased to it. That argues for grounding speaker selection in explicit, auditable signals (contribution history, declared expertise) rather than the model's in-the-moment sense of fit. The decision of whom to query is less a ranking problem than a discipline problem: ask only when it changes the outcome, match on what people actually contributed, attribute carefully, and don't let the framing of the topic quietly rewrite who counts as relevant.
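Grounding selection in explicit signals could look like the sketch below, which scores speakers from a ledger keyed on a canonical topic tag and returns the reasons alongside each choice. Everything here is an assumption made for illustration: `SpeakerRecord`, `select_with_audit`, the tag vocabulary, and the weights (2 points for declared expertise, 1 per prior contribution) are arbitrary, and mapping a free-text topic to a tag is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class SpeakerRecord:
    name: str
    declared_expertise: set = field(default_factory=set)
    # Topic tag -> number of past contributions on that topic.
    contributions: dict = field(default_factory=dict)


def select_with_audit(topic_tag, records, min_score=1):
    """Return (name, score, reasons) rows so each choice can be inspected."""
    chosen = []
    for record in records:
        score = 0
        reasons = []
        if topic_tag in record.declared_expertise:
            score += 2  # arbitrary illustrative weight
            reasons.append("declared expertise")
        count = record.contributions.get(topic_tag, 0)
        if count:
            score += count
            reasons.append(f"{count} prior contribution(s)")
        if score >= min_score:
            chosen.append((record.name, score, reasons))
    # Highest score first; ties broken by name so the order is deterministic.
    chosen.sort(key=lambda row: (-row[1], row[0]))
    return chosen


records = [
    SpeakerRecord("ana", {"pricing"}, {"pricing": 3}),
    SpeakerRecord("bo", set(), {"pricing": 1}),
    SpeakerRecord("cy", {"legal"}, {}),
]
audit = select_with_audit("pricing", records)
```

Because the score depends only on the tag and the ledger, rephrasing the topic cannot change who is selected unless it changes the tag, and the returned reasons make each selection reviewable after the fact.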


Sources (7 notes)

When should AI agents ask users instead of just searching?

Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.

Can models decide better than retrievers which tools to use?

MCP-Zero shows that letting models emit structured tool requests iteratively across conversations outperforms single-round semantic matching. The model can refine requirements progressively across domains as reasoning unfolds, bypassing colloquial-to-formal vocabulary mismatch.

How should LLM-based recommenders retrieve from massive item corpora?

RecLLM identifies four retrieval patterns—dual-encoder, direct LLM search, concept-based, and search-API lookup—each optimized for different corpus sizes, latency budgets, and training constraints. Hybrid approaches mixing multiple strategies likely work best for real systems.

Do user outputs outperform inputs for LLM personalization?

Research shows that user profiles built from outputs alone match or exceed performance of complete profiles across multiple tasks, while input-only profiles degrade performance. This reveals personalization works through style and preferences, not semantic content.

Why do LLM meeting summaries fail to help individuals?

A user study of seven participants found three critical failures: systems summarize global importance rather than individual relevance, mis-attributions damage group trust and accountability, and one format cannot serve both quick scanning and detailed reference needs.

Why do language models engage with conversational distractors?

Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.

Do LLMs actually hold stable positions or just mirror user arguments?

Language models generate outputs that match the trajectory implied by each prompt, rather than maintaining stable stances across interactions. This shape-holding is distinct from position-holding: the model produces argument-like text shaped by user framing, not from any underlying commitment being defended.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about moderator LLM routing decisions in multi-speaker settings. The question remains open: how should a moderator LLM choose which speakers to query, and when?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023-07 to 2025-07. Key constraints from the path:
• Insert-expansions (clarifying side-questions) formalize WHEN to query rather than infer — query only when the answer would change (2023-07, arXiv:2307.01644).
• Proactive tool-retrieval — letting the model iteratively request speakers beats pre-computed semantic matching on topics (2025-06, arXiv:2506.01056).
• Speaker relevance is driven by their past *outputs* (what they contributed, tone, stance) far more than by input history or metadata (2024-06, arXiv:2406.17803).
• Attribution and topic-following are critical failure modes: LLMs mis-attribute speakers and drift off-topic under framings, even with instruction-tuning (2023-07, arXiv:2307.15793; 2024-04, arXiv:2404.03820).
• Models hold whatever argument-shape the user presents rather than defend stable relevance judgments; framing rewrites who "counts" (2025-07, arXiv:2507.21083; 2025-06, arXiv:2506.20020).

Anchor papers (verify; mind their dates):
• arXiv:2307.01644 (2023-07): Insert-expansions for Tool-enabled Conversational Agents
• arXiv:2406.17803 (2024-06): Role of User Profile in LLM Personalization
• arXiv:2404.03820 (2024-04): CantTalkAboutThis — Topic Alignment in Dialogues
• arXiv:2506.01056 (2025-06): MCP-Zero — Proactive Toolchain Construction

Your task:
(1) RE-TEST EACH CONSTRAINT. For the five claims above, assess whether advances in model scale, multi-turn reasoning (arXiv:2505.06120), reinforcement-learning-guided search (arXiv:2503.09516), or persona-grounding have since relaxed the topic-drift, attribution, or framing-sensitivity problems. Distinguish the durable question (*how* to route speakers) from perishable limitations (*why current routing fails*). Cite what resolved each, and plainly state where constraints still hold.
(2) SURFACE CONTRADICTING WORK. Identify any post-2025-06 papers that relax the "models conform to framing" claim or show robust speaker-attribution in multi-turn dialogue.
(3) PROPOSE 2 RESEARCH QUESTIONS that assume the regime may have moved: e.g., "Given improved topic-following in 2025 models, can a moderator safely rank speakers by soft relevance instead of explicit history?" or "Do reinforcement-learning-trained agents overcome attribution drift?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
