Why might text-only interfaces underestimate agent preference elicitation capabilities?
This explores why judging an agent's ability to learn what users want by watching it chat in text may sell it short — because preference elicitation happens through channels the text box never sees: observation, richer interfaces, and active questioning.
This reads the question as being about measurement bias: if you only ever watch an agent infer preferences through a text chat window, you may conclude it's weak at preference elicitation — when in fact the text box is the bottleneck, not the agent. The corpus suggests the limitation lives in the interface, not the capability.
Start with the assumption baked into 'text-only': that eliciting a preference means asking for it in words. But agents can learn preferences by watching rather than asking. The M3-Agent work shows that an entity-centric memory graph fed by continuous multimodal observation lets an agent infer and act on what a user wants without ever posing a question (Can agents learn preferences by watching rather than asking?). A text-only setup is blind to exactly this passive, ambient signal, so it can only measure the narrow slice of elicitation that survives being typed out.
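To make the watching-rather-than-asking idea concrete, here is a minimal sketch of an entity-keyed memory that keeps raw observations (episodic) apart from aggregated knowledge (semantic). It is a toy illustration, not M3-Agent's implementation: the class names, the `"chose"` action label, and the count-based inference rule are all assumptions introduced here.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class EntityMemory:
    """Memory for one entity (e.g. a user): raw events plus derived knowledge."""
    episodic: list = field(default_factory=list)         # time-ordered observations
    semantic: Counter = field(default_factory=Counter)   # aggregated choice counts


class ObservationMemory:
    """Hypothetical entity-keyed store fed by passive observation, not questions."""

    def __init__(self):
        self.entities = defaultdict(EntityMemory)

    def observe(self, entity, action, item):
        """Record every event; only explicit choices update semantic knowledge."""
        mem = self.entities[entity]
        mem.episodic.append((action, item))
        if action == "chose":  # assumed label for a preference-revealing action
            mem.semantic[item] += 1

    def inferred_preference(self, entity):
        """Return the most frequently chosen item, or None if nothing is known."""
        mem = self.entities.get(entity)
        if mem is None or not mem.semantic:
            return None
        return mem.semantic.most_common(1)[0][0]


memory = ObservationMemory()
memory.observe("alice", "chose", "coffee")
memory.observe("alice", "chose", "tea")
memory.observe("alice", "chose", "coffee")
memory.observe("alice", "looked_at", "juice")
print(memory.inferred_preference("alice"))  # coffee
```

No question is ever posed to the user here; the preference falls out of accumulated observations, which is exactly the signal a text-only evaluation never records.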
Text is also a thin and passive channel in its own right. Conversational agents are structurally passive: their training optimizes for responding, not for leading, so in a pure chat setting they won't proactively probe for what they don't yet know (Why can't conversational AI agents take the initiative?). And when interfaces are compared head-to-head, users prefer generated task-specific UIs over text blocks in more than 70% of cases; structured, interactive surfaces let people express and refine intent that a wall of text muddies (Do generated interfaces outperform text-based chat for most tasks?). A dashboard with sliders, for example, can surface preferences that the same user might never volunteer in prose.
The same loss of signal shows up at the perception layer. Text-based GUI agents that read a page as HTML or an accessibility tree miss what humans actually see; real grounding needs vision, not a flattened text transcript of the screen (Do text-based GUI agents actually work in the real world?). Preference cues, such as what a user lingers on or clicks, live in that perceptual richness, and a text-only evaluation discards them before the agent gets a chance to use them.
Worth noticing is how cheaply elicitation can work once it is given the right channel. PReF shows that just ten well-chosen adaptive questions can pin down a personalized reward function through active learning (Can user preferences be learned from just ten questions?), and conversation-analysis 'insert-expansions' give a formal account of when an agent should pause to ask versus quietly proceed (When should AI agents ask users instead of just searching?). How preferences are stored matters too: abstract semantic preference summaries beat replaying raw past chats (Does abstract preference knowledge outperform specific interaction recall?). The takeaway you might not have expected: the agent's real preference-elicitation ability is a product of the modality it's allowed to use, so a text-only interface doesn't just limit the agent, it quietly hides how good it could be.
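The adaptive-questioning idea can be illustrated with a small sketch. It does not reproduce PReF's method: instead of reducing uncertainty over learned reward coefficients, it uses a simpler stand-in, greedy hypothesis elimination, where each question is the pairwise comparison that splits the remaining candidate users most evenly. The feature vectors, the candidate weight grid, and all function names are assumptions made for illustration.

```python
import itertools

# Assumed setup: each response is described by 3 base reward features,
# and a user is a weight vector over those features.
HYPOTHESES = [w for w in itertools.product([-1, 0, 1], repeat=3) if any(w)]
ITEMS = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (0, 1, 1), (1, 0, 1)]
QUESTIONS = list(itertools.combinations(ITEMS, 2))  # "do you prefer A or B?"


def score(weights, item):
    """Linear reward: weighted sum of the item's features."""
    return sum(w * x for w, x in zip(weights, item))


def prefers_first(weights, question):
    """True if this user would pick the first item (ties count as first)."""
    a, b = question
    return score(weights, a) >= score(weights, b)


def imbalance(question, hypotheses):
    """0 means a perfect 50/50 split; len(hypotheses) means no split at all."""
    yes = sum(prefers_first(h, question) for h in hypotheses)
    return abs(2 * yes - len(hypotheses))


def elicit(answer_fn, budget=10):
    """Ask up to `budget` questions, each chosen to halve the hypothesis set."""
    hypotheses = list(HYPOTHESES)
    remaining = list(QUESTIONS)
    asked = 0
    while asked < budget and remaining and len(hypotheses) > 1:
        q = min(remaining, key=lambda c: imbalance(c, hypotheses))
        if imbalance(q, hypotheses) == len(hypotheses):
            break  # no remaining question distinguishes the survivors
        remaining.remove(q)
        answer = answer_fn(q)
        hypotheses = [h for h in hypotheses if prefers_first(h, q) == answer]
        asked += 1
    return hypotheses, asked


# Simulated user whose true (hidden) weights the agent is trying to recover.
true_weights = (1, -1, 0)
survivors, asked = elicit(lambda q: prefers_first(true_weights, q))
print(f"asked {asked} questions, {len(survivors)} candidate users remain")
```

Choosing the most evenly splitting question is what keeps the question count small: each answer rules out as many candidates as possible, whereas a fixed questionnaire would spend questions on comparisons whose answer is already implied.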
Sources 7 notes
M3-Agent demonstrates that separating episodic events from semantic knowledge in an entity-centric graph, combined with parallel memorization and control processes, allows agents to infer and act on user preferences without asking. This architecture mirrors human cognitive systems that bind disparate information about individuals across sensory modalities.
Research shows LLMs including ChatGPT cannot initiate topics, plan strategically, or lead conversations because their training optimizes for responding to queries, not creating dialogue from agent goals. This passivity is reinforced by alignment objectives and masked by fluent-sounding outputs.
Research shows users strongly prefer LLM-generated interactive interfaces—dashboards, tools, animations—over text blocks, especially for structured and information-dense tasks. Structured representation and iterative refinement reduce cognitive load.
ShowUI demonstrates that GUI agents need end-to-end vision-language-action models with UI-aware token selection and interleaved streaming, not adapted general-purpose MLLMs. Standard multimodal models lack the grounding and action capabilities real interface navigation demands.
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce uncertainty over each user's coefficients. Responses can then be personalized to the user via inference-time reward alignment, without modifying model weights.
Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.
The PRIME framework shows that semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning outperforms preference-tuning methods.