Can agents learn user intent from unlabeled video without text labels?
This explores whether an agent can figure out what a user is trying to do purely by watching their screen activity — no captions, transcripts, or hand-labeled examples — and where the corpus says that approach pays off versus where it hits limits.
The corpus's most direct answer is yes, and it comes from borrowing a trick used in self-supervised vision. UI-JEPA applies JEPA-style predictive masking to raw screen recordings: it hides chunks of a UI video and trains the model to predict the missing parts, which forces it to learn task-aware temporal representations. An LLM decoder can then read those representations and infer intent with only minimal paired text. The key move is economic: it swaps the scarce, expensive resource (labeled video) for the abundant one (unlabeled streams of people using interfaces). (See: Can unlabeled UI video teach models what users intend?)
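The masking objective described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not UI-JEPA's implementation: `toy_encoder` and `toy_predictor` are hypothetical stand-ins for learned networks, frames are plain lists of floats, and the loss is computed between embeddings instead of pixels, which is the general idea behind JEPA-style objectives.

```python
import random


def sample_block_mask(num_frames, block_len, rng):
    """Pick one contiguous block of frame indices to hide."""
    start = rng.randrange(0, num_frames - block_len + 1)
    return list(range(start, start + block_len))


def toy_encoder(frame):
    """Hypothetical encoder: maps a frame (list of floats) to a 2-d embedding."""
    return [sum(frame) / len(frame), max(frame) - min(frame)]


def toy_predictor(context_embeddings):
    """Hypothetical predictor: guesses a hidden embedding as the context mean."""
    n = len(context_embeddings)
    dim = len(context_embeddings[0])
    return [sum(e[d] for e in context_embeddings) / n for d in range(dim)]


def masked_prediction_loss(frames, masked_idx):
    """Mean squared error between predicted and target embeddings of hidden frames."""
    masked = set(masked_idx)
    # Only visible frames are available to the predictor.
    context = [toy_encoder(f) for i, f in enumerate(frames) if i not in masked]
    guess = toy_predictor(context)
    total = 0.0
    for i in masked_idx:
        target = toy_encoder(frames[i])  # target lives in embedding space
        total += sum((g - t) ** 2 for g, t in zip(guess, target))
    return total / len(masked_idx)


rng = random.Random(0)
frames = [[rng.random() for _ in range(8)] for _ in range(12)]  # 12 fake frames
masked_idx = sample_block_mask(len(frames), block_len=4, rng=rng)
loss = masked_prediction_loss(frames, masked_idx)
```

In a real system the encoder and predictor would be trained to drive this loss down across many unlabeled recordings. No text labels appear anywhere in the objective, which is why the approach can use unlabeled streams.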
Watching instead of asking shows up as a broader theme, not just a single paper. M3-Agent learns user preferences from continuous multimodal observation by building an entity-centric memory graph that separates one-off episodic events from durable semantic knowledge, so it can infer what you tend to want without ever interrupting to ask. That reframes 'learning intent from video' as a memory-architecture problem as much as a perception one: the question isn't only what you can extract from a stream, but how you bind and retain it across time. (See: Can agents learn preferences by watching rather than asking?)
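The episodic/semantic split can be made concrete with a small data-structure sketch. Everything here is an assumption for illustration: the `MemoryGraph` class, the `trait` argument, and the promote-after-N-observations rule are hypothetical and are not taken from M3-Agent, which the notes describe only at the architectural level.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EntityMemory:
    episodic: list = field(default_factory=list)  # (timestamp, event) pairs
    semantic: dict = field(default_factory=dict)  # durable trait -> support count


class MemoryGraph:
    """Toy entity-centric memory: one node per entity, two kinds of memory each."""

    def __init__(self, promote_after=3):
        self.entities = defaultdict(EntityMemory)
        self.promote_after = promote_after  # assumed promotion threshold
        self._counts = defaultdict(int)

    def observe(self, entity, timestamp, event, trait=None):
        """Record a one-off event; promote a repeated trait to semantic memory."""
        node = self.entities[entity]
        node.episodic.append((timestamp, event))
        if trait is not None:
            self._counts[(entity, trait)] += 1
            if self._counts[(entity, trait)] >= self.promote_after:
                node.semantic[trait] = self._counts[(entity, trait)]

    def preferences(self, entity):
        """Return the durable traits inferred for an entity, without asking."""
        return sorted(self.entities[entity].semantic)


graph = MemoryGraph(promote_after=2)
graph.observe("alice", 1, "ordered oat latte", trait="prefers oat milk")
graph.observe("alice", 2, "opened calendar")
graph.observe("alice", 3, "ordered oat flat white", trait="prefers oat milk")
```

After these three observations, every event is retained as episodic memory, but only the repeated trait is promoted to semantic memory. That promoted set is what a preference query would read from.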
There's a quieter obstacle the corpus flags, though: making sense of the pixels themselves. Vision-only GUI agents stumble when a single model has to both recognize what an on-screen element means and decide what to do about it at the same time. OmniParser shows that pre-parsing the screen into labeled semantic elements unblocks the model by letting it focus on the action. That is a reminder that 'unlabeled video' still benefits from structure somewhere in the pipeline, even if that structure isn't human-authored intent labels. (See: Why do vision-only GUI agents struggle with screen interpretation?) The deeper limit is one that watching alone can't escape: agents trained only on demonstrated behavior are capped by what was demonstrated. Expert datasets lock competence to the curator's imagination, and a model that only ever observes can't learn from its own failures the way one that interacts can. (See: Can agents learn beyond what their training data shows?)
Which is why the corpus keeps circling back to a tension worth knowing about: silent inference versus asking out loud. Tool-using LLMs drift away from real user intent precisely because they chain actions silently instead of checking in, and conversation analysis offers a formal account, 'insert-expansions', of the moments when an agent should pause and probe rather than guess. (See: When should AI agents ask users instead of just searching?) Even efficiency cuts both ways: in simulations, an agent that proactively supplies what it infers the user needs cuts dialogue turns by 60% in medium-complexity domains, but that saving only materializes when the inference is right. (See: Could proactive dialogue make conversations dramatically more efficient?) So the honest synthesis is this: yes, agents can learn intent from unlabeled video, and the self-supervised route is real and cheap. But the corpus suggests the strongest systems pair silent observation with selective asking, because watching tells you what someone did, not always what they meant.
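One way to picture "selective asking" is a simple confidence gate over inferred intents. This is a hypothetical sketch, not a method from any of the cited notes: the `decide` function, the intent names, and both thresholds are assumptions chosen only to illustrate the act-or-ask trade-off.

```python
def decide(intent_probs, min_confidence=0.7, min_margin=0.2):
    """Act silently on a clear winner; otherwise ask about the top candidates.

    intent_probs: dict mapping intent name -> estimated probability.
    min_confidence and min_margin are assumed, illustrative thresholds.
    """
    ranked = sorted(intent_probs.items(), key=lambda kv: kv[1], reverse=True)
    top_intent, top_p = ranked[0]
    runner_up_p = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_p >= min_confidence and top_p - runner_up_p >= min_margin:
        return ("act", top_intent)
    # Ambiguous: surface the two most likely intents as a clarifying question.
    return ("ask", [name for name, _ in ranked[:2]])


clear = decide({"export_report": 0.85, "share_link": 0.10})
ambiguous = decide({"export_report": 0.45, "share_link": 0.40})
```

With these inputs, the first call acts on `export_report` and the second asks the user to choose between the two candidates. Raising the thresholds makes the agent interrupt more often, while lowering them risks the silent drift described above.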
Sources (6 notes)
UI-JEPA applies JEPA-style predictive masking to screen recordings, learning task-aware temporal representations that an LLM decoder can use to infer intent with minimal paired data. This trades the bottleneck of labeled video for abundant unlabeled streams.
M3-Agent demonstrates that separating episodic events from semantic knowledge in an entity-centric graph, combined with parallel memorization and control processes, allows agents to infer and act on user preferences without asking. This architecture mirrors human cognitive systems that bind disparate information about individuals across sensory modalities.
OmniParser demonstrates that GPT-4V fails when forced to simultaneously identify icon meanings and predict actions from raw screenshots. Pre-parsing screenshots into structured semantic elements with descriptions lets the model focus solely on action prediction, removing the composite-task bottleneck.
Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.
Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.
Simulations show proactivity—providing relevant information without being asked—cuts dialogue turns by 60% in medium-complexity domains. This behavior mirrors human conversation and Grice's maxims but is almost entirely absent from AI datasets and research benchmarks.