INQUIRING LINE

Can an AI learn what you're trying to accomplish just by watching your screen, with no labels or instructions at all?

Can agents learn user intent from unlabeled video without text labels?

This explores whether an agent can figure out what a user is trying to do purely by watching their screen activity — no captions, transcripts, or hand-labeled examples — and where the corpus says that approach pays off versus where it hits limits.

The corpus's most direct answer is yes, and it comes from borrowing a trick from self-supervised vision. UI-JEPA applies JEPA-style predictive masking to raw screen recordings: it hides chunks of a UI video and trains the model to predict the missing parts, which pushes it to learn task-aware temporal representations. An LLM decoder can then read those representations and infer intent with only minimal paired text. The key move is economic: it swaps the scarce, expensive resource, labeled video, for the abundant one, unlabeled streams of people using interfaces (see note: "Can unlabeled UI video teach models what users intend?").
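The masking objective described above can be illustrated with a toy sketch. This is not UI-JEPA's architecture: the frame embeddings are plain lists of floats, the "predictor" is just the mean of the visible embeddings, and the names `mask_span` and `latent_prediction_loss` are hypothetical. It only shows the data flow: hide a contiguous span of frames, predict their representations from the visible context, and score the prediction in representation space without any text label.

```python
import random
from typing import List, Tuple

Vector = List[float]


def mask_span(num_frames: int, span: int, rng: random.Random) -> Tuple[List[int], List[int]]:
    """Hide one contiguous span of frames; return (context, target) indices."""
    if not 0 < span < num_frames:
        raise ValueError("span must leave at least one visible frame")
    start = rng.randrange(0, num_frames - span + 1)
    target = list(range(start, start + span))
    context = [i for i in range(num_frames) if i not in target]
    return context, target


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def latent_prediction_loss(embeddings: List[Vector], context: List[int], target: List[int]) -> float:
    """Mean squared error between predicted and actual embeddings of hidden frames.

    Assumption: the predictor is a stand-in (mean of the context embeddings).
    A real system would train an encoder and a predictor network instead.
    """
    predicted = mean_vector([embeddings[i] for i in context])
    squared_error = 0.0
    for i in target:
        squared_error += sum((p - t) ** 2 for p, t in zip(predicted, embeddings[i]))
    return squared_error / (len(target) * len(predicted))
```

In a trained system the loss would drive updates to the encoder and predictor; here it only makes the point that the supervision signal comes from the video itself.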

Watching instead of asking shows up as a broader theme, not just a single paper. M3-Agent learns user preferences from continuous multimodal observation by building an entity-centric memory graph that separates one-off episodic events from durable semantic knowledge, so it can infer what you tend to want without interrupting to ask. That reframes 'learning intent from video' as a memory-architecture problem as much as a perception one: the question isn't only what you can extract from a stream, but how you bind and retain it across time (see note: "Can agents learn preferences by watching rather than asking?").
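A minimal sketch of the episodic/semantic split described above, assuming a very simple consolidation rule: an observation is promoted to durable semantic knowledge once it has been seen `promote_after` times for the same entity. That rule, and the names `MemoryGraph` and `EntityMemory`, are illustrative assumptions rather than M3-Agent's actual mechanism.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EntityMemory:
    # One-off events, kept in the order they were observed.
    episodic: List[Tuple[int, str]] = field(default_factory=list)
    # Durable knowledge: observation -> number of supporting events.
    semantic: Dict[str, int] = field(default_factory=dict)


class MemoryGraph:
    """Entity-centric store: every memory hangs off the entity it is about."""

    def __init__(self, promote_after: int = 3) -> None:
        self.promote_after = promote_after  # assumed consolidation threshold
        self.entities: Dict[str, EntityMemory] = defaultdict(EntityMemory)
        self._counts: Dict[Tuple[str, str], int] = defaultdict(int)

    def observe(self, entity: str, timestamp: int, observation: str) -> None:
        """Record an event; promote it to semantic memory if it keeps recurring."""
        memory = self.entities[entity]
        memory.episodic.append((timestamp, observation))
        self._counts[(entity, observation)] += 1
        count = self._counts[(entity, observation)]
        if count >= self.promote_after:
            memory.semantic[observation] = count

    def preferences(self, entity: str) -> List[str]:
        """Durable knowledge about an entity, without asking the user anything."""
        memory = self.entities.get(entity)
        return sorted(memory.semantic) if memory else []
```

Keeping both stores lets the agent answer "what happened on Tuesday?" from episodic memory and "what does this person usually want?" from semantic memory.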

There's a quieter obstacle the corpus flags, though: making sense of the pixels themselves. Vision-only GUI agents stumble when a single model has to both recognize what an on-screen element means and decide what to do about it at the same time. OmniParser shows that pre-parsing the screen into labeled semantic elements unblocks the model by letting it focus on the action, a reminder that 'unlabeled video' still benefits from structure somewhere in the pipeline, even if that structure isn't human-authored intent labels (see note: "Why do vision-only GUI agents struggle with screen interpretation?"). The deeper limit is one that watching alone can't escape: agents trained only on demonstrated behavior are capped by what was demonstrated. Expert datasets lock competence to the curator's imagination, and a model that only ever observes can't learn from its own failures the way one that interacts can (see note: "Can agents learn beyond what their training data shows?").
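The pre-parsing idea above can be sketched as a hand-off between two stages: a parser turns the screenshot into described elements, and the action model sees only that list. The `ScreenElement` fields and the prompt layout below are assumptions made for illustration; they are not OmniParser's output format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ScreenElement:
    element_id: int
    kind: str                        # e.g. "button", "icon", "text_field"
    description: str                 # what the element means, from the parser
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def elements_to_prompt(task: str, elements: List[ScreenElement]) -> str:
    """Render parsed elements as text so the model only has to choose an action."""
    lines = [f"Task: {task}", "Elements on screen:"]
    for element in elements:
        lines.append(
            f"[{element.element_id}] {element.kind}: {element.description} at {element.bbox}"
        )
    lines.append("Reply with the id of the element to act on.")
    return "\n".join(lines)
```

With the meaning of each icon already written out, the downstream model no longer has to identify elements and pick an action in the same step.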

Which is why the corpus keeps circling back to a tension worth knowing about: silent inference versus asking out loud. Tool-using LLMs drift away from real user intent precisely because they chain actions silently instead of checking in, and conversation analysis offers a formal account, 'insert-expansions', of the moments when an agent should pause and probe rather than guess (see note: "When should AI agents ask users instead of just searching?"). Even efficiency cuts both ways: in simulations, proactively supplying what the agent infers you need cuts dialogue turns by up to 60% in medium-complexity domains, but only when the inference is right (see note: "Could proactive dialogue make conversations dramatically more efficient?"). So the honest synthesis is this: yes, agents can learn intent from unlabeled video, and the self-supervised route is real and cheap. But the corpus suggests the strongest systems pair silent observation with selective asking, because watching tells you what someone did, not always what they meant.
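One way to picture pairing silent observation with selective asking is a simple gate on the inferred intent: act when one guess is both confident and clearly ahead of the alternatives, otherwise pause and ask. The thresholds, the `IntentGuess` type, and the `next_move` function are hypothetical; the insert-expansions work describes when to ask in conversational terms, not as a numeric rule like this one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IntentGuess:
    label: str
    confidence: float  # assumed to lie in [0, 1]


def next_move(guesses: List[IntentGuess], act_threshold: float = 0.8, margin: float = 0.2) -> str:
    """Act silently only when the top guess is confident and well separated."""
    ranked = sorted(guesses, key=lambda guess: guess.confidence, reverse=True)
    if not ranked:
        return "ask: What are you trying to do?"
    best = ranked[0]
    runner_up = ranked[1].confidence if len(ranked) > 1 else 0.0
    if best.confidence >= act_threshold and best.confidence - runner_up >= margin:
        return f"act: {best.label}"
    # Ambiguous or weak inference: check in instead of guessing.
    return f"ask: Did you mean to {best.label}?"
```

For example, guesses of 0.9 for "book a flight" and 0.3 for "check prices" lead to acting, while 0.55 against 0.45 leads to a clarifying question.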

Sources (6 notes)

Can unlabeled UI video teach models what users intend?

UI-JEPA applies JEPA-style predictive masking to screen recordings, learning task-aware temporal representations that an LLM decoder can use to infer intent with minimal paired data. This trades the bottleneck of labeled video for abundant unlabeled streams.

Can agents learn preferences by watching rather than asking?

M3-Agent demonstrates that separating episodic events from semantic knowledge in an entity-centric graph, combined with parallel memorization and control processes, allows agents to infer and act on user preferences without asking. This architecture mirrors human cognitive systems that bind disparate information about individuals across sensory modalities.

Why do vision-only GUI agents struggle with screen interpretation?

OmniParser demonstrates that GPT-4V fails when forced to simultaneously identify icon meanings and predict actions from raw screenshots. Pre-parsing screenshots into structured semantic elements with descriptions lets the model focus solely on action prediction, removing the composite-task bottleneck.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

When should AI agents ask users instead of just searching?

Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.

Could proactive dialogue make conversations dramatically more efficient?

Simulations show proactivity—providing relevant information without being asked—cuts dialogue turns by 60% in medium-complexity domains. This behavior mirrors human conversation and Grice's maxims but is almost entirely absent from AI datasets and research benchmarks.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity (1.73 match)
DiscussLLM: Teaching Large Language Models When to Speak (1.72 match)
Proactive Conversational Agents in the Post-ChatGPT World (1.70 match)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent (1.65 match)
ScreenAI: A Vision-Language Model for UI and Infographics Understanding (1.63 match)
MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind (1.62 match)
OmniParser for Pure Vision Based GUI Agent (0.90 match)
Insert-expansions For Tool-enabled Conversational Agents (0.87 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher tracking whether agents can infer user intent from unlabeled video—a question that straddles self-supervised vision, memory architecture, and agentic dialogue. A curated library (spanning 2021–2026) found these dated claims; your task is to stress-test them against the latest capabilities and methods.

What a curated library found—and when:
• Self-supervised predictive masking (JEPA-style) on raw UI video can extract task-aware representations without paired text labels; an LLM decoder then infers intent with minimal examples (UI-JEPA, 2024).
• Entity-centric memory graphs separate episodic events from durable semantic knowledge, enabling preference learning from continuous multimodal streams without explicit user queries (M3-Agent, implicit in corpus; memory-update failure noted 2026).
• Vision-only agents underperform when a single model must both parse visual semantics AND decide actions simultaneously; pre-parsing screens into labeled elements unblocks performance (OmniParser, 2024).
• Agents trained only on demonstrations plateau at curator expertise; silent inference misses intent because it cannot learn from its own failures (2024–2025 findings).
• Proactively supplying inferred information can cut dialogue turns by ~60% in simulated medium-complexity domains, but only when the inference is correct; without verification, silent action drifts from user intent, which selective asking ("insert-expansions") is meant to prevent (2023–2025).

Anchor papers (verify; mind their dates):
• arXiv:2409.04081 (UI-JEPA, 2024) — self-supervised video masking for intent
• arXiv:2408.00203 (OmniParser, 2024) — semantic pre-parsing of GUIs
• arXiv:2307.01644 (Insert-expansions, 2023) — when agents should ask
• arXiv:2501.00383 (Proactive Conversational Agents, 2024) — efficient intent inference

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each claim above, judge whether newer models (Gemini 2.0, o1, newer vision encoders), multi-agent orchestration, in-context learning at scale, or improved video understanding have RELAXED or OVERTURNED the limits. Distinguish the durable question (can intent be inferred from video?) from perishable bottlenecks (e.g., vision parsing, memory decay, silent-vs.-interactive trade-offs). Cite what resolved each, and plainly state where constraints still hold.

(2) **Surface the strongest contradicting or superseding work from the last ~6 months.** Has any recent work shown that agents *cannot* learn intent from video, or that the self-supervised route fails at scale? Has dialogue-free intent inference outpaced selective asking?

(3) **Propose 2 research questions that ASSUME the regime may have moved:**
   – Do large multimodal models (vision + language) pre-trained at scale obsolete JEPA-style masking for intent learning, or do they require task-specific tuning that masking still enables?
   – Can memory decay (documented 2026) be mitigated by continuous refinement signals from user feedback, or is the trade-off (fresh but noisy vs. stale but reliable) fundamental?

**Guardrail:** Cite arXiv IDs; flag anything you cannot ground in a real paper.
