Can unlabeled UI video teach models what users intend?
Can temporal masking on screen recordings learn task-aware representations without paired text labels? This matters because labeled UI video is scarce and expensive, so self-supervised learning could unlock scaling.
UI-JEPA argues that prior UI-understanding approaches misframe the problem at two levels: granularity and time. On granularity, pretrained UI transformers operate at the component level and miss the concept of a task. On time, image-encoder-plus-LLM systems handle static screenshots and miss temporal structure: they can list widgets but cannot understand what a sequence of UI actions accomplishes. Crawler-based systems do handle specific tasks but generalize poorly to unseen ones.
The hypothesis is that user intent is a temporal property of UI activity, not a spatial property of any frame. UI-JEPA therefore processes video sequences of UI actions during task execution, training a JEPA-based encoder with temporal masking on unlabeled UI video — predicting fully masked frames from unmasked frames. Because predicting masked frames forces the encoder to capture temporal relationships and task structure, the resulting representations encode what the user is trying to do, not just what is on the screen.
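The temporal-masking objective described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the UI-JEPA implementation: the function names `temporal_mask` and `latent_prediction_loss`, the 50% mask ratio, and the mean-squared-error loss are all hypothetical choices made for the example. The point it shows is that whole frames are hidden (not patches within a frame) and that the prediction error is measured between embeddings rather than pixels.

```python
import random


def temporal_mask(num_frames, mask_ratio=0.5, seed=0):
    """Split frame indices into visible (context) and fully masked (target) sets.

    Hypothetical helper: whole frames are masked, so the model has to use
    temporal context, not neighbouring pixels, to fill the gap.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to mask one and keep one")
    if not 0.0 < mask_ratio < 1.0:
        raise ValueError("mask_ratio must be strictly between 0 and 1")
    # Keep at least one masked frame and at least one visible frame.
    num_masked = max(1, min(num_frames - 1, round(num_frames * mask_ratio)))
    rng = random.Random(seed)
    masked = sorted(rng.sample(range(num_frames), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_frames) if i not in masked_set]
    return visible, masked


def latent_prediction_loss(predicted, target):
    """Mean squared error between predicted and target frame embeddings.

    Assumption: MSE is used here only for illustration; the loss is computed
    in embedding space, which is the JEPA-style part of the sketch.
    """
    total, count = 0.0, 0
    for p_vec, t_vec in zip(predicted, target):
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


# Usage: mask half of an 8-frame clip, then score a toy prediction.
visible, masked = temporal_mask(num_frames=8, mask_ratio=0.5)
loss = latent_prediction_loss([[1.0, 2.0]], [[1.0, 4.0]])
```

In a real training loop the visible frames would pass through the context encoder, a predictor would output one embedding per masked index, and the targets would come from a separate target encoder applied to the masked frames.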
The decoder side is an LLM conditioned on these representations to produce textual user-intent descriptions. The empirical claim that earns its keep is data efficiency: fine-tuning the decoder requires a fraction of the paired video-text data and compute that state-of-the-art multimodal LLMs (MLLMs) need. The architecture thereby trades the bottleneck of paired labels for the abundance of unlabeled screen recordings.
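One common way to condition an LLM on encoder outputs is to project each video embedding into the LLM's token-embedding space and prepend the result to the prompt embeddings. The sketch below assumes that pattern purely for illustration; the note does not say how UI-JEPA wires the encoder to the decoder, and `project`, `build_decoder_input`, and the toy weight matrix are hypothetical names and values.

```python
def project(embedding, weights):
    """Apply a linear map; `weights` holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


def build_decoder_input(frame_embeddings, prompt_embeddings, weights):
    """Prepend projected video embeddings to the prompt token embeddings.

    Assumption: the decoder consumes a single sequence of vectors, so the
    video is presented as a prefix of "soft tokens" ahead of the text prompt.
    """
    projected = [project(e, weights) for e in frame_embeddings]
    return projected + prompt_embeddings


# Usage: one 2-d video embedding mapped into a 3-d decoder space,
# followed by a single (toy) prompt token embedding.
weights = [[1, 0], [0, 1], [1, 1]]
sequence = build_decoder_input([[1.0, 2.0]], [[0.0, 0.0, 0.0]], weights)
```

Under this pattern only the projection (and optionally a small part of the decoder) needs paired video-text data, which is where the data-efficiency claim would apply.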
The broader implication is a separation of concerns: temporal and structural understanding is learned self-supervised on unlabeled streams, and semantic intent inference is layered on top via a small LLM decoder. When labeled data is scarce, the right move is to push the learning into self-supervision and keep the supervised layer thin. This is the same architectural move described in "Why do vision-only GUI agents struggle with screen interpretation?": factor the perception sub-problem out of the foundation model and hand it the structured signal it can actually use.
Inquiring lines that use this note as a source (22)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can parsing screens into structured elements before acting improve vision models?
- How much does autonomous action without prompting affect user perception?
- Why does explicit screen parsing outperform pure vision in GUI agents?
- Can agents learn user intent from unlabeled video without text labels?
- What makes a self-supervised pruning metric work without labels at scale?
- Why does pure-vision underperform when parsing semantics and action prediction mix?
- What role does prediction error play in human event segmentation?
- Can temporal ranking improve retrieval without modifying the underlying video model?
- Why do static screenshot models fail to capture multi-step UI task intent?
- What temporal signals in screen recordings matter most for task understanding?
- What makes accessibility trees insufficient compared to visual GUI understanding?
- Can self-supervised process models replace human annotations at scale?
- Can input-only training encode user preferences without task-specific labels?
- How does UI-guided token selection reduce compute compared to standard vision?
- What makes high-quality GUI instruction data different from general vision data?
- Why do image captions create different friction than pure video data?
- Can predictive self-supervision work on unlabeled sequential visual data?
- How does annotation-based pretraining compare to self-supervised video masking for screen understanding?
- Why does identifying UI element types and locations enable downstream task learning?
- Why do small specialized models match frontier multimodal models on screen tasks?
- Can text-based and vision-based screen understanding achieve similar performance?
- How can frame sampling and ranking improve temporal understanding in long-video retrieval?
Related concepts in this collection (5)
- Why do vision-only GUI agents struggle with screen interpretation?
Exploring whether GPT-4V's performance bottleneck in GUI automation stems from the simultaneous cognitive load of parsing icon semantics and predicting actions, and whether factoring these tasks improves reliability.
complements: same factoring move (specialized perception layer + foundation model on top) applied to temporal video rather than spatial screenshots.
- Do text-based GUI agents actually work in the real world?
Can language-only agents that rely on HTML or accessibility trees handle actual user interfaces without structured metadata? This matters because deployed systems face visual screenshots, not oracle data.
extends: ShowUI argues UI perception needs UI-specialized VLA models; UI-JEPA is the upstream pretraining recipe — UI-shaped self-supervision before the supervised layer.
- How can GUI agents adapt when software constantly changes?
Can desktop automation agents stay current by combining real-time web documentation with learned task patterns and concrete execution memories? This explores how to avoid training obsolescence in open-world software environments.
complements: Agent S relies on episodic memory of UI traces — UI-JEPA-style representations could provide a richer encoding for that episodic store than raw screenshots.
- Can models reason without generating visible thinking steps?
Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning.
extends: same family principle — useful representations don't require verbalization. UI-JEPA shows the predictive-feature principle works for UI temporal understanding.
- Can careful curation replace massive alignment datasets?
Does fine-tuning a strong pretrained model on 1000 carefully selected examples achieve alignment quality comparable to models trained on vastly larger datasets? This challenges assumptions about data volume in post-training.
complements: data efficiency in the LLM decoder layer is enabled by self-supervised pretraining on the encoder side; the two notes cover paired sides of the data-efficiency story.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity
- MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
- Do Language Models Understand Time?
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent
- LatentQA: Teaching LLMs to Decode Activations Into Natural Language
- The Demon is in Ambiguity: Revisiting Situation Recognition with Single Positive Multi-Label Learning
- Learn from your own latents and not from tokens: A sample-complexity theory
Original note title
predictive video masking on UI activity learns user intent without paired text — JEPA-style self-supervision turns unlabeled screen recordings into a usable signal