SYNTHESIS NOTE

Do text-based GUI agents actually work in the real world?

Can language-only agents that rely on HTML or accessibility trees handle actual user interfaces without structured metadata? This matters because deployed systems face visual screenshots, not oracle data.

Synthesis note · 2026-05-03 · sourced from Visual GUI Agents

ShowUI's framing critique is that the dominant GUI agent paradigm (language-based agents that call closed-source APIs and consume text-rich meta-information such as HTML or accessibility trees) assumes oracle access that real-world deployment does not have. Users interact with interfaces visually, through screenshots, without the underlying structural information that text-based agents depend on. On this view, the text-only approach is architecturally limited regardless of model scale.

But GUI visual perception is not a problem natural-image MLLMs solve well. UI tasks need specialized capabilities — element grounding, action execution — rather than the conversational abilities multimodal chatbots are tuned for. ShowUI proposes three innovations addressing the resulting gaps.

UI-Guided Visual Token Selection treats screenshots as UI-connected graphs and adaptively identifies redundant relationships, using these as criteria for token selection during self-attention. This reduces compute by exploiting that screenshots are not natural images — large portions are visually redundant (background, repeated elements) and the connectivity structure of UI components encodes which tokens carry information.
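A minimal sketch of the idea, under stated assumptions: the screenshot is already split into a grid of patches, each summarized by one RGB value; adjacent patches with near-identical colour are treated as redundant and merged with union-find; and a random subset of tokens is kept per connected component. The note only says that redundant relationships in a UI-connected graph drive token selection, so the similarity rule, the `keep_ratio` parameter, and the function names `ui_connected_components` and `select_tokens` are illustrative, not ShowUI's actual implementation.

```python
import random


def ui_connected_components(patches, tol=0):
    """Group adjacent patches with similar colour into components.

    patches: grid (list of rows) of (r, g, b) tuples, one per patch.
    Returns a grid of component ids with the same shape.
    """
    rows, cols = len(patches), len(patches[0])
    parent = list(range(rows * cols))

    def find(i):
        # Walk to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    def similar(p, q):
        # Assumed redundancy rule: every channel differs by at most tol.
        return all(abs(x - y) <= tol for x, y in zip(p, q))

    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols and similar(patches[r][c], patches[r][c + 1]):
                union(i, i + 1)
            if r + 1 < rows and similar(patches[r][c], patches[r + 1][c]):
                union(i, i + cols)
    return [[find(r * cols + c) for c in range(cols)] for r in range(rows)]


def select_tokens(component_ids, keep_ratio=0.5, seed=0):
    """Keep a random share of patch indices per component (at least one)."""
    rng = random.Random(seed)
    cols = len(component_ids[0])
    groups = {}
    for r, row in enumerate(component_ids):
        for c, cid in enumerate(row):
            groups.setdefault(cid, []).append(r * cols + c)
    kept = []
    for members in groups.values():
        k = max(1, round(len(members) * keep_ratio))
        kept.extend(rng.sample(members, k))
    return sorted(kept)


WHITE, BLUE, RED = (255, 255, 255), (0, 0, 255), (255, 0, 0)
grid = [[WHITE, WHITE, BLUE],
        [WHITE, WHITE, RED]]
ids = ui_connected_components(grid)
kept = select_tokens(ids, keep_ratio=0.5)
```

In this toy grid the four white background patches collapse into one component and lose half their tokens, while the single-patch blue and red elements are always kept. That is the sense in which the connectivity structure indicates which tokens carry information.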

Interleaved Vision-Language-Action Streaming unifies two distinct needs within GUI tasks: managing visual-action history during navigation, and pairing multi-turn query-action sequences with a single screenshot to improve training efficiency. Treating vision, language, and action as a single interleaved stream is more flexible than the staged pipelines that dominate prior work.
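The two uses of interleaving described above can be sketched as a single flat sequence of tagged segments. This is a hypothetical data layout for illustration only: the `Segment` type, the two builder functions, and the string encodings of screenshots and actions are assumptions, not ShowUI's actual format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    modality: str  # "vision", "language", or "action"
    content: str   # placeholder payload, e.g. an image reference or action string


def navigation_stream(instruction, steps):
    """One instruction followed by alternating screenshot/action history.

    steps: ordered list of (screenshot_ref, action) pairs.
    """
    stream = [Segment("language", instruction)]
    for screenshot, action in steps:
        stream.append(Segment("vision", screenshot))
        stream.append(Segment("action", action))
    return stream


def grounding_stream(screenshot, query_action_pairs):
    """One screenshot shared by several query/action turns."""
    stream = [Segment("vision", screenshot)]
    for query, action in query_action_pairs:
        stream.append(Segment("language", query))
        stream.append(Segment("action", action))
    return stream


nav = navigation_stream(
    "Open settings",
    [("shot_0.png", "CLICK(0.91, 0.05)"), ("shot_1.png", "CLICK(0.40, 0.32)")],
)
ground = grounding_stream(
    "shot_0.png",
    [("the search box", "CLICK(0.50, 0.08)"), ("the menu icon", "CLICK(0.91, 0.05)")],
)
```

In the grounding case the screenshot is encoded once and reused across turns, which is where the training-efficiency gain mentioned above would come from under this layout.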

Small-scale High-quality GUI Instruction Datasets result from careful curation and resampling against type imbalance — the data-side intervention that lets the architectural innovations actually train.
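As a rough illustration of resampling against type imbalance, the sketch below draws the same number of examples from every type, sampling with replacement only when a type has too few. The note does not specify the resampling scheme, so the equal-quota rule, the `per_type` parameter, and the function name `balanced_resample` are assumptions.

```python
import random
from collections import defaultdict


def balanced_resample(samples, type_of, per_type, seed=0):
    """Return a shuffled list with exactly per_type examples of each type."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for sample in samples:
        by_type[type_of(sample)].append(sample)

    out = []
    for label in sorted(by_type):
        pool = by_type[label]
        if len(pool) >= per_type:
            # Enough data: downsample without replacement.
            out.extend(rng.sample(pool, per_type))
        else:
            # Too little data: upsample with replacement.
            out.extend(rng.choices(pool, k=per_type))
    rng.shuffle(out)
    return out


# Toy data: text elements heavily outnumber icon elements.
data = [{"type": "text", "id": i} for i in range(90)]
data += [{"type": "icon", "id": i} for i in range(10)]
balanced = balanced_resample(data, lambda s: s["type"], per_type=20)
```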

The implication is that GUI visual agents are not a special case of multimodal models: they are a domain where the visual prior, the action vocabulary, and the data distribution all need to be UI-shaped from the start. This strong end-to-end position is in tension with the perception-factoring camp. The approaches covered in the notes "Why do vision-only GUI agents struggle with screen interpretation?" and "Can structured interfaces help language models control GUIs better?" both keep the foundation MLLM general-purpose and add a structured perception layer. ShowUI argues the perception layer should be inside the model, made UI-shaped end-to-end. The two camps disagree on whether to factor perception out or train it in.

Inquiring lines that read this note 14

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Should GUI agents use structured representations instead of raw pixels?

How do standardized protocols improve coordination in multi-agent systems?

Can API-first interaction replace traditional UI-based agent interfaces?

How do we evaluate AI systems when user perception misleads actual performance?

How does API-first interaction compare to generative interface approaches?

How should conversational agents balance goal-driven initiative with user control?

Why might text-only interfaces underestimate agent preference elicitation capabilities?

What drives capability and cost efficiency in agent systems?

Why do APIs outperform UIs for agent task completion?

Related concepts in this collection 5

Why do vision-only GUI agents struggle with screen interpretation? Exploring whether GPT-4V's performance bottleneck in GUI automation stems from the simultaneous cognitive load of parsing icon semantics and predicting actions, and whether factoring these tasks improves reliability.
contradicts: OmniParser argues for factoring perception OUT of the foundation model with a pre-processing parser; ShowUI argues for building perception IN with UI-specialized VLA models. Same problem, opposite architectural answer.
Can structured interfaces help language models control GUIs better? Explores whether separating visual understanding from element grounding through an intermediate interface layer improves how language models interact with graphical interfaces. Matters because current end-to-end approaches ask models to do too much at once.
tension with: Agent S uses accessibility tree as a grounding aid alongside vision; ShowUI argues accessibility-tree dependence is the architectural ceiling that must be removed for real-world deployment.
Can unlabeled UI video teach models what users intend? Can temporal masking on screen recordings learn task-aware representations without paired text labels? This matters because labeled UI video is scarce and expensive, so self-supervised learning could unlock scaling.
complements: UI-JEPA is the self-supervised pretraining recipe that ShowUI's UI-specialized VLA approach depends on — UI-shaped perception needs UI-shaped pretraining.
Why do planning and grounding pull against each other in agents? Planning requires flexibility and error recovery while grounding demands action accuracy. Do these conflicting optimization requirements force a design choice about how to structure agent architectures?
complicates: AutoGLM's intermediate-interface argument depends on factoring; ShowUI suggests the factoring may be a sub-optimal compromise that better UI-shaped models will eventually obviate.
Do generated interfaces outperform text-based chat for most tasks? Explores whether LLMs should create interactive UIs instead of text responses, and under what conditions users prefer dynamic interfaces to traditional conversational chat.
connects: if interfaces become generative and dynamic, the case for UI-shaped end-to-end vision strengthens — accessibility trees won't exist for novel generated interfaces.
