INQUIRING LINE

Should GUI agents use intermediate structured representations instead of raw pixels?

This explores whether agents that operate software interfaces should work from parsed, semantic descriptions of the screen rather than directly from raw screenshots — and the corpus suggests the more interesting answer is that the best systems don't choose, they layer.


This explores whether GUI agents should read intermediate structured representations instead of raw pixels, and the corpus leans clearly toward "yes, structure helps," while complicating *why*. The core finding is that asking a vision-language model to do two jobs at once (work out what each icon means *and* decide what to click) overloads it. OmniParser showed that GPT-4V fails on raw screenshots precisely because of this composite burden; pre-parsing the screen into labeled semantic elements lets the model spend all its effort on the actual decision (see "Why do vision-only GUI agents struggle with screen interpretation?"). So the case for structure isn't really about pixels versus text; it's about *separating tasks that fight each other when fused*.
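To make "pre-parsing the screen into labeled semantic elements" concrete, here is a minimal sketch in Python. The `ScreenElement` schema and both helper functions are hypothetical illustrations, not OmniParser's actual output format or API. The idea it shows: the model only has to pick an element id from a short text list, and the coordinates are resolved outside the model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenElement:
    """One parsed UI element (hypothetical schema, assumed for illustration)."""
    element_id: int
    kind: str                          # e.g. "button", "icon"
    description: str                   # semantic label produced by the parsing step
    bbox: tuple[int, int, int, int]    # (left, top, right, bottom) in pixels


def render_for_model(elements: list[ScreenElement]) -> str:
    """Serialize parsed elements into a numbered list the model can choose from."""
    return "\n".join(
        f"[{e.element_id}] {e.kind}: {e.description}" for e in elements
    )


def click_point(elements: list[ScreenElement], chosen_id: int) -> tuple[int, int]:
    """Resolve the model's chosen element id to the center of its bounding box."""
    for e in elements:
        if e.element_id == chosen_id:
            left, top, right, bottom = e.bbox
            return ((left + right) // 2, (top + bottom) // 2)
    raise KeyError(f"no element with id {chosen_id}")


elements = [
    ScreenElement(0, "icon", "settings gear", (10, 10, 30, 30)),
    ScreenElement(1, "button", "Save", (100, 200, 180, 240)),
]
prompt_fragment = render_for_model(elements)
```

In this sketch the decision ("which element?") and the perception ("what is each element, and where?") never share a single model call, which is the separation the paragraph above describes.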

That separation principle is where the corpus gets interesting, because multiple independent systems converged on it. Agent S, AutoGLM, and OmniParser all landed on splitting an agent into a *planning* layer and a *grounding* layer, mediated by a language-centric interface, because planning and grounding have opposing optimization needs and shouldn't be jammed into one end-to-end prediction (see "How should agents split planning from visual grounding?" and "Can structured interfaces help language models control GUIs better?"). The structured representation (accessibility trees, parsed elements) is essentially the seam that lets each layer be optimized on its own terms. Put plainly: the win isn't "text beats pixels," it's "give the model one thing to think about at a time."
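The planning/grounding split can be sketched as two functions that communicate only through a small, language-centric step description. Everything here is an assumption for illustration: `Step`, `Action`, `plan`, `ground`, and `run` are hypothetical names, the planner is a hard-coded stand-in for a language model, and the grounder is a plain dictionary lookup rather than any system's real grounding model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """What the planner emits: an intent in words, with no coordinates."""
    verb: str      # e.g. "click"
    target: str    # natural-language name of the element


@dataclass(frozen=True)
class Action:
    """What the grounder emits: a concrete, executable action."""
    verb: str
    x: int
    y: int


def plan(goal: str) -> list[Step]:
    """Stand-in planner; a real system would call a language model here."""
    if goal == "save the document":
        return [Step("click", "File menu"), Step("click", "Save")]
    raise ValueError(f"no plan for goal: {goal!r}")


def ground(step: Step, labeled_positions: dict[str, tuple[int, int]]) -> Action:
    """Stand-in grounder; maps an element description to screen coordinates."""
    if step.target not in labeled_positions:
        raise LookupError(f"cannot ground {step.target!r}")
    x, y = labeled_positions[step.target]
    return Action(step.verb, x, y)


def run(goal: str, labeled_positions: dict[str, tuple[int, int]]) -> list[Action]:
    """Plan first, then ground each step independently."""
    return [ground(step, labeled_positions) for step in plan(goal)]


positions = {"File menu": (20, 10), "Save": (40, 80)}
actions = run("save the document", positions)
```

Because `plan` never sees coordinates and `ground` never sees the goal, either side could be swapped out or tuned without touching the other, which is the point of putting a seam between them.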

But there's a sharp counter-voice. ShowUI argues that text-based representations like HTML and accessibility trees *miss what humans actually see* on screen, and that real interface navigation needs purpose-built vision-language-action models rather than general multimodal models bolted onto a parser (see "Do text-based GUI agents actually work in the real world?"). So structure can also throw away information. The reconciliation most of these systems reach is *both/and*: Agent S feeds visual input for environmental understanding *plus* image-augmented accessibility trees for grounding, rather than picking one (see "Can structured interfaces help language models control GUIs better?").

Here's the thing you didn't know you wanted to know: the most radical answer in the corpus is to skip the GUI entirely. The AXIS framework shows that when agents call APIs instead of clicking through interfaces, task completion time drops 65–70% while accuracy stays at 97–98%, and the system can auto-discover APIs hidden inside existing apps (see "Can API-first agents outperform UI-based agent interaction?"). A GUI is, after all, a representation designed for human eyes and hands. If an agent doesn't have those constraints, the screenshot itself may be the unnecessary intermediate layer. This reframes the whole question: "structured representation vs. pixels" is a debate that only matters once you've decided the agent must go through the GUI at all.
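A rough sketch of what "API first, GUI only as a fallback" might look like as a dispatcher. This is not the AXIS framework's code or interface; `API_REGISTRY`, `register_api`, `UI_FALLBACKS`, `perform`, and the example tasks are all hypothetical names invented for illustration, and the step counts are made up rather than measured.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical registry of tasks that have a direct API route.
API_REGISTRY: dict[str, Callable[..., str]] = {}


def register_api(name: str) -> Callable[[Callable[..., str]], Callable[..., str]]:
    """Decorator that records a function as the API route for a task name."""
    def decorator(fn: Callable[..., str]) -> Callable[..., str]:
        API_REGISTRY[name] = fn
        return fn
    return decorator


@register_api("insert_table")
def insert_table(rows: int, cols: int) -> str:
    return f"table {rows}x{cols} inserted via API"


# Hypothetical click sequences for the same tasks done through the GUI.
UI_FALLBACKS: dict[str, list[str]] = {
    "insert_table": ["click Insert tab", "click Table", "drag grid", "release"],
    "change_theme": ["click Design tab", "click Themes", "click first theme"],
}


def perform(task: str, **kwargs: int) -> tuple[str, int]:
    """Return (route taken, number of interaction steps), preferring the API."""
    if task in API_REGISTRY:
        API_REGISTRY[task](**kwargs)
        return ("api", 1)
    return ("ui", len(UI_FALLBACKS[task]))


api_result = perform("insert_table", rows=2, cols=3)
ui_result = perform("change_theme")
```

The toy numbers only illustrate the mechanism: one call replaces a multi-step click sequence when an API exists, and the GUI path remains available for tasks that have none.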

If you want to zoom out further, the GUI debate is one instance of a broader pattern in agent design: reliability tends to come from *externalizing* hard sub-problems (memory, skills, and interaction protocols) into a structured harness layer rather than asking a bigger model to solve everything internally (see "Where does agent reliability actually come from?"). The parsed screen is exactly this move applied to perception: don't make the model re-derive the interface every step; hand it structure. That's the deeper reason intermediate representations keep winning.


Sources (6 notes)

Why do vision-only GUI agents struggle with screen interpretation?

OmniParser demonstrates that GPT-4V fails when forced to simultaneously identify icon meanings and predict actions from raw screenshots. Pre-parsing screenshots into structured semantic elements with descriptions lets the model focus solely on action prediction, removing the composite-task bottleneck.

How should agents split planning from visual grounding?

Multiple independent systems (Agent S, AutoGLM, OmniParser) converged on factoring agent reasoning into a planning layer and a grounding layer, with a language-centric Agent-Computer Interface mediating between them due to their opposing optimization requirements.

Can structured interfaces help language models control GUIs better?

Agent S's dual-input design—visual input for environmental understanding plus image-augmented accessibility trees for grounding—achieved 9.37% improvement over baseline by factoring planning and grounding into separate optimization paths rather than forcing end-to-end prediction.

Do text-based GUI agents actually work in the real world?

ShowUI demonstrates that GUI agents need end-to-end vision-language-action models with UI-aware token selection and interleaved streaming, not adapted general-purpose MLLMs. Standard multimodal models lack the grounding and action capabilities real interface navigation demands.

Can API-first agents outperform UI-based agent interaction?

The AXIS framework shows that prioritizing API calls over sequential UI interactions cuts task completion time by 65–70% while maintaining 97–98% accuracy and reducing cognitive workload by 38–53%. A self-exploration mechanism automatically discovers and constructs APIs from existing applications, solving the bootstrapping problem.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether GUI agents should use intermediate structured representations instead of raw pixels — a question a curated library explored across 2024–2026, but whose constraints may have shifted.

What a curated library found — and when (dated claims, not current truth):
Library findings span Feb 2024–May 2026. Key constraints documented then:
• Vision-language models fail on raw screenshots because they must simultaneously identify UI elements AND decide actions — OmniParser (2408) showed GPT-4V's composite burden; structure splits this into planning + grounding layers (2410, 2409).
• Text-based representations (HTML, accessibility trees) miss visual saliency humans perceive; ShowUI (2411) argues purpose-built vision-language-action models outperform general multimodal + parser pipelines.
• Agent S and AutoGLM converged on language-centric interfaces mediating visual input + structured elements (2410, 2409), optimizing each layer separately.
• API-first interaction (AXIS framework, ~2024) reduces task completion time 65–70% vs. GUI navigation while maintaining ~98% accuracy — reframing the GUI itself as an unnecessary intermediate layer if APIs are available (2409).
• Externalization into harness-layer scaffolding (memory, skills, protocols) emerges as the deeper pattern driving reliability gains (2604, 2605).

Anchor papers (verify; mind their dates):
• arXiv:2408.00203 — OmniParser for Pure Vision Based GUI Agent (2024-08)
• arXiv:2410.08164 — Agent S: An Open Agentic Framework (2024-10)
• arXiv:2411.17465 — ShowUI: One Vision-Language-Action Model (2024-11)
• arXiv:2604.08224 — Externalization in LLM Agents (2026-04)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the composite-burden thesis: have newer VLMs (GPT-4o, Claude 3.5, Gemini 2.0) or fine-tuned GUI-specific models *relaxed* the need for task separation? Test whether end-to-end vision-to-action now works at scale. For text-vs.-vision: has the gap between ShowUI-style models and hybrid (visual + structured) pipelines narrowed, or has one regime clearly won? Distinguish what remains unsolved (e.g., generalization to novel UIs, long-episode planning) from what newer tooling or training has addressed.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from ~Oct 2024–present: any papers showing end-to-end vision agents *outperforming* structured representation pipelines, or new evaluation suites that flip the verdict?
(3) Propose 2 research questions that ASSUME the regime has moved: (a) If structured representations have become a bottleneck rather than a boost (e.g., because parsing overhead now exceeds gains), what replaces them? (b) If the API-first result (65–70% speedup) now dominates practice, is the GUI-agent research direction becoming niche, and what is the forward research frontier?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
