Can one model understand both UIs and infographics equally well?
Screen UIs and infographics share visual structure but have been tackled separately. Can a unified schema and annotation-based pretraining bridge them in a single small model?
Planning requires flexibility and error recovery while grounding demands action accuracy. Do these conflicting optimization requirements force a design choice about how to structure agent architectures?
Does GPT-4V's performance bottleneck in GUI automation stem from having to parse icon semantics and predict actions at the same time, and does factoring these tasks apart improve reliability?
Can language-only agents that rely on HTML or accessibility trees handle actual user interfaces without structured metadata? This matters because deployed systems face visual screenshots, not oracle data.
Vibe coding claims to keep developers steering and validating, but do novices actually engage with code and testing the way the tool design assumes? The gap between intended and actual behavior could compound failures.
When novices use AI coding tools, do they engage with the code itself, or do they primarily test the prototype? Understanding where students focus reveals how AI-assisted coding shapes learning behavior.
The arXiv papers behind this sub-topic. Links may take you off-site to arxiv.org.