OmniParser for Pure Vision Based GUI Agent

Paper · arXiv 2408.00203 · Published August 1, 2024
Visual and GUI AgentsTool Use and Computer-Use Agents

The recent success of large vision language models shows great potential in driving the agent system operating on user interfaces. However, we argue that the power multimodal models like GPT-4V as a general agent on multiple operating systems across different applications is largely underestimated due to the lack of a robust screen parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associate the intended action with the corresponding region on the screen. To fill these gaps, we introduce OMNIPARSER, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. We first curated an interactable icon detection dataset using popular webpages and an icon description dataset. These datasets were utilized to fine-tune specialized models: a detection model to parse interactable regions on the screen and a caption model to extract the functional semantics of the detected elements. OMNIPARSER significantly improves GPT-4V’s performance on ScreenSpot benchmark.

Introduction. Large language models have shown great success in their understanding and reasoning capabilities. More recent works have explored the use of large vision-language models (VLMs) as agents to perform complex tasks on the user interface (UI) with the aim of completing tedious tasks to replace human efforts [YZL+23, YYZ+23, DGZ+23, ZGK+24, HWL+23, YZS+24, WXJ+24, GFH+24, CSC+24]. Despite the promising results, there remains a significant gap between current state-of-thearts and creating widely usable agents that can work across multiple platforms, e.g. Windows/MacOS, IOS/Android and multiple applications (Web broswer Office365, PhotoShop, Adobe), with most previous work focusing on limiting applications or platforms. While large multimodal models like GPT-4V and other models trained on UI data [HWL+23, YZS+24, CSC+24] have demonstrated abilities to understand basic elements of the UI screenshot, action grounding remains one of the key challenges in converting the actions predicted by LLMs to the actual actions on screen in terms of keyboard/mouse movement or API call [ZGK+24].

Discussion / Conclusion. Repeated Icons/Texts From analysis of the the GPT-4V’s response log, we found that GPT-4V often fails to make the correct prediction when the results of the OMNIPARSER contains multiple repeated icons/texts, which will lead to failure if the user task requires clicking on one of the buttons. This is illustrated by the figure 7 (Left) in the Appendix. A potential solution to this is to add finer grain descriptions to the repeated elements in the UI screenshot, so that the GPT-4V is aware of the existence of repeated elements and take it into account when predicting next action. Corase Prediction of Bounding Boxes One common failure case of OMNIPARSER is that it fails to detect the bounding boxes with correct granularity. In figure 7 (Right), the task is to click on the text ’MORE’. The OCR module of OMNIPARSER detects text bounding box 8 which encompass the desired text.