In a recent paper posted to the arXiv* server, researchers introduced VideoGUI, an innovative benchmark designed to evaluate graphical user interface (GUI) assistants on complex, visual-centric tasks derived from high-quality web instructional videos. Moreover, they discussed evaluation metrics and presented results from state-of-the-art models assessed on the VideoGUI benchmark.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
GUI automation involves controlling computer programs or systems through their graphical interfaces by performing actions such as clicking buttons, typing text, or dragging elements. This automation serves various purposes, including testing, accessibility, scripting, and personalization. However, developing and evaluating GUI automation systems is challenging, as it requires understanding complex user intents, reasoning about diverse UI elements, planning appropriate actions, and executing precise operations.
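As a rough illustration, the sketch below shows what such atomic operations look like when scripted with the widely used pyautogui library; the coordinates and text are placeholders rather than values from the study.

```python
# A minimal sketch of scripted GUI control using the pyautogui library.
# Coordinates and text below are illustrative placeholders only.
import pyautogui

pyautogui.click(x=120, y=240)                        # click a button at given screen coordinates
pyautogui.write("quarterly report", interval=0.05)   # type text into the currently focused field
pyautogui.dragTo(640, 480, duration=0.5)             # drag the selected element to a new position
pyautogui.scroll(-300)                               # scroll down within the active window
```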
Existing benchmarks like Mind2Web, PixelHelp, and OSWorld primarily focus on simple tasks that can be specified with a single textual command, such as "Open a new tab" or "Search for a product." These tasks often fail to capture real-world scenarios in which users struggle with novel, advanced tasks, such as "Create a special animation effect in PowerPoint" or "Edit a video clip in Premiere Pro." Such advanced tasks also rely heavily on visual signals, requiring users to follow instructional videos or demonstrations to learn and replicate the desired effects.
About the Research
In this study, the authors proposed VideoGUI as a new multi-modal benchmark for GUI automation, emphasizing complex and visual-centric tasks that often require users to replicate lengthy operations and achieve specific goals through instructional videos. This benchmark included 178 tasks across 11 software applications categorized into media creation, media editing, and media browsing.
The tasks were derived from instructional videos, showcasing practical and novel uses of software such as Adobe Photoshop, Premiere Pro, After Effects, PowerPoint, Runway, and Stable Diffusion. Activities encompassed video editing, image manipulation, animation effects, and visual creation.
VideoGUI provides high-quality annotations by having participants reproduce instructional videos, capturing labels ranging from procedural planning to atomic actions with element locations. It encompasses 86 complex tasks (full tasks) and 92 simple tasks (subtasks) that do not require high-level planning. These 86 full tasks can be further divided into 371 subtasks, totaling 463 subtasks. Overall, the benchmark collects 2.7K atomic manual actions.
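To make this structure concrete, the hypothetical record below sketches how a single full task might be organized, from the procedural plan down to atomic actions with element locations; the field names and values are illustrative assumptions rather than the benchmark's released schema.

```python
# Hypothetical sketch of how one VideoGUI-style annotation might be organized.
# Field names and values are illustrative assumptions, not the released schema.
task = {
    "software": "PowerPoint",
    "category": "media creation",
    "full_task": "Create a zoom-in animation on the title slide",
    "subtasks": [
        {
            "description": "Add a Grow/Shrink animation to the title text box",
            "atomic_actions": [
                {"type": "click", "target": "Animations tab", "bbox": [410, 20, 520, 48]},
                {"type": "click", "target": "Grow/Shrink effect", "bbox": [700, 80, 780, 140]},
                {"type": "type", "text": "150%"},
            ],
        },
    ],
}
```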
Furthermore, the researchers proposed an evaluation suite for VideoGUI built around a hierarchical process. The high-level planning stage requires reconstructing the procedural subtasks from visual signals alone, without any language description. The middle-level planning stage focuses on detailing the steps needed to complete a subtask as a sequence of precise action descriptions, based on the current visual state and a textual query. Lastly, atomic action execution involves performing target actions such as clicking, dragging, typing, or scrolling. For each level, the study designed tailored assessment metrics, including distance, recall, precision, and accuracy, to evaluate model performance.
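A simplified sketch of how such metrics could be computed is shown below, using set-style precision and recall over predicted versus annotated subtasks, together with distance- and box-based checks on predicted clicks; it is an illustrative approximation, not the paper's scoring code.

```python
# Illustrative sketch of hierarchical evaluation metrics (not the paper's exact scoring code).
import math

def precision_recall(predicted: list[str], reference: list[str]) -> tuple[float, float]:
    """Set-style precision/recall over predicted vs. annotated subtask descriptions."""
    matched = len(set(predicted) & set(reference))
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall

def click_is_correct(pred_xy: tuple[int, int], target_bbox: tuple[int, int, int, int]) -> bool:
    """Accuracy criterion: a predicted click counts as correct if it lands inside the target element's box."""
    x, y = pred_xy
    x1, y1, x2, y2 = target_bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def click_distance(pred_xy: tuple[int, int], target_center: tuple[int, int]) -> float:
    """Distance metric: Euclidean distance between the predicted click and the element center."""
    return math.dist(pred_xy, target_center)
```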
Moreover, extensive experiments were conducted on VideoGUI using leading multi-modal large language models (MLLMs) such as generative pre-trained transformer 4 omni (GPT-4o), Claude-3-Opus, Gemini-Pro-V, and Qwen-VL-Max, alongside text-only large language models (LLMs) like GPT-3.5-Turbo, Mixtral-8x22B, and Llama-3-70B. Additionally, the authors evaluated a modular approach that pairs an LLM planner with a visual expert, CogAgent. Results were reported for high-level planning, middle-level planning, atomic action execution, and an overall score integrating these aspects.
Research Findings
The outcomes indicated that even the top-performing model, GPT-4o, struggled to complete a single full task in VideoGUI, achieving an overall score of only 39.4 out of 100. Notably, the main bottleneck lay in planning rather than action execution, even though GPT-4o is not particularly known for its grounding abilities.
The authors noted that planning from textual queries was consistently easier than planning from visuals across all models evaluated, underscoring the difficulty of visual-centric GUI tasks. They further analyzed model performance across software applications and action categories, identifying strengths and weaknesses. Additionally, they demonstrated that augmenting multimodal models with tools such as optical character recognition (OCR) or Set-of-Mark (SoM) prompting could notably enhance action execution capabilities.
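To illustrate the idea behind SoM prompting, the sketch below overlays numbered marks on pre-detected UI element boxes using Pillow so that a model can refer to elements by index rather than by raw coordinates; the detection step and box values are assumed rather than taken from the paper.

```python
# Minimal Set-of-Mark-style overlay using Pillow: draw numbered boxes over
# (assumed, pre-detected) UI element regions so a model can cite them by index.
from PIL import Image, ImageDraw

def overlay_marks(screenshot_path: str, boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    img = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for idx, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        draw.rectangle((x1, y1, x2, y2), outline="red", width=3)   # box around the element
        draw.text((x1 + 4, y1 + 4), str(idx), fill="red")          # numeric mark the model can reference
    return img

# Example usage with placeholder element boxes (e.g., produced by an OCR or detection tool):
# marked = overlay_marks("screenshot.png", [(100, 50, 220, 90), (300, 400, 520, 460)])
# marked.save("screenshot_marked.png")
```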
Applications
The paper highlighted VideoGUI as a pivotal resource for advancing GUI automation research and applications. It effectively identifies current limitations in existing models and systems, focusing on challenges like visual reasoning, procedural planning, and action execution.
Moreover, VideoGUI inspires the development of innovative models and methods harnessing multi-modal inputs and outputs, including instructional videos, screenshots, and UI elements. It is expected to facilitate the creation of realistic and interactive GUI assistants supporting users in tasks such as media creation, editing, and browsing, thereby boosting productivity and fostering creativity in interactive environments.
Conclusion
In summary, the authors presented VideoGUI as a challenging new benchmark for advancing GUI automation. It offered high-quality annotations derived from human demonstrations and comprehensive evaluation metrics spanning planning and execution at multiple levels. Additionally, it underscored the notable gap between current state-of-the-art models and the desired performance, highlighting both challenges and opportunities for future research and applications in this field.