Evaluating GUI Assistants with Video-Based Benchmarks

In a recent paper posted to the arXiv* server, researchers introduced VideoGUI, an innovative benchmark designed to evaluate graphical user interface (GUI) assistants on complex, visual-centric tasks derived from high-quality web instructional videos. Moreover, they discussed evaluation metrics and presented results from state-of-the-art models assessed on the VideoGUI benchmark.

Study: Evaluating GUI Assistants with Video-Based Benchmarks. Image Credit: Summit Art Creations/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.

Background

GUI automation involves controlling computer programs or systems through their graphical interfaces by performing actions such as clicking buttons, typing text, or dragging elements. This automation serves various purposes, including testing, accessibility, scripting, and personalization. However, developing and evaluating GUI automation systems is challenging, as it requires understanding complex user intents, reasoning about diverse UI elements, planning appropriate actions, and executing precise operations.
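To make the notion of atomic GUI actions concrete, the short sketch below uses the open-source pyautogui library (not part of the study) to script clicks, typing, dragging, and scrolling; the coordinates and text are illustrative placeholders.

```python
# Minimal sketch of scripted atomic GUI actions using the pyautogui library.
# Coordinates and text are placeholders, not values from the study.
import pyautogui

pyautogui.FAILSAFE = True                # abort by slamming the mouse into a screen corner

pyautogui.click(x=640, y=360)            # click a button at an (x, y) screen position
pyautogui.write("quarterly report", interval=0.05)   # type text into the focused field
pyautogui.moveTo(300, 400)               # move to the element to be dragged
pyautogui.dragTo(800, 400, duration=1.0) # drag it horizontally over one second
pyautogui.scroll(-500)                   # scroll the active window downward
```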

Existing benchmarks like Mind2Web, PixelHelp, and OSWorld primarily focus on simple tasks that can be specified with a single textual command, such as "Open a new tab" or "Search for a product." These tasks often fail to capture real-world scenarios where users may struggle with novel and advanced tasks, such as "Create a special animation effect in PowerPoint" or "Edit a video clip in Premiere Pro." Moreover, these advanced tasks rely heavily on visual signals, necessitating users to follow instructional videos or demonstrations to learn and replicate desired effects.

About the Research

In this study, the authors proposed VideoGUI as a new multi-modal benchmark for GUI automation, emphasizing complex and visual-centric tasks that often require users to replicate lengthy operations and achieve specific goals through instructional videos. This benchmark included 178 tasks across 11 software applications categorized into media creation, media editing, and media browsing.

The tasks were derived from instructional videos, showcasing practical and novel uses of software such as Adobe Photoshop, Premiere Pro, After Effects, PowerPoint, Runway, and Stable Diffusion. Activities encompassed video editing, image manipulation, animation effects, and visual creation.

VideoGUI provides high-quality annotations collected by having participants reproduce the instructional videos, yielding labels that range from procedural plans down to atomic actions with element locations. It encompasses 86 complex tasks (full tasks) and 92 simple tasks (subtasks) that do not require high-level planning. The 86 full tasks can be further divided into 371 subtasks, bringing the total to 463 subtasks. In all, the benchmark contains roughly 2.7K manually annotated atomic actions.
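For illustration, a VideoGUI-style annotation could be organized as nested plans, action sequences, and element locations, as in the hypothetical sketch below; the field names and values are assumptions rather than the benchmark's actual schema.

```python
# Hypothetical sketch of how a VideoGUI-style annotation might be structured.
# Field names and values are illustrative assumptions, not the benchmark's schema.
full_task = {
    "software": "PowerPoint",
    "goal_video": "animated_title_slide.mp4",   # visual goal shown to the model
    "subtasks": [                               # high-level procedural plan
        {
            "description": "Insert a text box with the title",
            "actions": [                        # mid-level action sequence
                {"type": "click", "target": "Insert tab", "bbox": [120, 40, 180, 70]},
                {"type": "click", "target": "Text Box",   "bbox": [310, 90, 380, 130]},
                {"type": "drag",  "from": [400, 300], "to": [900, 380]},
                {"type": "type",  "text": "Quarterly Results"},
            ],
        },
        # ... further subtasks reproduced from the instructional video
    ],
}

# Count the atomic actions annotated for this full task
total_atomic_actions = sum(len(st["actions"]) for st in full_task["subtasks"])
```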

Furthermore, the researchers proposed a hierarchical evaluation suite for VideoGUI. At the high-level planning stage, models must reconstruct the procedural subtasks from visual signals alone, without any language description. The middle-level planning stage focuses on completing a given subtask by producing a sequence of precise action descriptions from the current visual state and a textual query.

Lastly, atomic action execution involves performing target actions such as clicking, dragging, typing, or scrolling. For each level, the study defined assessment metrics along individual dimensions, including distance, recall, precision, and accuracy, to evaluate model performance.
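As an illustration of how such level-wise metrics could be computed, the sketch below scores a predicted plan with precision and recall, and an atomic click with a hit test and a distance measure; the matching rules are simplifying assumptions, not the paper's official scoring code.

```python
# Illustrative level-wise metrics in the spirit of the paper's evaluation suite;
# the exact matching and scoring rules here are simplifying assumptions.
import math

def plan_precision_recall(predicted_steps, reference_steps):
    """Treat planning evaluation as matching predicted step labels against references."""
    matched = sum(1 for step in predicted_steps if step in reference_steps)
    precision = matched / len(predicted_steps) if predicted_steps else 0.0
    recall = matched / len(reference_steps) if reference_steps else 0.0
    return precision, recall

def click_accuracy(pred_xy, gold_bbox):
    """An atomic click counts as correct if it lands inside the target element's box."""
    x, y = pred_xy
    x1, y1, x2, y2 = gold_bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def click_distance(pred_xy, gold_xy):
    """Euclidean distance between predicted and annotated click positions."""
    return math.dist(pred_xy, gold_xy)

# Example usage with made-up values
p, r = plan_precision_recall(["open Insert tab", "add text box"],
                             ["open Insert tab", "add text box", "type title"])
print(p, r)                                             # 1.0 and roughly 0.67
print(click_accuracy((150, 55), (120, 40, 180, 70)))    # True
print(round(click_distance((150, 55), (148, 60)), 2))   # 5.39
```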

Moreover, extensive experiments were conducted on VideoGUI using leading multi-modal large language models (MLLMs) such as generative pre-trained transformer 4 omni (GPT-4o), Claude-3-Opus, Gemini-Pro-V, and Qwen-VL-Max, alongside text-only large language models (LLMs) like GPT-3.5-Turbo, Mixtral-8x22B, and Llama-3-70B. The authors also evaluated a modular approach that pairs an LLM planner with a dedicated visual expert such as CogAgent. Results were reported for high-level planning, middle-level planning, atomic action execution, and an overall score integrating these aspects.
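The snippet below is a minimal sketch of how a multimodal model might be queried for middle-level planning from a screenshot; it assumes the OpenAI Python client, and the prompt wording and file name are illustrative, not the paper's actual evaluation harness.

```python
# Hedged sketch: querying a multimodal model (here GPT-4o via the OpenAI Python
# client) for middle-level planning from a screenshot. The prompt wording and
# file name are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("current_screenshot.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Given this PowerPoint screenshot, list the precise actions "
                     "(click/type/drag/scroll) needed to add a fade-in animation "
                     "to the title text box."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```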

Research Findings

The results indicated that even the top-performing model, GPT-4o, failed to complete a single full task in VideoGUI, achieving an overall score of 39.4 out of 100. Surprisingly, the main bottleneck was planning rather than action execution, even though GPT-4o is not particularly noted for its grounding abilities.

The authors noted that planning from textual queries was consistently easier than planning from visuals across all models evaluated, underscoring the challenge of visual-centric GUI tasks. They further analyzed model performance across various software applications and action categories, identifying strengths and weaknesses. Additionally, they demonstrated that augmenting multimodal models with tools such as optical character recognition (OCR) or Set-of-Mark (SoM) prompting could notably enhance action execution capabilities.
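As an illustration of OCR-assisted grounding, the hypothetical sketch below uses pytesseract to locate a UI element by its visible label and return a click point; it is a simple stand-in, not the tool integration used in the paper, and the target label is an assumption.

```python
# Hedged sketch of OCR-assisted grounding: locate a UI element by its visible text
# and return a click point. Uses pytesseract and Pillow; the target label below is
# illustrative, not taken from the study.
import pytesseract
from PIL import Image

def locate_by_text(screenshot_path, target_text):
    """Return the center (x, y) of the first OCR word matching target_text, or None."""
    image = Image.open(screenshot_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    for i, word in enumerate(data["text"]):
        if word.strip().lower() == target_text.lower():
            x = data["left"][i] + data["width"][i] // 2
            y = data["top"][i] + data["height"][i] // 2
            return x, y
    return None

# Example: find the "Export" button on a screenshot and hand the point to an agent
# point = locate_by_text("screenshot.png", "Export")
```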

Applications

The paper highlighted VideoGUI as a pivotal resource for advancing GUI automation research and applications. It effectively identifies current limitations in existing models and systems, focusing on challenges like visual reasoning, procedural planning, and action execution.

Moreover, VideoGUI inspires the development of innovative models and methods harnessing multi-modal inputs and outputs, including instructional videos, screenshots, and UI elements. It is expected to facilitate the creation of realistic and interactive GUI assistants supporting users in tasks such as media creation, editing, and browsing, thereby boosting productivity and fostering creativity in interactive environments.

Conclusion

In summary, VideoGUI demonstrated its effectiveness in advancing GUI automation. It offered high-quality annotations derived from human demonstrations and comprehensive evaluation metrics across multiple levels and dimensions. Additionally, it underscored the notable gap between current state-of-the-art models and the desired performance level, highlighting both challenges and opportunities for future research and applications in this field.


Written by

Muhammad Osama

Muhammad Osama is a full-time data analytics consultant and freelance technical writer based in Delhi, India. He specializes in transforming complex technical concepts into accessible content. He has a Bachelor of Technology in Mechanical Engineering with specialization in AI & Robotics from Galgotias University, India, and he has extensive experience in technical content writing, data science and analytics, and artificial intelligence.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Osama, Muhammad. (2024, June 18). Evaluating GUI Assistants with Video-Based Benchmarks. AZoAi. Retrieved on July 01, 2024 from https://www.azoai.com/news/20240618/Evaluating-GUI-Assistants-with-Video-Based-Benchmarks.aspx.

  • MLA

    Osama, Muhammad. "Evaluating GUI Assistants with Video-Based Benchmarks". AZoAi. 01 July 2024. <https://www.azoai.com/news/20240618/Evaluating-GUI-Assistants-with-Video-Based-Benchmarks.aspx>.

  • Chicago

    Osama, Muhammad. "Evaluating GUI Assistants with Video-Based Benchmarks". AZoAi. https://www.azoai.com/news/20240618/Evaluating-GUI-Assistants-with-Video-Based-Benchmarks.aspx. (accessed July 01, 2024).

  • Harvard

    Osama, Muhammad. 2024. Evaluating GUI Assistants with Video-Based Benchmarks. AZoAi, viewed 01 July 2024, https://www.azoai.com/news/20240618/Evaluating-GUI-Assistants-with-Video-Based-Benchmarks.aspx.
