In a recent submission to the arXiv* server, researchers outlined the challenges large language models (LLMs) face when applied to general-purpose software systems such as operating systems. They identified three primary obstacles: managing a vast and dynamic action space, coordinating inter-application tasks, and aligning solutions with user constraints. To tackle these issues, they introduced AndroidArena, an environment and benchmark for evaluating LLM agents on a modern operating system.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
The authors constructed the benchmark with a scalable, semi-automated approach to keep human labeling costs low. AndroidArena employs precise metrics to evaluate LLM agents' performance, revealing shortcomings in cross-application scenarios and in adhering to specific constraints. The study pinpointed four key capabilities as crucial for LLM success: understanding, reasoning, exploration, and reflection. The empirical analysis highlighted a deficiency in reflection, and the authors proposed an exploration strategy that boosted success rates significantly. This research sheds light on LLM weaknesses and lays a foundation for future investigations in the field.
Related Work
Past work has demonstrated the potential of LLMs in understanding human intent and reasoning, leading to their utilization as intelligent agents in various domains. However, applying LLMs to complex software systems like operating systems presents unique challenges, including vast and dynamic action spaces, cross-application collaboration demands, and considerations for user preferences and security concerns. Additionally, ensuring LLM agents maintain up-to-date understanding and deliver accurate responses remains a significant concern in real-world scenarios.
Methodology: A Dynamic Cross-APP Environment
The AndroidArena environment is characterized by its vast and dynamic action space, which supports cross-application (cross-APP) interactions and constrained task execution. To automate mobile tasks within this environment, the researchers define a contextual Markov decision process (CMDP) in which the agent interprets user instructions and performs actions on the phone accordingly. Built on the user interface automator (UIAutomator), the implementation provides flexible configurations for rendering APP page content and focuses exclusively on the text modality for LLM agents. The researchers compressed the textual XML description of the phone screen to address context-length limitations, enabling the agent to comprehend UI layouts via text.
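As a rough illustration of this text-rendering step (not the paper's actual compression scheme), the following Python sketch prunes a UIAutomator XML dump down to its text-bearing and clickable nodes so that a screen description fits within an LLM's context window; the pruning rules and output format are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def compress_ui_dump(xml_text: str) -> str:
    """Reduce a UIAutomator XML screen dump to a compact text outline.

    Illustrative sketch only: keeps nodes that carry visible text or are
    clickable, and emits one indented line per retained node.
    """
    root = ET.fromstring(xml_text)
    lines = []

    def visit(node, depth=0):
        text = node.get("text", "").strip()
        desc = node.get("content-desc", "").strip()
        clickable = node.get("clickable") == "true"
        if text or desc or clickable:
            cls = node.get("class", "node").split(".")[-1]  # short widget class name
            label = text or desc or "<unlabeled>"
            bounds = node.get("bounds", "")
            lines.append(f"{'  ' * depth}{cls} '{label}' {bounds}"
                         + (" [clickable]" if clickable else ""))
        for child in node:
            visit(child, depth + 1)

    visit(root)
    return "\n".join(lines)
```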
The action space within AndroidArena is expansive and constantly evolving, reflecting the variability of UI components across multiple APPs. Researchers categorized actions into four groups: APP-level, component-level, system-level, and task-level actions, each addressing different aspects of phone operation.
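To make that taxonomy concrete, the minimal sketch below shows one way such an action space could be represented in code; the four category names come from the article, while the fields and example actions are illustrative assumptions rather than the benchmark's actual interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionCategory(Enum):
    APP = "APP-level"              # e.g., launch or switch applications
    COMPONENT = "component-level"  # e.g., tap, scroll, or type into a UI widget
    SYSTEM = "system-level"        # e.g., press back/home or adjust system state
    TASK = "task-level"            # e.g., declare the task finished or infeasible


@dataclass
class Action:
    category: ActionCategory
    name: str                       # operation identifier, e.g., "tap", "launch"
    target: Optional[str] = None    # component id, APP package name, etc.
    argument: Optional[str] = None  # free-form text for typing actions


# Hypothetical examples, one per category
examples = [
    Action(ActionCategory.APP, "launch", target="com.android.settings"),
    Action(ActionCategory.COMPONENT, "type", target="search_box", argument="Wi-Fi"),
    Action(ActionCategory.SYSTEM, "press_back"),
    Action(ActionCategory.TASK, "finish"),
]
```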
To ensure scalability and realism in task generation, the researchers introduce the mobile task generator (MTG), which constructs tasks covering single-APP, cross-APP, and constrained scenarios. Leveraging insights from human discussions and experiences, they formulate queries to extract diverse APP functionalities. These functionalities are used to generate initial task instructions, which are further expanded and refined through iterative evolution strategies, as sketched below.
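The sketch below illustrates what such a generate-then-evolve loop might look like; `query_llm` is a hypothetical stand-in for whatever model API is used, and the prompt wording is invented for illustration rather than taken from the paper.

```python
from typing import Callable, List


def generate_tasks(
    query_llm: Callable[[str], str],   # hypothetical LLM call, returns text
    app_functionalities: List[str],
    rounds: int = 2,
) -> List[str]:
    """Sketch of a generate-then-evolve pipeline for mobile task instructions."""
    # 1. Seed instructions: one plain task per extracted APP functionality.
    tasks = [
        query_llm(f"Write a concrete user instruction that exercises: {func}")
        for func in app_functionalities
    ]
    # 2. Iterative evolution: rewrite each task to be harder or to span APPs.
    for _ in range(rounds):
        tasks = [
            query_llm(
                "Rewrite this mobile task so it is more specific or requires "
                f"coordinating more than one APP:\n{task}"
            )
            for task in tasks
        ]
    return tasks  # candidates still need human verification and annotation
```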
Proficient annotators verify and annotate the generated tasks, ensuring accuracy and feasibility. The researchers also introduced constrained tasks to assess the agent's ability to handle user-defined constraints, categorized as APP-level, page-level, or component-level. These tasks are selected from the single-APP task set and manually labeled with natural-language constraints, providing a comprehensive evaluation of the agent's decision-making within constrained environments.
LLM Agent Evaluation Overview
In evaluating LLM agents within the Android environment, precise metrics are crucial for comprehensively understanding their performance. Existing metrics for multi-step decision-making scenarios often lack precision, hindering a full assessment of LLM agent capabilities. The researchers proposed a novel set of metrics to overcome these limitations, focusing on adaptive and precise evaluation of task completion. These metrics operate on action sequences and introduce the task reward (TR), task completion ratio (TCR), reversed redundancy ratio (RRR), and success rate (SR) to provide a more nuanced evaluation.
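The article names these metrics without giving their formulas, so the snippet below uses simplified, assumed definitions purely to convey the intuition: TCR as the fraction of required key states reached, RRR as the reference path length relative to the agent's path (closer to one means fewer redundant steps), and SR as the fraction of fully completed episodes. The paper's exact formulations may differ.

```python
from typing import List, Sequence, Set


def task_completion_ratio(reached: Set[str], required: Set[str]) -> float:
    """Assumed definition: fraction of required key states the agent reached."""
    return len(reached & required) / len(required) if required else 1.0


def reversed_redundancy_ratio(reference_path: Sequence[str],
                              agent_path: Sequence[str]) -> float:
    """Assumed definition: shortest reference path length over the agent's
    path length, so fewer redundant steps yields a value closer to 1."""
    return min(1.0, len(reference_path) / max(len(agent_path), 1))


def success_rate(episode_tcrs: List[float]) -> float:
    """Fraction of episodes in which every required key state was reached."""
    if not episode_tcrs:
        return 0.0
    return sum(1 for tcr in episode_tcrs if tcr >= 1.0) / len(episode_tcrs)
```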
Additionally, researchers investigate the root causes contributing to the success or failure of LLM agents' planning abilities. Drawing from reinforcement learning (RL) principles, the study identifies four key dimensions—understanding, reasoning, exploration, and reflection—to assess agent capabilities. These dimensions are essential for comprehensively evaluating the LLM agents' performance in complex decision-making scenarios within the Android environment.
Experimental findings highlight significant performance gaps among state-of-the-art (SOTA) agents across various task types, including single-APP, cross-APP, and constrained tasks. Notably, while some agents demonstrate proficiency in single-APP tasks, their performance diminishes in cross-APP scenarios, indicating the complexity of multi-application interactions. Moreover, researchers identified deficiencies in handling constraints and topological logic, underscoring the need for further advancements in LLM agent capabilities.
Finally, a fine-grained analysis reveals weaknesses in LLM agents, such as difficulty adhering to action rules, identifying correct actions, and exploring alternative pathways. While reflection mechanisms and exploration strategies show promise in improving agent performance, there remains room to strengthen understanding, reasoning, and exploration abilities to address the identified deficiencies effectively.
Conclusion
In conclusion, this study introduced the AndroidArena environment and a scalable benchmark, facilitating the evaluation of cross-APP and constrained task scenarios. Researchers proposed adaptive and precise metrics to assess task completion and fine-grained agent abilities, revealing significant areas for improvement among SOTA agents.
Four research directions were outlined to enhance LLM agents, alongside empirical insights into the limitations of reflection mechanisms and a novel method to improve exploration capabilities. The researchers plan future investigations into the weaknesses of multi-modal agents, leveraging the versatility of the AndroidArena platform for multi-modal evaluation.
Journal reference:
- Preliminary scientific report. Xing, M., et al. (2024). Understanding the Weakness of Large Language Model Agents within a Complex Android Environment. arXiv. https://arxiv.org/abs/2402.06596