Meta’s PARTNR Benchmark Redefines Human-Robot Collaboration

With 100,000 diverse tasks, PARTNR challenges AI models to tackle real-world scenarios, pushing the boundaries of robot collaboration and efficiency in everyday environments.

PARTNR is a benchmark for planning and reasoning in embodied multi-agent tasks, featuring 100,000 everyday tasks and semi-automatically generated evaluation functions, spanning 60 houses and 5,819 unique objects. The authors analyze LLM-based planning agents and provide a human-in-the-loop tool to evaluate how agents collaborate with real humans.

In an article recently posted to the Meta Research website, researchers introduced a new AI benchmark called PARTNR, designed to assess human-robot collaboration on household tasks. The PARTNR benchmark consists of over 100,000 tasks across various homes and objects, focusing on key collaboration areas such as planning, perception, and execution.

Using large language models (LLMs) in simulated environments, the study highlighted critical limitations in current models for coordination and error recovery. However, experiments with fine-tuning smaller LLMs showed promise, offering performance levels that rival larger models while reducing operational costs. PARTNR is designed to push the boundaries of collaborative robot research in everyday environments.

Background

Collaborative human-robot tasks in household environments demand sophisticated interaction and coordinated planning. Traditional embodied AI benchmarks often lack essential elements for natural human-robot collaboration, either restricting robots to isolated tasks or avoiding natural language task instructions. This gap limits their applicability in evaluating realistic, dynamic multi-agent interactions.

This study introduced PARTNR, a large-scale natural language benchmark emphasizing complex human-robot collaboration in household tasks, to address these gaps. The PARTNR dataset spans 100,000 unique tasks across varied settings and object interactions, encompassing four task types: constraint-free, spatial, temporal, and heterogeneous (tasks whose actions may exceed a single agent's capabilities). Unlike previous benchmarks, PARTNR prioritizes task coordination, tracking partner actions, and adapting to multi-agent scenarios. The tasks were generated and validated using a combination of LLMs and realistic simulations.

The benchmark exposed key limitations in LLM-based models, especially in continuous task tracking and error recovery. PARTNR aims to drive improvements in embodied AI by testing models in complex, long-duration tasks that closely mirror real-world scenarios.

Benchmark Generation

PARTNR was designed to evaluate robots' abilities to interpret and complete tasks specified in natural language alongside humans. It features four task categories: constraint-free tasks, where sub-tasks can be completed in any order; spatial tasks, requiring precise spatial reasoning; temporal tasks, where the sequence of sub-tasks matters; and heterogeneous tasks, involving actions that may be beyond a single agent's capabilities.
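
To make these categories concrete, here is a minimal sketch of how a PARTNR-style task specification could be represented in code. The class and field names are illustrative assumptions, not Meta's released data format.

```python
# Hypothetical representation of a PARTNR-style task; names are
# illustrative, not taken from Meta's released code.
from dataclasses import dataclass, field
from enum import Enum, auto

class TaskType(Enum):
    CONSTRAINT_FREE = auto()  # sub-tasks can be completed in any order
    SPATIAL = auto()          # requires precise spatial reasoning
    TEMPORAL = auto()         # the order of sub-tasks matters
    HETEROGENEOUS = auto()    # some actions exceed a single agent's abilities

@dataclass
class TaskSpec:
    instruction: str          # natural language command given to the team
    task_type: TaskType
    sub_tasks: list = field(default_factory=list)

example = TaskSpec(
    instruction="Put the mug on the table, then wash the plate in the sink.",
    task_type=TaskType.TEMPORAL,
    sub_tasks=["place mug on table", "wash plate in sink"],
)
print(example.task_type.name)  # TEMPORAL
```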

The benchmark included an initial set of 1,000 human-verified instructions, which were expanded to 100,000 tasks using LLMs and advanced simulated environments. By leveraging Habitat 3.0 and the Habitat Synthetic Scenes Dataset (HSSD), PARTNR’s task generation process situates tasks within realistic environments, enhancing their practical relevance and reducing common LLM output issues.

Simulation-in-the-loop generation reduced errors like hallucinations and non-viable actions in LLM outputs. A curated set of human-verified instructions served as a base to scale tasks, maintaining a rich diversity aligned with real-world environments. The system included automated evaluation functions that verified task completion without manual input, checking whether agents satisfied key constraints such as task ordering and required propositions.
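
As a rough illustration of that idea, the sketch below checks two of the constraint types just mentioned: that every required proposition was eventually satisfied, and that ordering constraints between propositions hold. The data structures are assumptions for illustration, not PARTNR's actual evaluation API.

```python
# Minimal sketch of an automated evaluation check in the spirit the article
# describes; the data structures are assumptions for illustration, not
# PARTNR's actual evaluation API.

def evaluate_episode(satisfied_steps, required_props, ordering_constraints):
    """satisfied_steps maps each proposition to the simulation step at which
    it first became true; propositions never satisfied are absent."""
    # Every required proposition must be satisfied at some point.
    if any(prop not in satisfied_steps for prop in required_props):
        return False
    # Each (earlier, later) pair must be satisfied in the required order.
    for earlier, later in ordering_constraints:
        if satisfied_steps[earlier] >= satisfied_steps[later]:
            return False
    return True

# Example: the mug must reach the table before the plate is washed.
steps = {"mug_on_table": 12, "plate_washed": 40}
print(evaluate_episode(
    steps,
    required_props=["mug_on_table", "plate_washed"],
    ordering_constraints=[("mug_on_table", "plate_washed")],
))  # True
```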

The final dataset, comprising 100,000 highly diverse tasks across varied environments, provides a robust foundation for assessing collaborative robotics in tasks requiring advanced object manipulation, spatial reasoning, and complex task sequencing. PARTNR thus establishes a benchmark for scalable, realistic task instruction generation and evaluation, marking significant progress in human-robot collaboration.

Experiment and Analysis

The authors investigated state-of-the-art LLMs for their effectiveness in planning and facilitating human-robot collaboration using the PARTNR benchmark. PARTNR tasks were carried out in the Habitat 3.0 simulation, where both a robot and a human agent operated under a decentralized, two-tiered control structure: high-level LLM planners selected skills such as navigating and manipulating objects, while low-level controllers executed the finer-grained actions.
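
As a simplified sketch of that two-tier loop, a high-level planner repeatedly selects a parameterized skill and a low-level routine executes it. Every name here (query_llm_planner, execute_skill, DummyEnv) is a hypothetical stand-in, not part of Habitat's or Meta's API.

```python
# Simplified sketch of a decentralized, two-tier control loop.
# All names are hypothetical stand-ins for illustration.

def query_llm_planner(instruction, observation, history):
    # Placeholder for an LLM call that maps the task instruction, current
    # observation, and action history to the next parameterized skill.
    if not history:
        return "navigate", {"target": "kitchen"}
    return "done", {}

def execute_skill(env, skill, args):
    # Placeholder for a learned or oracle low-level skill controller.
    return env.step(skill, args)

class DummyEnv:
    def observe(self):
        return {"agent_pos": (0, 0)}
    def step(self, skill, args):
        return {"success": True, "skill": skill, **args}

def run_agent(instruction, env, max_steps=50):
    history = []  # outcomes are fed back so the planner can adapt or replan
    for _ in range(max_steps):
        skill, args = query_llm_planner(instruction, env.observe(), history)
        if skill == "done":
            break
        history.append((skill, args, execute_skill(env, skill, args)))
    return history

print(run_agent("Put the mug on the table.", DummyEnv()))
```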

The researchers evaluated several planning approaches, including zero-shot, retrieval-augmented generation (RAG), and fine-tuned LLMs. Notably, the fine-tuned 8B model performed comparably to a 70B model but was 8.6 times faster, suggesting practicality for real-world deployment. Key experimental conditions varied in dimensions such as centralized and decentralized planning, observability levels, and the use of either privileged or learned low-level skills.
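
To illustrate the retrieval-augmented variant, the toy sketch below fetches the most similar previously solved tasks and prepends them to the planner prompt as in-context examples. The word-overlap similarity is a stand-in for the learned retrieval a real system would use; none of this reflects Meta's actual implementation.

```python
# Toy sketch of retrieval-augmented prompt construction for a planner.
# The Jaccard word-overlap similarity is a stand-in for learned retrieval.

def similarity(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))  # Jaccard word overlap

def build_rag_prompt(instruction, solved_examples, k=2):
    # solved_examples: (instruction, successful plan trace) pairs
    ranked = sorted(solved_examples,
                    key=lambda ex: similarity(instruction, ex[0]),
                    reverse=True)
    shots = "\n\n".join(f"Task: {t}\nPlan: {p}" for t, p in ranked[:k])
    return f"{shots}\n\nTask: {instruction}\nPlan:"

examples = [
    ("Put the mug on the table", "navigate(kitchen); pick(mug); place(table)"),
    ("Water the plants on the porch", "navigate(porch); pick(can); pour(plants)"),
]
print(build_rag_prompt("Put the plate on the table", examples, k=1))
```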

Results indicated that while LLMs demonstrated potential in human-robot tasks, they struggled to coordinate with partners, recover from skill failures, and handle perception errors, all of which lowered task success rates. Decentralized LLM-based planners were notably inefficient, taking longer to complete tasks and performing more extraneous actions than centralized approaches.

Human-in-the-loop evaluations revealed that humans significantly outperformed LLMs in PARTNR tasks, achieving a 0.93 success rate compared to only 0.30 by LLMs alone. Nevertheless, fine-tuned LLMs collaborating with humans achieved higher success rates, effectively offloading up to 26% of tasks from human partners. Despite these gains, LLMs still lagged in efficient task coordination, highlighting the need for further advancements in LLM abilities for real-world, collaborative applications.

Conclusion

In conclusion, the researchers introduced PARTNR, a benchmark intended to elevate human-robot collaboration in household tasks, incorporating over 100,000 diverse tasks. They focused on three core aspects (planning, perception, and execution), using LLMs in simulation to identify the limitations of current models, especially around coordination and error recovery.

While fine-tuning smaller LLMs yielded comparable performance to larger models, the results strongly indicated that humans outperformed current LLM approaches by a substantial margin, achieving a success rate of 93% compared to just 30% for LLMs. Ultimately, PARTNR aims to propel collaborative robotics research, underscoring the need for improved LLM capabilities to support effective human-robot interactions in complex, real-world scenarios.

Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine Learning. He has extensive experience in Data Analytics, Machine Learning, and Python, and has worked on group projects involving Computer Vision, Image Classification, and App Development.
