In an article posted to the arXiv* preprint server, researchers demonstrated the feasibility of a new scalable large language model (LLM)-based task planning framework for robotics.
Background
Recent advancements in LLMs have enabled robots to plan complex strategies for different tasks that require a significant amount of semantic comprehension and background knowledge.
However, LLMs must adhere to constraints present in the physical environment where the robot operates, including the relevant predicates, the effect of actions on the current state, and available affordances, to become efficient planners in robotics.
Additionally, robots must be able to understand their location, identify items of interest, and grasp the topological arrangement of the environment in order to plan across the important regions of expansive environments.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Several studies have investigated the feasibility of grounding LLM-based planner outputs using Planning Domain Definition Language (PDDL) descriptions of a scene, object detectors, and vision-based value functions. However, these efforts were confined to single-room or small-scale environments with pre-encoded information on all objects and assets present in the environment.
Moreover, scaling these models is extremely challenging: as the number of entities and rooms in a scene grows, the complexity and size of the environment representation increase, making it increasingly infeasible to pre-encode all critical information in the LLM context.
Thus, a new scalable approach is necessary to ground the LLM-based task planners in expansive, multi-room, and multi-floor environments.
A new approach for LLM-based large-scale task planning for robots
In the paper, the researchers proposed SayPlan, a new scalable approach to large-scale, LLM-based task planning for robotics that uses three-dimensional scene graph (3DSG) representations. The approach was developed to ground LLM-based task planners across expansive environments spanning several floors and rooms by exploiting the growing body of 3DSG research.
The study addressed the challenge of long-range planning for autonomous agents in an expansive environment based on natural language instructions. Thus, the experiments were designed to assess the 3DSG reasoning capabilities of LLMs on high-level task planning for a mobile manipulator robot.
This long-range planning involved comprehending ambiguous and abstract instructions, understanding the scene, and generating task plans for manipulating and navigating a mobile robot within an environment.
3DSGs capture a rich, hierarchically organized, topological semantic graph representation of an environment and encode the information necessary for task planning, including predicates, attributes, affordances, and object states, in natural language that can be parsed by an LLM. The JavaScript Object Notation (JSON) representation of this graph was used as input to a pre-trained LLM.
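To make this concrete, the snippet below is a minimal sketch, not the authors' actual schema: it builds a toy, JSON-serializable scene graph fragment in Python, where the node names, attributes, states, and affordance labels are invented for illustration. The resulting JSON string is the kind of text that would be placed in the LLM's context.

```python
import json

# Toy, hand-written 3DSG fragment (illustrative only): a floor node contains
# a room node, which in turn contains an asset (fridge) and an object (banana).
# Field names, attributes, states, and affordances are examples, not the
# paper's exact schema.
scene_graph = {
    "nodes": [
        {"id": "ground_floor", "type": "floor"},
        {"id": "kitchen", "type": "room", "location": "ground_floor"},
        {"id": "fridge", "type": "asset", "room": "kitchen",
         "state": "closed", "affordances": ["open", "close"]},
        {"id": "banana", "type": "object", "room": "kitchen",
         "attributes": ["yellow", "ripe"], "affordances": ["pickup"]},
    ],
    "edges": [
        ["ground_floor", "kitchen"],
        ["kitchen", "fridge"],
        ["kitchen", "banana"],
    ],
}

# The JSON string is what would be handed to the LLM as scene context.
print(json.dumps(scene_graph, indent=2))
```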
Additionally, the scalability of the approach was ensured in three ways: reducing the LLM's planning horizon by integrating a classical path planner; introducing an iterative replanning pipeline that refines the initial plan using feedback from a scene graph simulator to correct infeasible actions and avoid planning failures; and exploiting the hierarchical nature of 3DSGs to let the LLM perform a semantic search for task-relevant subgraphs starting from a collapsed, smaller representation of the full graph, as sketched below.
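As a rough illustration of the semantic search step, the Python sketch below assumes a hypothetical collapsed graph (rooms only) and replaces the LLM's node-selection decision with a simple keyword heuristic; it is a conceptual sketch of the expand-until-relevant loop under those assumptions, not the authors' implementation.

```python
# Minimal sketch of semantic search over a collapsed scene graph.
# The planner initially sees only room nodes; rooms are expanded one at a
# time until the task-relevant objects are exposed. llm_pick_room() is a
# keyword-matching stand-in for a real LLM call.

COLLAPSED = {  # room -> child objects (hidden until the room is expanded)
    "kitchen": ["fridge", "banana", "sink"],
    "office": ["printer", "stapler"],
    "bathroom": ["towel", "soap"],
}

def llm_pick_room(task: str, visible_rooms: list[str]) -> str:
    # Stand-in for the LLM: pick the room whose name or contents best match the task.
    for room in visible_rooms:
        if room in task or any(obj in task for obj in COLLAPSED[room]):
            return room
    return visible_rooms[0]

def semantic_search(task: str) -> dict[str, list[str]]:
    """Expand rooms until a task-relevant subgraph is found."""
    expanded: dict[str, list[str]] = {}
    remaining = list(COLLAPSED)
    while remaining:
        room = llm_pick_room(task, remaining)
        expanded[room] = COLLAPSED[room]       # "expand" the chosen node
        remaining.remove(room)
        if any(obj in task for obj in expanded[room]):
            break                              # relevant subgraph found
    return expanded

print(semantic_search("put the banana in the fridge"))
# {'kitchen': ['fridge', 'banana', 'sink']}
```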
The proposed approach was assessed across 90 tasks organized into four difficulty levels, ranging from semantic search tasks to long-horizon, interactive tasks with ambiguous, multi-room objectives that required substantial common-sense reasoning.
These tasks were evaluated in two large-scale environments: a three-story house with 32 rooms and 121 objects, and a large office floor with 36 rooms and 150 interactable objects and assets.
Significance of the study
The findings of this study demonstrated the effectiveness of the approach in grounding long-horizon, large-scale task plans from abstract natural language instructions for execution by a mobile manipulator robot.
SayPlan with GPT-4 achieved 73.3% and 86.7% success in finding the desired subgraph on complex and simple search tasks, respectively. Additionally, the input tokens required to represent the home and office environments were reduced by 60.4% and 82.1%, respectively, owing to the semantic reasoning capabilities of LLMs and the hierarchical nature of 3DSGs, which allowed the agent to explore the scene graph from its highest hierarchical level.
Moreover, SayPlan attained near-perfect executability due to iterative replanning against a scene graph simulator, which ensured that the generated plans adhered to the predicates and constraints imposed by the environment.
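The replanning loop can be pictured roughly as follows; the simulator check, the predicate (a fridge that must be opened before an item is placed inside), and the canned plan correction standing in for the LLM are all illustrative assumptions rather than the paper's actual pipeline.

```python
# Sketch of iterative replanning: a candidate plan is checked against a
# simple simulator that enforces one predicate (the fridge must be open
# before anything is placed inside it). Feedback is fed to a stand-in
# "replanner" until the plan is executable. All names are illustrative.

def simulate(plan: list[str]) -> str | None:
    """Return an error message for the first infeasible action, or None if the plan executes."""
    fridge_open = False
    for action in plan:
        if action == "open(fridge)":
            fridge_open = True
        elif action == "place(banana, fridge)" and not fridge_open:
            return "cannot place(banana, fridge): fridge is closed"
    return None

def replan(plan: list[str], feedback: str) -> list[str]:
    # Stand-in for the LLM replanning step: insert the missing precondition action.
    if "fridge is closed" in feedback:
        idx = plan.index("place(banana, fridge)")
        return plan[:idx] + ["open(fridge)"] + plan[idx:]
    return plan

plan = ["goto(kitchen)", "pickup(banana)", "place(banana, fridge)"]
for _ in range(5):                      # bounded number of replanning rounds
    error = simulate(plan)
    if error is None:
        break
    plan = replan(plan, error)
print(plan)  # ['goto(kitchen)', 'pickup(banana)', 'open(fridge)', 'place(banana, fridge)']
```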
The approach produced more correct and executable plans for a mobile robot than current baseline techniques. SayPlan thus addressed two key issues: mitigating erroneous LLM outputs and hallucinations when generating long-horizon plans in expansive environments, and representing large-scale scenes within LLM token limits.
Limitations of the approach and future outlook
The graph-based reasoning of the underlying LLM fails at node negation, node counting, and simple distance-based reasoning, which is a significant limitation of the proposed approach.
Additionally, the current SayPlan framework requires a pre-built 3DSG and assumes that all objects remain static post-map generation, which significantly restricts the adaptability of the framework to dynamic real-world environments.
Thus, more research is required to fine-tune LLMs for large-scale robotic task planning, to incorporate more capable graph reasoning tools that facilitate decision-making, and to integrate online scene graph simultaneous localization and mapping (SLAM) systems within the SayPlan framework.
Journal reference:
- Preliminary scientific report.
Rana, K., Haviland, J., Garg, S., Reid, I., Suenderhauf, N., & Abou-Chakra, J. (2023). SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Task Planning. arXiv. https://doi.org/10.48550/arXiv.2307.06135, https://arxiv.org/abs/2307.06135