In an article recently posted to the arXiv* preprint server, researchers proposed AutoGPT+P, a system for robotic task planning using large language models (LLMs).
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Natural language is a highly efficient way for humans to interact with robots. LLMs have proven effective as zero-shot learners and have gained substantial attention in cognitive computing research. However, LLMs cannot directly translate a natural language instruction into an executable robot plan owing to their limited reasoning capabilities.
Recent advances in task planning combine LLMs with classical planning algorithms to improve generalizability and compensate for the LLMs' limited reasoning. However, these approaches must dynamically capture the initial state of the planning problem, which remains a major challenge.
The proposed AutoGPT+P
In this study, researchers proposed AutoGPT+P, a system that merges an affordance-based scene representation with a planning system, to address this challenge. The system allows users to command robots in natural language and derives and executes a plan to fulfill the user's request even when the objects required for the task are not present in the immediate environment.
AutoGPT+P operates in two stages. The first stage extracts scene affordances from visual data, perceiving the environment as a set of objects. The second stage performs task planning based on the user's specified goal and the established affordance-based scene representation; here, AutoGPT+P employs an LLM to select the tools used to generate a plan for task completion.
Affordances capture an agent's possibilities for action on the environment and the objects within it. The planning domain for symbolic planning with arbitrary objects can therefore be derived from an affordance-based scene representation.
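The idea of deriving a planning domain from affordances can be sketched in a few lines. Note that the affordance-to-action table and all function names below are illustrative assumptions, not the authors' actual mapping or API:

```python
# Toy affordance-to-action table; the real system derives richer
# symbolic operators, this is only a minimal stand-in.
AFFORDANCE_ACTIONS = {
    "graspable": "grasp",
    "pourable": "pour",
    "wipeable": "wipe",
}

def derive_actions(oam):
    """Emit the symbolic (action, object) pairs that each object's
    affordances make applicable, forming a tiny planning domain."""
    actions = set()
    for obj, affordances in oam.items():
        for aff in affordances:
            if aff in AFFORDANCE_ACTIONS:
                actions.add((AFFORDANCE_ACTIONS[aff], obj))
    return actions

oam = {"cup": {"graspable", "pourable"}, "table": {"supportable"}}
print(sorted(derive_actions(oam)))
# → [('grasp', 'cup'), ('pour', 'cup')]
```

Because the mapping is keyed by affordance rather than by object class, any newly detected object with a known affordance immediately yields usable planning operators.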
The proposed AutoGPT+P system leverages this representation to derive and execute a plan for a task the user specifies in natural language. AutoGPT+P can solve planning tasks under a closed-world assumption and handle planning with inadequate information, for example by providing a partial plan, suggesting alternatives, or exploring the scene when objects are missing.
Specifically, the system responds dynamically to such limitations by exploring the environment for missing objects or advancing toward a sub-goal. The affordance-based scene representation combines object detection with an object-affordance mapping (OAM) generated automatically using ChatGPT; this constitutes the first stage of AutoGPT+P.
The OAM defines the relations between object classes and the set of affordances associated with instances of those classes. The task planning approach in the second stage achieves a task goal through four tools (plan, partial plan, suggest alternative, and explore), driven by LLM-based tool selection and the established OAM. Moreover, the core planning tool can automatically correct syntactic and semantic errors, and AutoGPT+P enables the robot to seek human assistance when it encounters issues while executing the actions required to reach the goal.
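The division of labor among the four tools can be illustrated with a toy dispatch function. The tool names match the article, but the selection heuristic below is a hypothetical stand-in for the LLM-based choice:

```python
def choose_tool(goal_object, scene_objects, alternatives):
    """Toy stand-in for LLM-based tool selection among the four
    planning tools; real selection is done by the LLM in context."""
    if goal_object in scene_objects:
        return "plan"                 # required object is present
    if alternatives:
        return "suggest alternative"  # a substitute object exists
    if scene_objects:
        return "partial plan"         # advance toward a sub-goal
    return "explore"                  # search the scene for objects

# The cup is missing but a mug is available as a substitute.
print(choose_tool("cup", {"sponge"}, {"mug"}))  # → suggest alternative
```

Even this crude heuristic shows why the system degrades gracefully: when full planning is impossible, it falls back to substitution, partial progress, or exploration rather than failing outright.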
Evaluation and findings
Researchers evaluated the proposed approach in simulation using 150 scenarios with various tasks to accomplish, such as sorting, wiping, heating, chopping, pouring, handover, and picking and placing. They also performed real-world validation experiments using humanoid robots ARMAR-DE and ARMAR-6 that demonstrated a subset of these tasks.
Initially, the performance of the OAM was evaluated on the proposed affordances. Then, the success rate of the suggest alternative tool was assessed against a naive alternative-suggestion baseline. Subsequently, the plan tool was compared against SayCan, both on the SayCan instruction set and on the researchers' own evaluation set of 150 tasks, before the entire AutoGPT+P system was assessed on tool selection scenarios.
Precision, recall, and F1-score served as the evaluation metrics for the OAM. The suggest alternative tool and the naive baseline were both evaluated on 30 predefined scenarios spanning three difficulty levels.
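These are the standard classification metrics; for readers unfamiliar with them, the definitions are shown below for a single affordance class with hypothetical counts (the numbers are not from the paper):

```python
def precision_recall_f1(tp, fp, fn):
    """Standard metrics from true-positive, false-positive, and
    false-negative counts for one affordance class."""
    precision = tp / (tp + fp)              # correctness of predictions
    recall = tp / (tp + fn)                 # coverage of true labels
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(round(p, 2), round(r, 2), round(f1, 2))  # → 0.8 0.8 0.8
```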
Each scenario included a list of permitted objects, the objects in the scene, the user-specified task, and the missing object; a task was considered accomplished when the method proposed one of the allowed alternatives. Evaluated on planning alone (without considering execution), the plan tool significantly outperformed SayCan, the current state-of-the-art LLM-based planning method, on SayCan's instruction set, achieving a 98% success rate versus SayCan's 81%.
The self-correction of syntactic and semantic errors had a significant impact, raising the success rate from 79% to 98%. Additionally, the proposed affordance-guided suggest alternative tool outperformed the naive approach by 13% in scenes with 20 and 70 objects. Moreover, the evaluation of the AutoGPT+P system's overall performance showed a 79% success rate on the newly created dataset of 150 scenarios.
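The self-correction mechanism credited with this improvement can be pictured as a generate-validate-retry loop. The sketch below is a hedged illustration under assumed interfaces; both the validator and the LLM re-prompt are stand-ins, not the authors' implementation:

```python
def plan_with_correction(generate, validate, max_retries=3):
    """Re-prompt the generator with validator feedback until a
    candidate passes syntactic/semantic checks, or give up."""
    feedback = None
    for _ in range(max_retries):
        plan = generate(feedback)      # LLM produces a candidate plan
        ok, feedback = validate(plan)  # symbolic check of the output
        if ok:
            return plan
    return None  # unrecoverable after repeated failures

# Toy usage: the first candidate is malformed, the second passes a
# trivial balanced-parentheses check standing in for real validation.
candidates = iter(["(goal (on cup", "(goal (on cup table))"])
plan = plan_with_correction(
    lambda fb: next(candidates),
    lambda p: (p.count("(") == p.count(")"), "unbalanced parentheses"),
)
print(plan)  # → (goal (on cup table))
```

Feeding the validator's error message back into the next generation attempt is what distinguishes self-correction from simple retrying.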
Overall, the generated plans were successfully executed on the robot, and the symbolic representation of objects from the planning domain was transferred to the subsymbolic object representations required for skill execution.