In an article recently submitted to the arXiv* server, researchers proposed a new approach to robotic exploration in dynamic environments. They introduced the concept of interactive scene exploration, in which a robot autonomously navigates and interacts with its surroundings to build an action-conditioned scene graph (ACSG).
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
The ACSG captures both low-level details, such as object geometry and semantics, and high-level action-conditioned relationships between objects. The researchers' robotic exploration (RoboEXP) system combines a large multimodal model (LMM) with explicit memory to enhance exploration capabilities. The robot incrementally constructs the ACSG, accumulating new information by reasoning about what to explore and how to explore it. The effectiveness of RoboEXP was demonstrated across various real-world scenarios, showcasing its ability to facilitate manipulation tasks involving objects ranging from rigid to deformable.
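For illustration only, such a graph might be represented as a directed graph whose nodes carry object geometry and semantics and whose edges record the action that exposes one object from another. The names and attributes below are hypothetical, not the authors' implementation:

```python
import networkx as nx

# Hypothetical sketch of an action-conditioned scene graph (ACSG):
# nodes hold low-level object state; edges record the action that
# reveals or relates one object to another.
acsg = nx.DiGraph()

# Nodes: objects with semantics and rough geometry (a bounding box).
acsg.add_node("cabinet", label="cabinet", bbox=(0.2, 0.1, 0.5, 0.6))
acsg.add_node("mug", label="mug", bbox=(0.3, 0.2, 0.1, 0.1))

# Edge: the mug is observable only after opening the cabinet, so the
# relation is conditioned on the "open" action.
acsg.add_edge("cabinet", "mug", action="open", relation="inside")

# Query: which actions expose a target object?
path = nx.shortest_path(acsg, source="cabinet", target="mug")
actions = [acsg.edges[u, v]["action"] for u, v in zip(path, path[1:])]
print(actions)  # ['open']
```

Traversing the edges then yields the sequence of actions needed to expose a hidden object, which is what makes such a representation useful for downstream manipulation.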
Related Work
Past work in robotics has primarily focused on exploring static environments or on limited interactions with specific object categories or actions. This approach has encountered several limitations. Existing methods often lack adaptability to dynamic environments, offer limited exploration capabilities, and may overlook regions that require active interaction.
Moreover, they often rely on a narrow set of predefined actions, gather information inefficiently, and struggle to scale to more complex tasks. These issues highlight the need for advances in interactive scene exploration that enable robots to navigate and interact effectively in real-world environments.
RoboEXP System Overview
This section outlines the RoboEXP system's structure, including perception, memory, decision-making, and action modules. Collectively, these components enable autonomous exploration of unknown environments, emphasizing closed-loop processes that accommodate multi-step reasoning and potential interventions.
Researchers designed the RoboEXP system to explore unknown environments autonomously by observing and interacting with them. It consists of four key components: perception, memory, decision-making, and action modules. Raw RGBD images captured through a wrist-mounted camera are processed by the perception module to extract scene semantics such as object labels, 2D bounding boxes, and segmentation masks. The memory module then receives this semantic information and merges the 2D data into a 3D representation. The 3D information guides the decision module in selecting appropriate actions to further explore or observe the environment, while the action module executes the planned actions, generating new observations.
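A minimal sketch of this closed loop, assuming placeholder functions for each module (none of these names or signatures come from the paper's code), could look like:

```python
# Hypothetical closed-loop skeleton of the RoboEXP data flow; the
# four stubs stand in for the perception, memory, decision, and
# action modules described in the article.

def perceive(frame):
    """Extract labels, 2D boxes, and masks from an RGBD frame (stub)."""
    return []  # e.g., a list of (label, box, mask) tuples

def update_memory(scene_graph, detections, pose):
    """Merge 2D detections into the 3D action-conditioned scene graph."""
    scene_graph.setdefault("objects", []).extend(detections)

def decide(scene_graph):
    """Propose the next exploration action, or None when done (stub)."""
    return None

def act(action):
    """Execute the action on the robot; return a fresh observation (stub)."""
    return None, None

def explore(frame, pose, scene_graph):
    """Closed loop: perceive, update memory, decide, act, repeat."""
    while True:
        detections = perceive(frame)
        update_memory(scene_graph, detections, pose)
        action = decide(scene_graph)
        if action is None:           # nothing left to explore
            return scene_graph
        frame, pose = act(action)    # acting yields a new observation
```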
The perception module employs advanced techniques such as Grounding DINO for open-vocabulary object detection, the high-quality Segment Anything Model (SAM-HQ) for segmentation, and CLIP for extracting semantic features. The memory module constructs the ACSG of the environment by assimilating observations over time. It employs voxel-based representations for efficient computation and memory updates, handling merging across different viewpoints and time steps. The decision module utilizes a large multimodal model (LMM), such as GPT-4V (a generative pre-trained transformer with vision capabilities), for action proposal and verification, effectively guiding the system in choosing efficient actions.
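Of these components, CLIP feature extraction is the most standardized step. As a sketch, the Hugging Face transformers implementation of CLIP (an assumed library choice; the paper does not mandate it) could compute a semantic feature for each object crop produced by the detection and segmentation stages:

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Assumption: Grounding DINO and SAM-HQ have already produced an
# object crop; this public CLIP checkpoint stands in for whatever
# CLIP variant RoboEXP actually uses.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def object_semantic_feature(crop: Image.Image) -> torch.Tensor:
    """Return a unit-length CLIP embedding for a detected object crop."""
    inputs = processor(images=crop, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)
```

Such normalized embeddings are what allow objects seen from different viewpoints or time steps to be matched and merged in the memory module.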
The action module focuses on constructing the ACSG through interaction with the environment, employing heuristic-based action primitives. It dynamically plans and adapts actions in a closed-loop manner, enabling continuous exploration based on environmental feedback. Additionally, the system incorporates an action stack for managing multi-step reasoning and prioritizing actions based on decisions from the decision module. Finally, to maintain scene consistency, a greedy strategy is employed to return objects to their original states after exploration, ensuring practicality for real-world applications.
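The interplay of the action stack and the greedy restoration strategy can be sketched as follows; the primitive names and the propose() interface are assumptions for illustration, not RoboEXP's actual code:

```python
# Hypothetical sketch of an action stack with greedy state restoration.
# Each executed action is paired with its inverse so the scene can be
# returned to its original configuration after exploration.

INVERSE = {"open": "close", "pick_up": "put_back"}  # assumed primitives

def explore_with_stack(propose, execute):
    """Pop pending actions, record inverses, then restore the scene.

    `propose` is assumed to return only *newly* discovered actions,
    e.g. ones exposed by the interaction that just ran.
    """
    stack = list(propose())          # initial proposals from the LMM
    undo = []                        # inverse actions, applied LIFO
    while stack:
        verb, obj = stack.pop()
        execute(verb, obj)
        undo.append((INVERSE[verb], obj))
        stack.extend(propose())      # interaction may reveal new targets
    while undo:                      # greedy restoration, most recent first
        execute(*undo.pop())

# Example: one cabinet to open, nothing revealed afterward.
actions = iter([[("open", "cabinet")], []])
explore_with_stack(lambda: next(actions, []), lambda v, o: print(v, o))
# prints: open cabinet, then close cabinet
```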
RoboEXP System Evaluation Analysis
This section evaluates the performance of the RoboEXP system in various tabletop scenarios for interactive scene exploration. The experiments aim to answer fundamental questions regarding the system's effectiveness and utility in facilitating downstream tasks. The assessment compares the system's performance against a baseline, considering success rate, object recovery, state recovery, unexplored space, and graph edit distance.
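Of these metrics, graph edit distance is the easiest to reproduce generically; networkx provides an implementation that can score a predicted scene graph against a ground-truth one (an illustrative setup, not the paper's evaluation script):

```python
import networkx as nx

# Generic illustration: score a predicted ACSG against ground truth
# using graph edit distance, matching nodes and edges by their labels.
gt = nx.DiGraph()
for name in ("cabinet", "mug", "bowl"):
    gt.add_node(name, label=name)
gt.add_edge("cabinet", "mug", action="open")
gt.add_edge("cabinet", "bowl", action="open")

pred = nx.DiGraph()
for name in ("cabinet", "mug"):
    pred.add_node(name, label=name)
pred.add_edge("cabinet", "mug", action="open")  # the bowl was missed

dist = nx.graph_edit_distance(
    gt, pred,
    node_match=lambda a, b: a["label"] == b["label"],
    edge_match=lambda a, b: a["action"] == b["action"],
)
print(dist)  # 2.0: one node deletion plus one edge deletion
```

A lower distance means the constructed graph is closer to the ground-truth scene structure, with zero indicating a perfect match.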
All experiments are conducted in a real-world setting, utilizing a RealSense D455 camera mounted on the robot arm and a UFACTORY xArm 7 robotic arm to execute actions. The experimental setup encompasses diverse objects, providing a realistic testing environment for the system.
The system's efficacy in various exploration scenarios is evaluated against a baseline: GPT-4V augmented with ground-truth actions. Researchers design five types of experiments, each comprising ten different settings that vary in object number, type, and layout. Quantitative and qualitative analyses demonstrate the system's superiority in constructing comprehensive ACSGs across diverse tasks.
The scenarios exemplify the efficacy of the generated ACSG in manipulation tasks and its capability to adapt to environmental changes autonomously. The ACSG not only enhances downstream manipulation tasks but also assists in recognizing task feasibility and seamlessly adapting to human interventions.
Despite its effectiveness, there is room for improvement in the system, particularly in addressing failures arising from detection and segmentation errors in the perception module. Future directions include enhancing visual foundation models for semantic understanding and integrating sophisticated skill modules to improve decision-making and action execution.
Conclusion
In summary, researchers introduced RoboEXP as a robust robotic exploration framework powered by foundation models. It effectively identifies all objects in complex scenes, whether directly observable or revealed through interaction, utilizing an action-conditioned 3D scene graph.
Experiments demonstrated RoboEXP's superiority in interactive scene exploration, surpassing a strong GPT-4V-based baseline. The reconstructed scene graph is pivotal for guiding complex downstream tasks, such as breakfast preparation, in diverse environments. The system paves the way for practical robotic deployment in households and offices, enhancing everyday usability.