Researchers unveil iDP3, an improved 3D visuomotor policy that allows full-sized humanoid robots to perform practical manipulation skills autonomously, overcoming previous policies' limitations in complex, unseen environments.
Humanoid manipulation in diverse unseen scenarios. With only data collected from a single scene, our Improved 3D Diffusion Policy (iDP3) enables a full-sized humanoid robot to perform practical skills in diverse real-world environments. The scenes are not cherry-picked.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
In an article recently submitted to the arXiv preprint* server, researchers from Stanford University, Simon Fraser University, University of Pennsylvania (UPenn), University of Illinois at Urbana-Champaign (UIUC), and Carnegie Mellon University (CMU) introduced the improved 3D diffusion policy (iDP3), a 3D visuomotor policy designed to enhance autonomous humanoid robot manipulation in diverse environments.
iDP3 overcame the limitations of previous policies by using egocentric 3D visual representations defined directly in the robot's camera frame, eliminating the need for camera calibration and point-cloud segmentation. The researchers demonstrated that iDP3 enabled full-sized humanoid robots to perform various tasks autonomously in real-world scenarios using only lab-collected data.
Background
Past work in visuomotor policy learning primarily relied on state estimation or focused on image-based imitation learning, which struggled to generalize in complex environments. Recent methods, such as the 3D Diffusion Policy (DP3), showed improved generalization but remained dependent on precise camera calibration and segmentation. Other approaches, such as Maniwhere and Robot Utility Models, required large datasets or complex pipelines for scene generalization.
Improved Humanoid Robot Manipulation System
DP3 is a 3D visuomotor policy designed for robotic manipulation tasks, but its reliance on precise camera calibration and point cloud segmentation limits its applicability to general-purpose robots such as humanoids. iDP3 was developed to overcome these challenges by using egocentric 3D visual representations and dropping the calibration and segmentation requirements. Scaling up the vision input with more sampled points allowed the policy to capture the entire scene, improving task performance. In addition, iDP3 replaced the original multilayer perceptron (MLP) visual encoder with a pyramid convolutional encoder and extended the prediction horizon, which made the behavior learned from noisy human demonstrations smoother and more accurate.
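The paper summarized here does not spell out the encoder's exact layer sizes, so the PyTorch sketch below only illustrates the general idea behind a pyramid convolutional point-cloud encoder: pointwise convolutions whose intermediate features are each pooled over the points and then fused, in place of a single MLP. The class name, layer widths, and output dimension are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PyramidPointEncoder(nn.Module):
    """Schematic pyramid encoder for point clouds: pointwise (1x1) convolutions,
    with each stage's features max-pooled over the points and then fused.
    Class name, widths, and output size are illustrative placeholders."""
    def __init__(self, in_dim=3, widths=(64, 128, 256), out_dim=256):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_dim
        for w in widths:
            self.stages.append(nn.Sequential(nn.Conv1d(prev, w, kernel_size=1), nn.ReLU()))
            prev = w
        self.fuse = nn.Linear(sum(widths), out_dim)   # combine pooled pyramid features

    def forward(self, points):                        # points: (B, N, 3)
        x = points.transpose(1, 2)                    # -> (B, 3, N) for Conv1d
        pooled = []
        for stage in self.stages:
            x = stage(x)                              # (B, width, N)
            pooled.append(x.max(dim=2).values)        # pool over points -> (B, width)
        return self.fuse(torch.cat(pooled, dim=1))    # (B, out_dim) visual feature

# Example usage with a batch of 8 egocentric clouds of 4096 points each:
# feat = PyramidPointEncoder()(torch.rand(8, 4096, 3))   # -> torch.Size([8, 256])
```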
Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies
iDP3 was implemented on a full-sized humanoid robot, the Fourier General Robot 1 (GR1), equipped with two Inspire dexterous hands and a light detection and ranging (LiDAR) camera mounted on its head for egocentric vision. The robot was moved between scenes on a height-adjustable cart, with its lower body disabled for stability, eliminating the need for complex whole-body control.
The RealSense L515 LiDAR camera captured high-quality 3D point clouds, offering more accurate depth sensing than previous models. However, the sensor posed challenges: its CPU demand introduced teleoperation latency, and its point clouds were still not perfectly accurate. The authors also anticipated that improved whole-body control would benefit future deployments.
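To make the egocentric 3D representation concrete, the minimal sketch below back-projects a depth image into a point cloud expressed directly in the camera's own frame, which is why no extrinsic calibration or segmentation is needed. The function name, intrinsics, and sampled point count are placeholders for illustration, not the authors' code.

```python
import numpy as np

def depth_to_egocentric_cloud(depth, fx, fy, cx, cy, num_points=4096):
    """Back-project a depth image (metres) into a point cloud expressed directly
    in the camera/egocentric frame; no extrinsics or segmentation involved.
    Function name and point count are placeholders for illustration."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))     # pixel coordinates
    z = depth.astype(np.float32)
    valid = z > 0                                      # drop pixels with no depth return
    x = (u - cx) * z / fx                              # pinhole back-projection
    y = (v - cy) * z / fy
    cloud = np.stack([x[valid], y[valid], z[valid]], axis=-1)
    # Subsample to a fixed-size input for the policy.
    idx = np.random.choice(len(cloud), num_points, replace=len(cloud) < num_points)
    return cloud[idx]
```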
For data collection, the robot's upper body was teleoperated using Apple Vision Pro (AVP), which accurately tracked the operator's hand, wrist, and head poses. Relaxed IK mapped these poses to robot joint targets, the vision system streamed real-time feedback back to the AVP, and including the robot's waist in the teleoperation added flexibility. Even so, the LiDAR sensor's CPU load introduced teleoperation latency. The system collected high-quality trajectories, including visual and proprioceptive data, for imitation learning.
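A single teleoperation and data-collection step might look like the following sketch; every helper shown (the AVP pose reading, the Relaxed-IK solve, the robot command, the logging) is a hypothetical placeholder used only to illustrate the data flow described above, not the authors' actual API.

```python
# Hypothetical sketch of one teleoperation/data-collection step. All helper
# names (read_poses, solve, command, egocentric_cloud, ...) are placeholders.
def collect_step(avp, ik_solver, robot, camera, episode):
    head, wrists, fingers = avp.read_poses()          # poses tracked by Apple Vision Pro
    arm_targets = ik_solver.solve(wrists)             # Relaxed IK: wrist poses -> arm joints
    robot.command(arms=arm_targets, hands=fingers)    # drive the upper body (incl. waist)
    episode.append({
        "point_cloud": camera.egocentric_cloud(),     # LiDAR depth -> camera-frame points
        "proprio": robot.joint_positions(),           # proprioceptive state
        "action": {"arms": arm_targets, "hands": fingers},
    })
```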
The iDP3 model was trained on these collected human demonstrations and deployed without camera calibration or manual point cloud segmentation, allowing seamless adaptation to new scenes. This system, tested in complex real-world environments, none of which were cherry-picked, demonstrated the effectiveness of egocentric 3D representations for humanoid robot manipulation. Videos showcasing iDP3's performance in diverse environments are available on the project website.
iDP3 Performance and Improvements
To evaluate the effectiveness of the iDP3 system, a Pick&Place task was used as the primary benchmark. The task involves the robot grasping a lightweight cup and moving it aside, which is challenging because the cup is similar in size to the robot's dexterous hands.
The task requires high precision to avoid collisions, and policies were trained in four settings combining egocentric and third-person views with varying numbers of demonstrations. Each method was tested over 130 trials, and performance was measured by the number of successful grasps and the total attempts, reflecting both accuracy and smoothness of execution.
Compared against several strong baselines, iDP3 outperformed traditional diffusion policy (DP) variants, including DP with a frozen reusable representations for robotic manipulation (R3M) encoder and DP3 with its original encoder. However, DP with a fine-tuned R3M encoder proved a particularly strong competitor, slightly outperforming iDP3 in some settings.
Despite this, image-based methods like DP were prone to overfitting to specific objects and scenes, struggling to generalize to new environments. The paper highlights that iDP3's 3D visual observations were noisy due to hardware limitations, particularly with the depth sensor's performance, suggesting more accurate 3D observations could further enhance performance.
Ablation studies revealed that the key modifications in iDP3, namely the improved visual encoder, the scaled-up visual input, and the extended prediction horizon, were essential for achieving optimal performance and for learning smooth, accurate behavior from human data. iDP3 also trained significantly faster than DP and remained efficient even as the number of input points was scaled up.
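The ablated components can be summarized as a small configuration contrast; the numbers below are illustrative placeholders, not the paper's exact hyperparameters.

```python
# Illustrative contrast of the three ablated design choices (placeholder values).
DP3_STYLE_BASELINE = {"visual_encoder": "mlp",          "num_points": 1024, "prediction_horizon": 4}
IDP3               = {"visual_encoder": "pyramid_conv", "num_points": 4096, "prediction_horizon": 16}
```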
Conclusion
To sum up, this work presented an imitation learning system enabling a full-sized humanoid robot to generalize practical manipulation skills in diverse real-world environments using data collected solely in the lab. The iDP3 showcased impressive generalization capabilities.
However, limitations included the tiring nature of teleoperation with AVP, noisy point clouds from the depth sensor, time-consuming data collection for fine-grained manipulation skills, and the robot's unused lower body, a consequence of the difficulty of maintaining balance. The CPU demands of the LiDAR sensor, which caused teleoperation latency, were also noted as a barrier to scaling data collection. Future work aims to scale up the training of 3D visuomotor policies with high-quality data.
Journal reference:
- Preliminary scientific report. Yanjie Ze et al. (2024). Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies. arXiv. DOI: 10.48550/arXiv.2410.10803, https://arxiv.org/html/2410.10803v1