In a paper submitted to the arXiv* server, researchers introduced a framework to synthesize plausible human grasping motions, enabling large-scale training for human-to-robot handovers without expensive motion capture data.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Hand-Object Motion Synthesis
Capturing diverse real-world human motions to train robot handover policies is prohibitively inefficient, requiring extensive setups with markers, cameras, and post-processing. However, recent advances in generative hand-object interaction synthesis methods promise to automatically generate natural grasping motions at scale without costly motion capture.
The proposed framework builds on D-Grasp, a reinforcement learning-based technique for synthesizing dynamic hand-object manipulation sequences. D-Grasp takes grasp pose references as input to guide the motion generation process. To use D-Grasp for robotic handover training, it must therefore first be supplied with grasp poses suited to handovers. The core challenge is generating grasp poses with control over the approach direction, since humans tend to offer objects to robots in a handover-friendly manner. Furthermore, the grasp generation must generalize reliably to arbitrary objects rather than only to instances seen during training.
To address this, the researchers propose an optimization-based grasp generator focused on producing handover-suitable grasps. Given an object mesh, it outputs a pre-grasp pose and an optimized grasp reference based on the object geometry and a specified grasp direction vector. This allows the object approach direction to be controlled explicitly during motion synthesis so that it matches natural human handovers. Because the grasp optimization is not learning-based, it generalizes to new objects, unlike data-driven grasp predictors that rely on matching their training distribution.
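To make the interface concrete, the sketch below shows one way a specified grasp direction vector could determine where the pre-grasp palm pose is placed relative to the object mesh. The helper name, the standoff distance, and the ray-casting scheme are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
import trimesh


def pre_grasp_pose(mesh: trimesh.Trimesh, grasp_dir: np.ndarray, standoff: float = 0.08):
    """Place the palm along a requested grasp direction (illustrative only).

    The palm is positioned `standoff` metres away from the object surface on
    the side indicated by `grasp_dir`, so the approach direction of the
    synthesized human grasp can be controlled explicitly.
    """
    grasp_dir = grasp_dir / np.linalg.norm(grasp_dir)
    # Cast a ray from well outside the object back toward it to find the
    # surface point on the chosen side (assumes the ray actually hits).
    origin = mesh.centroid + 2.0 * mesh.scale * grasp_dir
    hits, _, _ = mesh.ray.intersects_location([origin], [-grasp_dir])
    surface_point = hits[np.argmax(hits @ grasp_dir)]   # hit closest to the origin side
    palm_position = surface_point + standoff * grasp_dir
    return palm_position, -grasp_dir                     # palm position and approach axis
```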
The optimization is designed to mimic key aspects of human pre-grasping and to establish stable contact through a gripper-like configuration formed by the thumb and the opposing fingers. Multiple loss terms, including hand-object penetration, touch, and fingertip-distance losses, are combined to imitate the closing of the hand around the object. The system can thus synthesize diverse human grasping motions tailored for handovers without requiring human motion capture data: it generates optimized grasp poses on many objects and passes them to D-Grasp as references.
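A rough sketch of how the three loss terms mentioned above might be combined during the grasp optimization follows. The exact formulations and weights are assumptions for illustration, not the paper's implementation.

```python
import torch


def grasp_losses(hand_points, fingertips, thumb_tip, object_sdf, w=(1.0, 1.0, 0.5)):
    """Illustrative combination of penetration, touch, and fingertip-distance terms.

    hand_points : (N, 3) points sampled on the hand surface
    fingertips  : (F, 3) fingertip positions of the non-thumb fingers
    thumb_tip   : (3,)   thumb tip position
    object_sdf  : callable mapping (M, 3) points to signed distances
                  (negative inside the object)
    """
    sdf_hand = object_sdf(hand_points)

    # Penetration loss: penalize hand points that end up inside the object.
    penetration = torch.relu(-sdf_hand).sum()

    # Touch loss: pull fingertips onto the object surface (|sdf| -> 0).
    touch = object_sdf(fingertips).abs().mean()

    # Fingertip-distance loss: close the thumb against the opposing fingers,
    # mimicking a parallel "gripper" formed by the hand.
    closing = (fingertips - thumb_tip).norm(dim=-1).mean()

    return w[0] * penetration + w[1] * touch + w[2] * closing
```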
To improve D-Grasp's generalization to novel objects at test time, the model's observation space is extended with additional object shape information. Specifically, signed distance values sampled across the object surface are appended to the default D-Grasp observations. Combined with the optimized grasp references, this shape conditioning enables D-Grasp to produce more natural and stable human motions for new objects beyond the training distribution.
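The shape conditioning could be approximated roughly as below; the number of samples, the sampling region around the hand, and the concatenation scheme are assumptions for illustration rather than the paper's exact configuration.

```python
import numpy as np


def augment_observation(base_obs: np.ndarray, object_sdf, hand_position: np.ndarray,
                        num_samples: int = 64, radius: float = 0.15, seed: int = 0):
    """Append signed-distance samples to the default observation (illustrative).

    `object_sdf` is a callable returning the signed distance of query points
    to the object surface; negative values indicate the interior.
    """
    rng = np.random.default_rng(seed)
    offsets = rng.uniform(-radius, radius, size=(num_samples, 3))
    queries = hand_position[None, :] + offsets          # sample near the hand
    sdf_values = object_sdf(queries)                     # (num_samples,)
    return np.concatenate([base_obs, sdf_values])        # shape-conditioned observation
```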
Through experiments, the researchers demonstrate that the synthesized motions exhibit handover-friendly approach directions and plausible hand-object interaction for training robotic handover policies. By generating grasp references on over 1,000 object models from ShapeNet and mirroring the motions for left and right hands, the method constructs a large-scale synthetic dataset of human grasping sequences without requiring any actual motion capture data.
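The left/right mirroring step can be illustrated with a simple reflection of the motion data. The joint-position representation used here is an assumption; the paper may mirror rotations or other pose parameters instead.

```python
import numpy as np


def mirror_hand_motion(positions: np.ndarray, mirror_axis: int = 0):
    """Mirror a right-hand motion sequence to obtain a left-hand one (illustrative).

    positions : (T, J, 3) array of hand joint positions over T frames.
    Reflecting across the plane normal to `mirror_axis` flips handedness, so a
    right-hand grasp becomes a geometrically equivalent left-hand grasp.
    """
    mirrored = positions.copy()
    mirrored[..., mirror_axis] *= -1.0
    return mirrored
```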
Expert Demonstrations
The synthetic human grasping data is leveraged to train a robotic handover policy in a simulated environment following prior work. The robot comprises a seven-degrees-of-freedom (DoF) arm with a parallel-jaw gripper and a wrist-mounted red-green-blue-depth (RGB-D) camera that provides raw sensory input. The policy model is a neural network that takes a segmented point cloud of the scene as input and predicts low-level actions to control the robot during the handover. Training follows a two-stage procedure that combines pre-training on planning-based expert demonstrations with fine-tuning via reinforcement learning.
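As a rough illustration of such a point-cloud-conditioned policy, the sketch below uses a PointNet-style per-point encoder with a pooled action head. The architecture and the action parameterization are assumptions; the actual design follows the prior handover work the authors build on.

```python
import torch
import torch.nn as nn


class HandoverPolicy(nn.Module):
    """Minimal sketch of a point-cloud-conditioned handover policy (illustrative)."""

    def __init__(self, action_dim: int = 6):
        super().__init__()
        # Per-point features: xyz plus a segmentation label (hand vs. object).
        self.point_mlp = nn.Sequential(
            nn.Linear(4, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.action_head = nn.Sequential(
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 4) segmented point cloud from the wrist RGB-D camera.
        features = self.point_mlp(points)              # (B, N, 128)
        global_feature = features.max(dim=1).values    # permutation-invariant pooling
        return self.action_head(global_feature)        # low-level end-effector action
```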
In the initial pre-training stage, the simulated human remains stationary once the robot starts actuating. This simplified setting allows generating supervised training data by recording state-action trajectories produced by motion planning and grasp selection models. The motion planner enables collision-free reaching, while the grasp model provides feasible grasp poses on the target object drawn from the ACRONYM grasp dataset. Grasp poses are selected to approach the object from the side opposite the specified human grasp direction, avoiding collisions between the robot and the simulated human hand during pre-training. This pre-training provides a strong initialization for the robot policy before the challenging interactive fine-tuning stage.
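The direction-based grasp selection could look roughly like the following filter; the cosine-similarity threshold and data layout are illustrative assumptions.

```python
import numpy as np


def filter_robot_grasps(grasp_approach_dirs: np.ndarray, human_grasp_dir: np.ndarray,
                        max_cos: float = 0.0):
    """Keep robot grasps that approach from the side opposite the human grasp.

    grasp_approach_dirs : (G, 3) approach directions of candidate grasps
                          (e.g., drawn from a grasp dataset such as ACRONYM).
    human_grasp_dir     : (3,) direction from which the human hand holds the object.
    A grasp is kept if its approach direction points away from the human hand,
    i.e., its cosine similarity with the human grasp direction is below `max_cos`.
    """
    human_grasp_dir = human_grasp_dir / np.linalg.norm(human_grasp_dir)
    dirs = grasp_approach_dirs / np.linalg.norm(grasp_approach_dirs, axis=1, keepdims=True)
    cos_sim = dirs @ human_grasp_dir
    return np.where(cos_sim < max_cos)[0]   # indices of collision-averse grasps
```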
Reinforcement Learning Fine-Tuning
In the second fine-tuning stage, the human hand and robot arm move simultaneously, requiring an interactive policy from the robot agent. The pre-trained model, however, reacts to a static input point cloud in an open-loop manner. To enable learning closed-loop policies, the pre-trained network is used as an expert that provides additional supervision alongside reinforcement learning rewards during policy fine-tuning in this dynamic environment. Losses based on imitation of and consistency with the expert help stabilize the fine-tuning phase. The complete policy training methodology follows the procedure developed in prior work, using actor-critic reinforcement learning augmented with auxiliary losses derived from the expert.
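A minimal sketch of how an expert-based auxiliary term might be folded into the fine-tuning objective is shown below; the specific imitation loss and its weighting are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F


def fine_tuning_loss(policy_actions, expert_actions, actor_loss, critic_loss,
                     bc_weight: float = 0.1):
    """Illustrative total loss for the interactive fine-tuning stage.

    actor_loss / critic_loss : standard actor-critic reinforcement learning objectives.
    expert_actions           : actions predicted by the frozen pre-trained policy
                               on the same observations.
    The imitation term keeps the fine-tuned policy close to the expert,
    which helps stabilize training in the dynamic environment.
    """
    imitation = F.mse_loss(policy_actions, expert_actions)
    return actor_loss + critic_loss + bc_weight * imitation
```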
The researchers performed extensive evaluations of the trained robot policies in simulation following the HandoverSim benchmark procedures. The results demonstrate that a policy trained purely on synthetically generated human motion data performs comparably to one trained on real human motion capture data from the DexYCB dataset. This is a key finding, indicating the realism and effectiveness of the proposed grasp optimization and D-Grasp pipeline for generating plausible handover motions.
Furthermore, tests on a large-scale synthetic test set containing thousands of unseen objects and randomized synthetic human motions reveal substantially better generalization for policies trained solely with synthetic data compared to the best baseline method. While performance understandably declines across the board on this challenging test set with its wide variety of object shapes, training on the large-scale synthetic motions improves the relative handover success rate by over 20% compared to the top baseline trained on real capture data. This confirms the benefits of diversified synthetic training for improving generalization to novel objects and human behaviors at test time.
Real Robot Evaluations
The robot control policy trained purely on synthetic human motion data is also transferred to a physical robot platform to assess real-world viability. In a user study, human participants performed handovers with the system trained on synthetic data and with another policy trained on real DexYCB human data. The results showed that participants could not reliably differentiate between the two systems based on their interaction experience.
Both systems exhibited handover success rates above 85% during the study. This suggests that the synthetic motions and the resulting policy exhibit characteristics highly similar to those obtained from real data, despite the policy being trained only in simulation. The qualitative real-world evaluations further support the naturalness and plausibility of the proposed synthesis method.
Future Outlook
The work demonstrates a promising approach towards fully simulation-based training for robotic handovers without requiring any expensive and cumbersome motion capture of real humans. By procedurally generating suitable and diverse human grasping motions at scale, the robot policy learns to handle more variability during training, potentially improving generalization to accommodate novel objects and human behaviors at test time.
Unlike data-driven grasp prediction models that struggle with out-of-distribution instances, the non-learning-based optimization strategy for generating handover-friendly grasp poses can reliably generalize to new objects. Furthermore, the additional conditioning provided to D-Grasp enables adapting the powerful motion synthesis capabilities to previously unseen objects. According to the authors, promising extensions for future work include integrating full-body motion models and multi-handed human interactions to enrich the diversity of synthetic training data further.
Overall, the study makes a compelling case that large-scale, procedurally generated data can help unlock the potential of simulation-based training for robotic handover systems that interact seamlessly with humans. The experiments further suggest that as foundation models continue advancing generalized content synthesis abilities, leveraging them could prove fruitful for automating diverse data generation to train more capable robotic agents. Beyond just human-to-robot handovers, the principles demonstrated could aid in scaling up simulation-based learning across a broader range of human-robot collaborative skills.
Journal reference:
- Preliminary scientific report.
Christen, S., Feng, L., Yang, W., Chao, Y.-W., Hilliges, O., & Song, J. (2023, November 9). SynH2R: Synthesizing Hand-Object Motions for Learning Human-to-Robot Handovers. arXiv. https://doi.org/10.48550/arXiv.2311.05599, https://arxiv.org/abs/2311.05599