In an article recently submitted to the arXiv* server, researchers introduced a new task for embodied artificial intelligence (AI) called human-aware vision-and-language navigation (HA-VLN). This task aims to bridge the gap between simulation and reality in vision-and-language navigation (VLN). To support this, they developed a realistic simulator named human-aware 3D (HA3D) and created two navigation agents.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
VLN is a benchmark for evaluating the simulation-to-real (Sim2Real) transfer capabilities of embodied AI agents, which learn from their environments. In VLN, an agent follows natural language instructions to navigate to a specific location within a three-dimensional (3D) space.
However, most existing VLN frameworks operate under simplifying assumptions such as static environments, optimal expert supervision, and panoramic action spaces. These constraints limit their applicability and robustness in real-world scenarios. In addition, such frameworks often overlook human activities in populated environments, which further limits their effectiveness in real-world navigation.
About the Research
In this paper, the authors developed HA-VLN, a novel task that extends VLN by incorporating human activities. In HA-VLN, an agent navigates environments populated with human activities while guided by natural language instructions. The task adopts an egocentric action space with a 60-degree field of view, mirroring human-like visual perception, and integrates dynamic environments with 3D human motion models built on the skinned multi-person linear (SMPL) model to capture realistic human poses and shapes. Furthermore, HA-VLN employs sub-optimal expert supervision, enabling the agent to learn an adaptive policy from imperfect demonstrations and thus better handle real-world tasks with imperfect instructions.
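To make the egocentric action space concrete, the short sketch below filters navigable viewpoints to those inside a 60-degree field of view, so the agent must rotate before it can move toward off-screen viewpoints. This is a minimal illustration only; the viewpoint records, headings, and function name are assumptions rather than the authors' code.

```python
import math

# Hypothetical navigable-viewpoint records: heading is the absolute bearing
# (in radians) from the agent to each candidate viewpoint.
CANDIDATES = [
    {"id": "vp_01", "heading": 0.10},
    {"id": "vp_02", "heading": 1.30},
    {"id": "vp_03", "heading": -0.40},
    {"id": "vp_04", "heading": 2.90},
]

FOV = math.radians(60)  # 60-degree egocentric field of view

def visible_candidates(candidates, agent_heading):
    """Keep only viewpoints whose bearing lies inside the agent's current view.

    With a 60-degree egocentric view, the agent must first turn toward a
    viewpoint before moving there, mirroring human-like perception.
    """
    visible = []
    for vp in candidates:
        # Smallest signed angular difference between candidate bearing and gaze.
        diff = (vp["heading"] - agent_heading + math.pi) % (2 * math.pi) - math.pi
        if abs(diff) <= FOV / 2:
            visible.append(vp)
    return visible

print(visible_candidates(CANDIDATES, agent_heading=0.0))  # keeps vp_01 and vp_03
```

Under a panoramic action space, by contrast, all four candidates would be selectable in a single step.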
To support HA-VLN, the researchers developed the HA3D simulator, which integrates dynamic human activities from the custom human activity and pose simulation (HAPS) dataset with photorealistic environments from the Matterport3D dataset. The HAPS dataset includes 145 human activities converted into 435 3D human motion models.
HA3D combines these human motion models with Matterport3D to create diverse and challenging navigation scenarios. It features an annotation tool for placing each human model in various indoor regions across 90 building scenes and uses Pyrender to render dynamic human bodies with high visual realism. HA3D also provides interfaces for agent-environment interaction, including first-person RGB-D video observation, navigable viewpoints, and human collision feedback.
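These agent-environment interfaces can be pictured as a standard observe-act loop. The sketch below is hypothetical: the class `DummyHA3DSim`, its methods, and the observation fields are assumptions modeled on the description above (RGB-D observation, navigable viewpoints, collision feedback), not the simulator's actual API.

```python
import random

class DummyHA3DSim:
    """Stand-in for the HA3D simulator; the real interface may differ."""

    def reset(self, instruction):
        self.t = 0
        return self._obs()

    def step(self, action):
        self.t += 1
        done = self.t >= 5                               # toy episode length
        info = {"human_collision": random.random() < 0.1}  # collision feedback
        return self._obs(), done, info

    def _obs(self):
        return {
            "rgbd": [[0.0] * 4] * 4,                      # placeholder RGB-D frame
            "navigable": ["vp_01", "vp_02", "vp_03"],     # reachable viewpoints
        }


def navigate(sim, instruction, max_steps=30):
    """Observe-act loop: pick a navigable viewpoint until the episode ends."""
    obs = sim.reset(instruction)
    for _ in range(max_steps):
        action = random.choice(obs["navigable"])  # trivial policy for the sketch
        obs, done, info = sim.step(action)
        if info["human_collision"]:
            pass  # a real agent would penalize the policy or replan here
        if done:
            break
    return obs


navigate(DummyHA3DSim(), "Walk past the person cooking and stop by the couch.")
```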
Additionally, the study introduced the human-aware room-to-room (HA-R2R) dataset, an extension of the popular room-to-room (R2R) VLN dataset. HA-R2R incorporates human activity descriptions, yielding 21,567 instructions covering 145 activity types; the activities are categorized as start, obstacle, surrounding, or end based on their positions relative to the agent's starting point. Compared to R2R, HA-R2R features longer average instruction lengths and a larger vocabulary, reflecting the increased complexity and diversity of the task.
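As a rough picture of how such an entry might be organized, consider the illustrative record below; the field names, example instruction, and values are assumptions, not taken from the released dataset.

```python
# Illustrative HA-R2R-style record; all field names and values are assumptions.
example_entry = {
    "scan": "matterport_building_17",        # one of the 90 Matterport3D scenes
    "instruction": (
        "Leave the bedroom where someone is folding laundry, walk past the "
        "person talking on the phone in the hallway, and stop by the sofa."
    ),
    "path": ["vp_12", "vp_08", "vp_03"],     # sequence of navigable viewpoints
    "human_activities": [
        # category reflects where the activity sits along the route
        {"description": "folding laundry", "category": "start"},
        {"description": "talking on the phone", "category": "obstacle"},
        {"description": "watching television", "category": "surrounding"},
    ],
}
```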
Research Findings
To address the challenges of HA-VLN, the study introduced two multimodal agents designed to effectively integrate visual and linguistic information for navigation. The first agent, the expert-supervised cross-modal agent (VLN-CM), is an LSTM-based sequence-to-sequence model that learns by imitating expert demonstrations. The second agent, the non-expert-supervised decision transformer (VLN-DT), is an autoregressive transformer model that learns to navigate without expert supervision.
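The two agents differ mainly in how supervision enters training: VLN-CM imitates expert actions step by step, whereas a decision transformer conditions on a target return and predicts actions autoregressively from interleaved (return-to-go, state, action) tokens. The sketch below shows only that sequence layout with assumed toy values; it is not the authors' model code.

```python
# Schematic construction of a decision-transformer input sequence for one
# navigation episode; values and names are illustrative assumptions.

def build_dt_tokens(returns_to_go, states, actions):
    """Interleave (return-to-go, state, action) triples into one token sequence.

    At inference time the agent is conditioned on a desired return and the
    history so far, then predicts the next action token autoregressively.
    """
    tokens = []
    for rtg, s, a in zip(returns_to_go, states, actions):
        tokens.extend([("rtg", rtg), ("state", s), ("action", a)])
    return tokens


# Toy episode: three steps of an agent moving toward a goal.
episode_tokens = build_dt_tokens(
    returns_to_go=[3.0, 2.0, 1.0],              # remaining reward to collect
    states=["view_t0", "view_t1", "view_t2"],   # encoded egocentric observations
    actions=["forward", "turn_left", "stop"],
)
print(episode_tokens)
```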
The study evaluated agents on the HA-VLN task using metrics that account for both human perception and navigation performance. The outcomes revealed that HA-VLN posed a significant challenge for existing VLN agents; even after retraining, these agents struggled to match the oracle agent.
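The article does not spell out the individual metrics, so the sketch below combines standard VLN measures (navigation error and success rate with the conventional 3-meter threshold) with a hypothetical human-collision rate as a stand-in for the human-perception side; all names and values here are assumptions.

```python
import math

def navigation_error(final_pos, goal_pos):
    """Euclidean distance (in meters) between the stop point and the goal."""
    return math.dist(final_pos, goal_pos)

def success_rate(episodes, threshold=3.0):
    """Fraction of episodes stopping within `threshold` meters of the goal."""
    hits = sum(navigation_error(e["final"], e["goal"]) <= threshold for e in episodes)
    return hits / len(episodes)

def collision_rate(episodes):
    """Assumed human-aware metric: fraction of episodes with any human collision."""
    return sum(bool(e["human_collisions"]) for e in episodes) / len(episodes)

# Toy evaluation over two episodes (positions in meters).
episodes = [
    {"final": (1.0, 0.5), "goal": (1.5, 0.0), "human_collisions": 0},
    {"final": (4.0, 4.0), "goal": (0.0, 0.0), "human_collisions": 2},
]
print(success_rate(episodes), collision_rate(episodes))
```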
Furthermore, VLN-DT, trained only on random data, outperformed the VLN-CM model trained under expert supervision, showcasing VLN-DT's superior generalization ability. Finally, the study validated the agents in the real world on a quadruped robot, which demonstrated human perception and avoidance capabilities while highlighting the need for continued enhancement to better align with real-world scenarios.
Applications
The HA-VLN and HA3D have significant implications in embodied AI and Sim2Real transfer research. They can help develop and test navigation agents capable of operating in dynamic, human-populated environments such as homes, offices, hotels, and museums. These tools can also explore human-aware navigation strategies, including adaptive responses and social norms, and enhance human-robot collaboration. Additionally, they can provide valuable benchmarks and insights for advancing embodied AI and Sim2Real transfer to develop more realistic and effective VLN systems.
Conclusion
In summary, HA-VLN represents a significant advancement in embodied AI and Sim2Real research by introducing a task that reflects real-world dynamics. Although current models have limitations in replicating human behavior, HA-VLN provides a critical foundation for future advancements. Future work should focus on enhancing the simulator, integrating more realistic human avatars, and expanding the HA-VLN framework to outdoor environments, paving the way for advanced VLN systems in human-populated settings.
Journal reference:
- Preliminary scientific report.
Li, M., et al. Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions. arXiv, 2024. DOI: 10.48550/arXiv.2406.19236, https://arxiv.org/abs/2406.19236