In a recent submission to the arXiv* server, researchers proposed a system that integrates optical character recognition (OCR), augmented reality (AR), and large language models (LLMs) to enhance user performance, ensure trustworthy interactions, and reduce workload in operations and maintenance (O&M) tasks. The system utilizes a dynamic virtual environment powered by Unity, facilitating seamless interactions between the virtual and physical realms.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
AR has gained significant attention in various sectors, including gaming, healthcare, engineering, and education, due to advancements in computing power and hardware. AR-supported O&M tasks have shown promise in facilities management by overlaying digital data onto the physical environment, providing real-time information to improve task execution and decision-making.
However, challenges persist in converting context data into suitable AR visualizations, particularly in the "text-to-action" aspect of O&M tasks. The current study explores the integration of LLMs, specifically ChatGPT, to generate real-time text-to-action guidance, addressing the limitations of traditional natural language processing (NLP) methods. According to the authors, integrating ChatGPT into AR systems can make O&M procedures more efficient, more accurate, and easier to use, which they demonstrate through a prototype system and preliminary user tests.
Literature review
AR for maintenance: AR has shown potential in improving efficiency, reducing costs, enhancing safety, and increasing customer satisfaction in construction and facilities management. However, challenges such as context awareness and localization remain. Researchers have addressed these challenges by integrating computer vision techniques, point cloud data, non-vision methods, and data fusion approaches. Context understanding in AR has been improved by incorporating facility metadata from tools such as building information modeling (BIM).
ChatGPT for content understanding: NLP techniques, including text summarization, have enhanced context understanding in architecture, engineering, construction, and facilities management (AEC/FM) tasks. The emergence of LLMs like ChatGPT has changed how information is filtered and comprehended. ChatGPT's precision, transformer-based architecture, and ability to adapt to new contexts make it suitable for understanding complex O&M instructions. Its success has been demonstrated in various domains, including healthcare, education, and human-robot collaboration. The authors attribute the high-level capabilities ChatGPT exhibits to emergent properties of its large-scale training, and they argue that these features can significantly enhance context understanding in O&M tasks.
System design
The proposed system focuses on efficiency, accuracy, and real-life application compatibility. The challenges addressed include integrating OCR, ChatGPT, and a virtual environment within the Unity game engine. The system architecture revolves around Unity, facilitating interactions and data exchanges between the virtual and physical worlds. The system converts visual inputs from 2D images to text through OCR, processes them using ChatGPT, and then sends the processed data to Unity for execution.
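The flow described above (image to text through OCR, text to structured steps through the language model, steps handed to the AR layer) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Action` fields, the JSON reply format, the prompt wording, and the `fake_ocr` and `fake_llm` stand-ins are all hypothetical. In a real system the two callables would wrap an OCR engine and a ChatGPT request, and the resulting list would be sent to Unity.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    step: int        # order in which the action should be performed
    target: str      # hypothetical name of a control on the panel
    operation: str   # hypothetical verb such as "press" or "toggle"


def build_prompt(instruction_text: str) -> str:
    """Wrap OCR output in a request for machine-readable steps."""
    return (
        "Convert the maintenance instructions below into a JSON list of "
        'objects with the keys "step", "target" and "operation".\n\n'
        + instruction_text
    )


def parse_actions(llm_reply: str) -> List[Action]:
    """Parse the model reply and return the actions sorted by step."""
    items = json.loads(llm_reply)
    actions = [
        Action(int(i["step"]), str(i["target"]), str(i["operation"]))
        for i in items
    ]
    return sorted(actions, key=lambda a: a.step)


def text_to_action(image: bytes,
                   ocr: Callable[[bytes], str],
                   llm: Callable[[str], str]) -> List[Action]:
    """Image -> text (OCR) -> structured steps (LLM) -> list for the AR layer."""
    text = ocr(image)
    reply = llm(build_prompt(text))
    return parse_actions(reply)


# Stand-ins (assumptions) so the sketch runs without an OCR engine or network.
def fake_ocr(image: bytes) -> str:
    return "First press the green button, then toggle switch A."


def fake_llm(prompt: str) -> str:
    return json.dumps([
        {"step": 2, "target": "switch_A", "operation": "toggle"},
        {"step": 1, "target": "green_button", "operation": "press"},
    ])


actions = text_to_action(b"", fake_ocr, fake_llm)
for a in actions:
    print(a.step, a.target, a.operation)
```

Passing the OCR and language-model steps in as callables keeps the parsing logic testable offline, which matters when the reply format of the model has to be validated before anything is shown to the user.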
The system utilizes the mixed reality toolkit (MRTK) and HoloLens 2 to understand spatial information and register virtual items in the physical world. Interactions are managed by Unity, with hand interactions and virtual buttons. The design ensures accurate commands from ChatGPT and seamless interactions between the virtual and physical environments, enhancing performance in operations and maintenance tasks.
Experimentation and results
A case study was conducted with 15 subjects aged between 18 and 30 to evaluate the proposed system. The experiment involved two conditions: "no augmentation" and "AR and GPT." The subjects were recruited publicly, and three of them had experience with AR head-mounted displays (HMDs) and hand interactions. Motion sickness was reported by two subjects when using HoloLens 2, while the remaining participants did not experience any discomfort. The experiment used OpenAI's "gpt-4" model. The physical workspace was represented by a box-shaped control panel, while a digital twin model was created in Unity. The subjects interacted with the AR virtual scene using MRTK 2.8 and hand interactions.
Participants completed tasks in both conditions and subsequently filled out questionnaires. The complexity of the tasks was equivalent in both conditions. All participants successfully operated the virtual buttons, and their performance was evaluated based on task completion time and interaction sequence accuracy. Participants underwent a training session before the experiment to familiarize themselves with the AR system and physical control panel. The experiment was recorded, and the completion time was derived from the recorded videos.
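The article does not spell out how interaction sequence accuracy was scored. One simple reading, offered here purely as an assumption, is the fraction of positions at which the performed sequence matches the expected one. The function name and the button labels below are hypothetical.

```python
from typing import Sequence


def sequence_accuracy(expected: Sequence[str], performed: Sequence[str]) -> float:
    """Fraction of positions where the performed step equals the expected step.

    Assumed metric: extra or missing steps count as errors because the
    denominator is the longer of the two sequences.
    """
    if not expected:
        raise ValueError("expected sequence must not be empty")
    matches = sum(1 for e, p in zip(expected, performed) if e == p)
    return matches / max(len(expected), len(performed))


# Hypothetical example: the participant swaps the second and third steps.
expected = ["green", "switch_A", "red", "switch_B"]
performed = ["green", "red", "switch_A", "switch_B"]
print(sequence_accuracy(expected, performed))
```

A stricter or more lenient metric (for example, one based on edit distance) would give different numbers for the same recording, so the scoring rule needs to be fixed before conditions are compared.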
Results: The average completion time was significantly lower in the "AR and GPT" condition than in the "no augmentation" condition. The accuracy of physical interaction was high in both conditions, with slightly higher accuracy observed in the "AR and GPT" condition. ChatGPT exhibited high accuracy in processing the text content. The NASA task load index (NASA-TLX) survey showed a significant difference between the two conditions, indicating that the "AR and GPT" method reduced cognitive load. The trust evaluation survey also demonstrated a significant difference, suggesting that participants trusted the virtual prompts provided by ChatGPT and AR.
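For readers unfamiliar with these measures, the sketch below shows one common way such data are handled. NASA-TLX has six subscales, and the unweighted "raw" score is simply their mean. Because every participant completed both conditions, a paired t statistic is one plausible way to compare them. Both choices are assumptions: the article does not say which TLX variant or which statistical test the authors used, and no values from the study are reproduced here.

```python
import math
import statistics
from typing import Dict, Sequence

# The six standard NASA-TLX subscales.
TLX_SUBSCALES = (
    "mental", "physical", "temporal", "performance", "effort", "frustration",
)


def raw_tlx(ratings: Dict[str, float]) -> float:
    """Unweighted (raw) NASA-TLX score: mean of the six subscale ratings."""
    return statistics.mean(ratings[name] for name in TLX_SUBSCALES)


def paired_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t statistic for two conditions measured on the same subjects."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
```

The statistic would be compared against a t distribution with n - 1 degrees of freedom (14 for the 15 participants described above) to obtain a p-value.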
Conclusion
In summary, the study found that integrating ChatGPT into an AR system improved performance on maintenance tasks. In the user test, the "AR and GPT" condition resulted in faster completion, slightly higher accuracy, increased trust, and reduced cognitive load compared to the "no augmentation" condition. The authors suggest that these findings could improve efficiency and the user experience in other sectors as well. However, the generalizability of the study is limited to maintenance tasks, and future research should include objective measures and address technical challenges to facilitate wider implementation of ChatGPT-enabled AR.