In an article recently posted to the Meta Research website, researchers introduced Lumos, the first end-to-end multimodal question-answering system with text-understanding capabilities. Lumos integrates a scene text recognition (STR) component that extracts text from first-person images, enriching the input to a multimodal large language model (MM-LLM).
The paper addressed challenges in STR quality, latency, and model inference. It detailed the system architecture, design choices, and techniques, along with a comprehensive evaluation demonstrating high quality and efficiency.
Background
Visual question answering has attracted increased attention with recent progress in LLMs and vision-language pre-training. Industry forecasts suggest that smart assistants will soon achieve human-like understanding of scenes and images. Prior approaches often relied on MM-LLMs for text understanding in images without a standalone STR component. Typical implementations transferred high-resolution images to the cloud for processing, which introduced latency, and degraded in quality when low-resolution thumbnails were used instead.
Lumos System Architecture
The architecture of Lumos involves a streamlined process for handling multimodal queries. Upon triggering the system, the device captures and processes an image at two resolutions: 3K × 4K (full resolution) and 450 × 600 (thumbnail). Concurrently, automatic speech recognition (ASR) processes the voice query while image capture, compression, and transfer to the cloud proceed. STR begins as soon as the full-resolution image is available, with the system designed to parallelize time-consuming tasks such as STR inference and image transfer to minimize latency.
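The parallelism described above can be sketched with Python's standard concurrent.futures module. The stage functions below are illustrative placeholders, not actual Lumos APIs; the point is that STR, image transfer, and ASR overlap so total latency approaches the slowest stage rather than the sum of all three.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage functions -- placeholders, not real Lumos APIs.
def run_on_device_str(full_res_image):
    """Run scene text recognition on the 3K x 4K image."""
    return ["HELLO", "WORLD"]  # recognized words (illustrative)

def compress_and_upload(thumbnail):
    """Compress the 450 x 600 thumbnail and transfer it to the cloud."""
    return {"uploaded": True, "size": "450x600"}

def transcribe_query(audio):
    """Automatic speech recognition on the voice query."""
    return "what does this sign say?"

def handle_query(full_res_image, thumbnail, audio):
    # The three time-consuming stages run concurrently; each future's
    # result() blocks only until that particular stage has finished.
    with ThreadPoolExecutor(max_workers=3) as pool:
        str_future = pool.submit(run_on_device_str, full_res_image)
        upload_future = pool.submit(compress_and_upload, thumbnail)
        asr_future = pool.submit(transcribe_query, audio)
        return {
            "str_text": str_future.result(),
            "upload": upload_future.result(),
            "query": asr_future.result(),
        }
```

In a real implementation the upload and STR results would be joined on the cloud side; here they are simply collected into one dictionary for illustration.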
On the cloud side, a proprietary MM-LLM integrates advanced techniques to process the low-resolution thumbnail, recognized text from STR, and the user query from ASR to generate responses. Text-to-speech (TTS) then converts these responses into voice and sends them back to the user. The design carefully balances efficiency by processing text and images optimally, reflecting the constraints and requirements of on-device and cloud processing.
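A minimal sketch of how the three signals might be combined for the cloud model follows. The prompt layout is an assumption for illustration; the paper does not publish the proprietary MM-LLM's exact input format.

```python
def build_mm_llm_input(thumbnail_bytes, str_words, user_query):
    """Combine the three inputs the cloud MM-LLM receives:
    the low-resolution thumbnail, the STR text, and the ASR query.

    The textual prompt format here is an illustrative assumption.
    """
    recognized = " ".join(str_words)
    prompt = (
        f"Recognized text in the image: {recognized}\n"
        f"User question: {user_query}"
    )
    # The thumbnail is passed alongside the text prompt as the
    # visual input to the multimodal model.
    return {"image": thumbnail_bytes, "prompt": prompt}
```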
The architecture incorporates three key design choices. First, STR is performed on-device on the full-resolution image to ensure high-quality text recognition. Second, STR latency is minimized by using hardware acceleration and by running STR and image transfer in parallel. These strategies maintain efficiency despite the high computational demands of on-device STR.
Finally, the system extends to MM-LLM use cases where STR might not be necessary to answer queries. Lumos ensures flexibility and effectiveness by deferring the decision to the MM-LLM, which can handle both text-heavy and generic questions.
Although on-device STR imposes constraints on model architecture, latency, memory, and battery life, the performance of the on-device STR model remains competitive with cloud-based solutions, thanks to significant optimizations and efficient hardware use.
On-Device STR
Lumos employs a comprehensive on-device STR pipeline with four key components. Region-of-interest (ROI) detection identifies and extracts the relevant portion of the image, reducing computational cost and noise. Text detection identifies word bounding boxes, while text recognition converts these boxes into readable text.
Reading-order reconstruction organizes recognized words into coherent paragraphs based on their layout. This system addresses challenges specific to on-device STR, including hardware constraints and the variability of in-the-wild text, by using efficient models like Facebook neural architecture search v2 (FBNetv2) and techniques such as keypoint detection for ROI and curriculum learning for text recognition. The result is a robust and efficient STR pipeline that balances accuracy and performance under practical constraints.
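The reading-order step can be illustrated with a simplified layout heuristic: group words into lines by vertical position, then sort each line left to right. This is a stand-in for the paper's layout-based reconstruction, not its actual algorithm, and the tolerance value is an arbitrary assumption.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float  # left edge of the word's bounding box
    y: float  # top edge of the word's bounding box

def reconstruct_reading_order(words, line_tolerance=10.0):
    """Group words into lines by vertical proximity, then order each
    line left to right -- a simplified reading-order heuristic."""
    lines = []
    for w in sorted(words, key=lambda w: w.y):
        # Attach to the current line if vertically close to its anchor,
        # otherwise start a new line.
        if lines and abs(lines[-1][0].y - w.y) <= line_tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x))
            for line in lines]

words = [Word("world", 60, 0), Word("hello", 0, 2), Word("again", 0, 30)]
print(reconstruct_reading_order(words))  # ['hello world', 'again']
```

Real scene text adds rotation, perspective, and curved baselines, which is why the production system needs a more robust layout model than this sketch.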
Enhanced Text Recognition
In experimental evaluation, Lumos demonstrated significant improvements in two areas: end-to-end question answering, and the quality, efficiency, and hardware usage of its on-device STR solution. The experiments compared three variants of Lumos: the base MM-LLM alone, the MM-LLM with on-device STR, and the MM-LLM with STR plus positional information from the reading-order reconstruction module.
The results show that integrating on-device STR boosts question-answering (QA) accuracy from 52% to 78%, with particularly strong gains on summarization tasks. Including positional information further improves accuracy to 79.6%, highlighting the system's improved ability to handle spatial relationships between words.
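One plausible way to supply that positional information is to serialize each recognized word with coarse coordinates so the MM-LLM can reason about layout. The tagged format below is an assumption for illustration; the exact encoding Lumos uses is not published.

```python
def format_words_with_positions(words):
    """Serialize recognized words with their coordinates so a language
    model can reason about spatial layout. The word@(x,y) encoding is
    an illustrative assumption, not Lumos' published format."""
    return "; ".join(f"{w['text']}@({w['x']},{w['y']})" for w in words)

# Two stacked words from a hypothetical sign.
words = [{"text": "EXIT", "x": 120, "y": 40},
         {"text": "ONLY", "x": 120, "y": 90}]
print(format_words_with_positions(words))  # EXIT@(120,40); ONLY@(120,90)
```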
The performance metrics also indicate that Lumos' on-device STR achieves competitive word error rate (WER) scores compared to established STR systems, with the on-device model being notably efficient despite a slight trade-off in quality.
In terms of efficiency, Lumos' on-device STR solution shows impressive gains. With an export size of approximately 8MB, the model achieves up to a 9x reduction in latency and a 3x decrease in energy consumption when run on a hardware accelerator, compared to a central processing unit (CPU).
The ROI detection component significantly enhances performance by reducing image size while maintaining high word recall, and advanced techniques like data augmentation and model quantization contribute to overall efficiency. These results underscore the effectiveness of Lumos' approach in balancing high-quality text recognition with optimized performance and resource usage on edge devices.
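To give intuition for the quantization technique mentioned above, the sketch below applies symmetric int8 quantization to a list of float weights. This is a minimal pure-Python illustration of the general size-reduction idea, not Lumos' actual quantization scheme.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats into [-127, 127] using
    a single scale factor. Each value then needs 1 byte instead of the
    4 bytes of a float32, roughly a 4x size reduction."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.031]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# approx is close to the original weights, with small rounding error.
```

Production pipelines typically use per-channel scales and quantization-aware training to limit the accuracy loss, which this single-scale sketch ignores.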
Conclusion
To sum up, this paper introduced Lumos, a pioneering smart multimodal assistant with advanced text-understanding capabilities optimized for on-device deployment. The evaluation demonstrated that the hybrid approach, combining on-device STR with cloud-side MM-LLM inference, achieved superior accuracy while meeting stringent on-device constraints.
This work marks a significant step toward integrating MM-LLMs into real-world text-recognition applications. Future research will further optimize the on-device models and explore end-to-end text recognition within the MM-LLM itself.