Lumos Enhances Multimodal AI with On-Device STR

In an article recently posted to the Meta Research website, researchers introduced Lumos, the first end-to-end multimodal question-answering system with text-understanding capabilities. Lumos integrated a scene text recognition (STR) component to extract text from first-person images, enhancing input for a multimodal large language model (MM-LLM).

Study: Lumos Enhances Multimodal AI with On-Device STR. Image Credit: Ken stocker/Shutterstock.com

The paper addressed STR quality, latency, and model inference challenges. It detailed the system architecture, design choices, techniques, and a comprehensive evaluation demonstrating high quality and efficiency.

Background

Visual question answering has attracted growing attention with recent progress in LLMs and vision-language pre-training, and industry observers expect smart assistants to soon achieve a human-like understanding of scenes and images. Prior approaches often relied on the MM-LLM alone to understand text in images, without a standalone STR component. Such implementations typically transferred high-resolution images to the cloud for processing, which added latency, while falling back to low-resolution thumbnails degraded performance.

Lumos System Architecture

The architecture of Lumos involves a streamlined process for handling multimodal queries. Upon triggering the system, the device captures and processes an image at two resolutions: 3K × 4K (full resolution) and 450 × 600 (thumbnail). Concurrently, automatic speech recognition (ASR) processes the voice query while image capture, compression, and transfer to the cloud proceed. STR begins as soon as the full-resolution image is available, with the system designed to parallelize time-consuming tasks such as STR inference and image transfer to minimize latency.
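The overlap of STR inference, image transfer, and ASR described above can be sketched with a thread pool. This is a minimal illustration under stated assumptions, not the Lumos implementation: the function names, the stubbed return values, and the sleep-based delays are all hypothetical stand-ins for the real on-device and network work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_on_device_str(full_res_image: bytes) -> str:
    """Hypothetical stand-in for on-device STR on the full-resolution image."""
    time.sleep(0.1)  # simulated inference time
    return "FIRE EXIT"


def upload_thumbnail(thumbnail: bytes) -> str:
    """Hypothetical stand-in for compressing and sending the thumbnail."""
    time.sleep(0.1)  # simulated network transfer
    return "uploaded"


def transcribe_query(audio: bytes) -> str:
    """Hypothetical stand-in for ASR on the voice query."""
    time.sleep(0.1)  # simulated ASR time
    return "what does this sign say?"


def handle_request(full_res_image: bytes, thumbnail: bytes, audio: bytes) -> dict:
    # Submit the three slow steps together so they overlap instead of
    # running back to back; total wall time is roughly the slowest step.
    with ThreadPoolExecutor(max_workers=3) as pool:
        str_future = pool.submit(run_on_device_str, full_res_image)
        upload_future = pool.submit(upload_thumbnail, thumbnail)
        asr_future = pool.submit(transcribe_query, audio)
        return {
            "recognized_text": str_future.result(),
            "upload_status": upload_future.result(),
            "query": asr_future.result(),
        }


if __name__ == "__main__":
    print(handle_request(b"full-res", b"thumb", b"audio"))
```

Because each stub only waits, threads are enough to show the overlap here; a real system would more likely hand STR to a hardware accelerator and the upload to an asynchronous network stack.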

On the cloud side, a proprietary MM-LLM takes the low-resolution thumbnail, the text recognized by STR, and the user query transcribed by ASR, and generates a response. Text-to-speech (TTS) then converts the response into voice and returns it to the user. The design keeps full-resolution text recognition on the device, so the cloud model works from the thumbnail, the recognized text, and the query, reflecting the respective constraints of on-device and cloud processing.

The architecture incorporates three key design choices. First, STR is performed on-device using the full-resolution image to ensure high-quality text recognition. Second, STR latency is minimized by using hardware acceleration and by executing STR and image transfer in parallel. These strategies help maintain efficiency despite the high computational demands of on-device STR.

Third, the system also covers use cases where STR may not be needed to answer the query. Rather than deciding up front whether the recognized text is relevant, Lumos defers that decision to the MM-LLM, which can handle both text-heavy and generic questions.

Although on-device STR imposes constraints on model architecture, latency, memory, and battery life, the performance of the on-device STR model remains competitive with cloud-based solutions, thanks to significant optimizations and efficient hardware use.

On-Device STR

Lumos employs an on-device STR pipeline with four components. Region-of-interest (ROI) detection identifies and crops the relevant portion of the image, reducing computational cost and noise. Text detection locates word bounding boxes, and text recognition converts the contents of each box into text.

Reading-order reconstruction organizes the recognized words into coherent paragraphs based on their layout. The pipeline addresses challenges specific to on-device STR, including hardware constraints and the variability of in-the-wild text, by using efficient models such as FBNetv2 (a family of compact networks found through neural architecture search) and techniques such as keypoint detection for ROI and curriculum learning for text recognition. The result is an STR pipeline that balances accuracy and performance under practical constraints.
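To make the reading-order step concrete, here is a minimal sketch that groups word boxes into lines and orders each line left to right. It is a deliberately simple heuristic and an assumption for illustration only: the `Word` type, the half-height line threshold, and the line-level (rather than paragraph-level) grouping are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x: float  # left edge of the bounding box
    y: float  # top edge of the bounding box
    w: float  # box width
    h: float  # box height

    @property
    def y_center(self) -> float:
        return self.y + self.h / 2


def reconstruct_reading_order(words: List[Word]) -> List[str]:
    """Group words into text lines (top to bottom), then sort each line by x."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda wd: wd.y_center):
        # A word joins the current line if its vertical center is within
        # half its height of the last word placed on that line.
        if lines and abs(word.y_center - lines[-1][-1].y_center) <= 0.5 * word.h:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(wd.text for wd in sorted(line, key=lambda wd: wd.x))
        for line in lines
    ]


if __name__ == "__main__":
    detected = [
        Word("EXIT", x=120, y=10, w=80, h=20),
        Word("FIRE", x=10, y=12, w=80, h=20),
        Word("ONLY", x=90, y=50, w=70, h=20),
        Word("STAFF", x=10, y=48, w=70, h=20),
    ]
    print(reconstruct_reading_order(detected))
```

A heuristic like this breaks down on rotated, curved, or multi-column text, which is part of why a dedicated reconstruction module is needed for in-the-wild images.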

Enhanced Text Recognition

The experimental evaluation covers two areas: end-to-end question answering, and the quality, efficiency, and hardware usage of the on-device STR solution. The question-answering experiments compared three variants of Lumos: the MM-LLM alone, the MM-LLM with on-device STR, and the MM-LLM with on-device STR plus positional information from the reading-order reconstruction module.

The results show that integrating on-device STR raises question-answering (QA) accuracy from 52% to 78%, with summarization tasks benefiting in particular. Including positional information further improves accuracy to 79.6%, indicating that the system better handles spatial relationships between words.

The performance metrics also indicate that Lumos' on-device STR achieves competitive word error rate (WER) scores compared to established STR systems, with the device model being notably efficient despite a slight trade-off in quality.
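For readers unfamiliar with the metric, the sketch below computes WER in its common form: the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words. The paper's exact evaluation protocol, such as text normalization or how words are matched across an image, may differ, so treat this as the textbook definition rather than the authors' scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,  # deletion of a reference word
                    curr[j - 1] + 1,  # insertion of a hypothesis word
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("no parking after 6 pm", "no parking after 8 pm"))
```

Lower is better, and because insertions count as errors the value can exceed 1.0 when the hypothesis is much longer than the reference.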

In terms of efficiency, Lumos' on-device STR solution shows substantial gains. With an export size of approximately 8 MB, the model achieves up to a 9x reduction in latency and a 3x decrease in energy consumption when run on a hardware accelerator rather than on a central processing unit (CPU).

The ROI detection component significantly enhances performance by reducing image size while maintaining high word recall, and advanced techniques like data augmentation and model quantization contribute to overall efficiency. These results underscore the effectiveness of Lumos' approach in balancing high-quality text recognition with optimized performance and resource usage on edge devices.
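The article does not say which quantization scheme Lumos uses, so the following is only a generic illustration of the idea: symmetric 8-bit quantization of a list of weights, which stores each value as a small integer plus one shared scale factor. The function names and the per-tensor scale are assumptions made for this sketch.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    # Each restored value differs from the original by at most scale / 2.
    print(q, scale, restored)
```

Storing one byte per weight instead of a 32-bit float cuts model size roughly fourfold, at the cost of a rounding error bounded by half the scale for each weight.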

Conclusion

To sum up, this paper introduced Lumos as a pioneering smart multimodal assistant with advanced text understanding capabilities optimized for device compatibility. The evaluation demonstrated that the hybrid approach, combining on-device STR with on-cloud MM-LLM, achieved superior accuracy and met all stringent on-device requirements.

This work marked a significant advancement in integrating MM-LLMs for real-world text recognition applications. Future research will optimize on-device models and explore end-to-end text recognition with MM-LLM.


Written by

Silpaja Chandrasekar


Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Chandrasekar, Silpaja. (2024, August 28). Lumos Enhances Multimodal AI with On-Device STR. AZoAi. Retrieved on January 14, 2025 from https://www.azoai.com/news/20240828/Lumos-Enhances-Multimodal-AI-with-On-Device-STR.aspx.

