In an article recently submitted to the arXiv* server, researchers explored Large Language Models (LLMs), which have brought substantial progress to multimodal comprehension. Nevertheless, existing sophisticated approaches have remained limited in how fully they harness the extensive representational capacity and abundant world knowledge inherent in these pre-trained models.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Furthermore, the valuable interrelationships between tasks in text-rich contexts had not been thoroughly explored. This paper introduced UniDoc, an innovative multimodal model with integrated text detection and recognition abilities, addressing a gap in current approaches. Notably, UniDoc leveraged the beneficial interactions among these tasks to improve the effectiveness of each one.
Background
Recently, the realm of LLMs has seen remarkable progress, encompassing models like Chat Generative Pre-trained Transformer (ChatGPT), BigScience Large Open-science Open-access Multilingual Language Model (BLOOM), and Large Language Model Meta AI (LLaMA). These models show exceptional potential in diverse linguistic applications and contribute to the pursuit of artificial general intelligence (AGI). Large multimodal model (LMM) counterparts like Bootstrapping Language-Image Pre-training (BLIP), MiniGPT-4, Large Language and Vision Assistant (LLaVA), and mPLUG-Owl extend this progress to visual and linguistic understanding.
While their zero-shot multimodal skills are impressive, these models still struggle with text-rich images. To address this issue, LLaVAR introduces text recognition pre-training, and mPLUG-DocOwl focuses on document image understanding. However, the broader potential of large pre-trained models remains largely untapped.
Related work
Previous research has delved into instruction fine-tuning and multimodal instruction fine-tuning, enhancing LLMs through alignment with human intent for improved multi-task learning and generalization. GPT-3 and Alpaca exemplify this, while Vicuna attains near-ChatGPT-level performance by fine-tuning LLaMA on user-shared conversations. Among LMMs, integrating language models with visual encoders for joint text and vision tasks is a prominent trend, seen in models such as MiniGPT-4, LLaVA, and mPLUG-Owl.
Despite innovations such as InstructBLIP's instruction dataset and X-LLM's adapters, the potential of these large pre-trained models remains largely untapped. Additionally, advances such as LLaVAR's text recognition pre-training and mPLUG-DocOwl's document dataset address text-rich comprehension, but comprehensive text detection, recognition, and spotting abilities remain incomplete, missing the potential benefits of learning these tasks in tandem.
Proposed method
The paper depicts the architecture of UniDoc, which processes a given RGB image I and a natural language instruction Q. Using the Contrastive Language-Image Pre-training Vision Transformer (CLIP-ViT)-L/14 as the visual encoder, UniDoc extracts visual features from I, combines them with textual cues from I and Q, and feeds this combined information into Vicuna. Vicuna, a large language model fine-tuned on instruction-following data, generates contextually appropriate responses, utilizing the visual embeddings as a soft prompt for language generation.
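The data flow described above can be summarized in a minimal PyTorch sketch. The tensor shapes (256 patch tokens, a 1,024-dimensional CLIP feature, a 4,096-dimensional Vicuna embedding space) are illustrative assumptions rather than the authors' exact configuration, and random stand-in tensors replace the actual encoder and decoder.

```python
import torch
import torch.nn as nn

batch, n_patches, d_clip = 2, 256, 1024   # CLIP-ViT-L/14 patch features (assumed shapes)
seq_len, d_llm, vocab = 32, 4096, 32000   # Vicuna-style decoder dimensions (assumed)

# 1) Frozen visual encoder output: patch embeddings for image I (random stand-in here).
visual_feats = torch.randn(batch, n_patches, d_clip)

# 2) Linear projector aligning visual features with the LLM embedding space.
projector = nn.Linear(d_clip, d_llm)
visual_tokens = projector(visual_feats)            # [2, 256, 4096]

# 3) Instruction Q embedded with the LLM's own token embedding table (stand-in).
token_embedding = nn.Embedding(vocab, d_llm)
instruction_ids = torch.randint(0, vocab, (batch, seq_len))
text_tokens = token_embedding(instruction_ids)     # [2, 32, 4096]

# 4) Visual embeddings act as a soft prompt: prepend them to the instruction tokens.
llm_inputs = torch.cat([visual_tokens, text_tokens], dim=1)   # [2, 288, 4096]
# In the full model, `llm_inputs` would be passed to Vicuna (for example via the
# `inputs_embeds` argument of a Hugging Face causal LM) to generate the response.
print(llm_inputs.shape)
```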
The training process is divided into two stages, both centered on unified multimodal instruct tuning. In the pre-training stage, the visual and language models remain frozen, and only the linear projector that aligns their features is trained. This stage employs instruction-following data covering text detection, recognition, spotting, and image captioning tasks. In the fine-tuning stage, both the large language model and the projector are unfrozen, and additional tasks targeting advanced semantic comprehension of text-rich images are introduced. This unified methodology strengthens the model's understanding of text-rich scenarios, culminating in comprehensive recognition and comprehension abilities.
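A hedged sketch of how such a two-stage schedule could be configured is shown below; the module and function names are hypothetical, and only the freezing pattern follows the description in the text.

```python
def set_trainable(module, flag):
    """Enable or disable gradient updates for all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(stage, visual_encoder, projector, llm):
    if stage == "pretrain":
        # Stage 1: only the linear projector is updated.
        set_trainable(visual_encoder, False)
        set_trainable(llm, False)
        set_trainable(projector, True)
    elif stage == "finetune":
        # Stage 2: projector and LLM are unfrozen; the visual encoder is assumed
        # to stay frozen, as only the LLM and projector are described as unfrozen.
        set_trainable(visual_encoder, False)
        set_trainable(llm, True)
        set_trainable(projector, True)
    # Parameters that would be handed to the optimizer for this stage.
    return [p for m in (visual_encoder, projector, llm)
            for p in m.parameters() if p.requires_grad]
```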
Experimental results
To train UniDoc, a large-scale multimodal instruction-following dataset is constructed, comprising pre-training and fine-tuning subsets. The pre-training data include 595K natural scene images with captions from the CC3M dataset and 600K image-text pairs generated from PowerPoint presentations; PowerPoint files are chosen because they offer diverse visual elements and legible text. To ensure quality, slides with small-sized text are excluded, and text and box annotations are extracted using an in-house OCR tool. Instructions are categorized into text detection, recognition, and understanding, each with diverse phrasings generated by GPT-4. The fine-tuning data use 16K images from the LAION-5B dataset paired with OCR instruction-following data, and an additional 150K OCR instruction samples are incorporated and divided into detection, recognition, and spotting categories.
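For illustration, the instruction-following samples described above might be organized along the following lines; the field names, file names, and example responses are assumptions for this sketch rather than records from the actual dataset.

```python
# Hypothetical structure of instruction-following samples (illustrative only).
samples = [
    {
        "image": "ppt_000123.png",
        "task": "detection",
        "instruction": "Locate all the text regions in this image.",
        "response": "[(x1, y1, x2, y2), ...]",   # box annotations from the OCR tool
    },
    {
        "image": "ppt_000123.png",
        "task": "recognition",
        "instruction": "Read the text contained in the marked region.",
        "response": "Quarterly sales overview",
    },
    {
        "image": "laion_004567.png",
        "task": "understanding",
        "instruction": "What event does this poster advertise?",
        "response": "A charity concert taking place on June 12.",
    },
]
```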
During training, UniDoc follows a one-cycle learning rate policy with rates of 1e-3 and 1e-5 for pre-training and fine-tuning, respectively. The batch sizes are 128 and 32, using the AdamW optimizer on eight A100 GPUs, with each stage lasting one epoch. In evaluation, UniDoc excels at text detection, recognition, and multimodal understanding, measured by the F-score for detection and accuracy for recognition and question answering. Comparative analysis with existing large multimodal models shows UniDoc's superior performance across a range of benchmarks. Ablation studies underline the value of the text detection, recognition, and spotting tasks in both pre-training and fine-tuning, as well as the influence of the detection-task formulation and the instruction template type on UniDoc's efficacy.
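The reported training setup and the detection metric can be restated compactly as below; the dictionary keys and the count-based F-score helper are illustrative assumptions, with only the numerical settings taken from the text.

```python
# Training configuration as reported (structure of the dict is an assumption).
train_config = {
    "pretrain": {"lr": 1e-3, "batch_size": 128, "epochs": 1},
    "finetune": {"lr": 1e-5, "batch_size": 32, "epochs": 1},
    "optimizer": "AdamW",
    "lr_schedule": "one-cycle",
    "hardware": "8x A100",
}

def f_score(num_correct, num_pred, num_gt):
    """F-score from counts of correctly matched, predicted, and ground-truth text boxes."""
    precision = num_correct / num_pred if num_pred else 0.0
    recall = num_correct / num_gt if num_gt else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(round(f_score(num_correct=85, num_pred=100, num_gt=95), 3))  # 0.872
```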
Conclusion
To summarize, this paper presents UniDoc, a powerful large multimodal model that excels in tasks such as text detection, recognition, spotting, and understanding. By adopting a unified multimodal instruct tuning strategy, UniDoc leverages the interplay among these text-based tasks to extend its capabilities beyond existing models. The creation of a substantial multimodal instruction-following dataset underpins UniDoc's performance, which surpasses that of other models on multiple benchmarks. While certain limitations persist, such as the inability to extract fine-grained visual details and constraints on input image resolution, future work aims to address these issues.