In an article recently submitted to the arXiv* server, researchers introduced a novel multitask retrieval-augmented vision-language network (RAVEN) framework. Unlike existing methods, RAVEN enhances base vision-language models (VLMs) through efficient, task-specific fine-tuning without requiring additional retrieval-specific parameters.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
The model achieved significant performance gains by integrating retrieval-augmented samples: for image captioning, it improved the consensus-based image description evaluation (CIDEr) score by +1 on Microsoft Common Objects in Context (MSCOCO) and by +4 on NoCaps. It also boosted accuracy by nearly +3% on specific visual question answering (VQA) question types. These results underscored the efficacy of applying retrieval-augmented generation (RAG) approaches to VLMs, marking a stride toward more efficient and accessible multimodal learning.
Related Work
Past work in VLMs has integrated visual and textual data for tasks like image captioning and classification. Early frameworks such as OFA and the generative image-to-text transformer (GIT) were followed by models that augment large language models with visual encoders. In natural language processing (NLP), retrieval augmentation, beginning with the k-nearest neighbors language model (kNN-LM), has expanded to large corpora such as Wikipedia, improving performance on knowledge-intensive tasks like question answering.
Multimodal Retrieval Framework
The proposed RAVEN framework integrates multimodal image and text inputs through a process that begins with a retriever accessing relevant image-text pairs from external memory. This system uses the Facebook artificial intelligence similarity search (FAISS) library for efficient semantic search, leveraging an image encoder to score query-image similarities and retrieve the top-k image-text pairs via maximum inner product search (MIPS) from the LAION-5B dataset. Near duplicates are excluded to ensure relevance and diversity, and retrieved samples are mapped to captions for consistency.
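To make the retrieval step concrete, the minimal Python sketch below builds a FAISS inner-product index over image embeddings and retrieves top-k image-text pairs while filtering near-duplicates. The embedding dimension, placeholder data, and duplicate threshold are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch (not the authors' code): a FAISS index over image
# embeddings queried via maximum inner product search (MIPS).
import numpy as np
import faiss

d = 512                                                            # assumed embedding dimension
memory_embeddings = np.random.rand(10_000, d).astype("float32")    # stand-in for encoded memory images
memory_captions = [f"caption {i}" for i in range(10_000)]          # paired captions / alt text

# Normalize so inner product equals cosine similarity, then build a MIPS index.
faiss.normalize_L2(memory_embeddings)
index = faiss.IndexFlatIP(d)
index.add(memory_embeddings)

def retrieve(query_embedding: np.ndarray, k: int = 5, dup_threshold: float = 0.95):
    """Return up to k (score, caption) pairs, skipping near-duplicates of the query."""
    q = query_embedding.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k * 2)          # over-fetch, then filter
    results = []
    for score, idx in zip(scores[0], ids[0]):
        if score >= dup_threshold:                # near-duplicate of the query image
            continue
        results.append((float(score), memory_captions[idx]))
        if len(results) == k:
            break
    return results
```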
RAVEN employs a multitask encoder-decoder VLM architecture, combining ResNet for image encoding and byte-pair encoding (BPE) for text tokenization. This framework uses a unified vocabulary encompassing linguistic and visual tokens within a transformer backbone. Enhanced with head scaling, layer normalization, and separate position embeddings for text and images, the model supports tasks like image captioning and VQA through sequence-to-sequence (Seq2Seq) generation.
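The following PyTorch sketch illustrates the general encoder-decoder idea described above: ResNet features are projected into the same embedding space as BPE text tokens and processed by a single transformer that decodes text. All sizes, layer counts, and names here are illustrative assumptions; this is not RAVEN's actual architecture.

```python
# Minimal sketch of a unified multimodal Seq2Seq model (assumptions throughout).
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TinyMultimodalSeq2Seq(nn.Module):
    def __init__(self, vocab_size=50_000, d_model=512):
        super().__init__()
        backbone = resnet50(weights=None)
        self.image_encoder = nn.Sequential(*list(backbone.children())[:-2])  # keep spatial grid
        self.image_proj = nn.Linear(2048, d_model)        # map ResNet channels to token dimension
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True,
                                          num_encoder_layers=2, num_decoder_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, text_ids, decoder_ids):
        feats = self.image_encoder(images)                               # (B, 2048, H', W')
        img_tokens = self.image_proj(feats.flatten(2).transpose(1, 2))   # (B, H'*W', d)
        txt_tokens = self.token_emb(text_ids)                            # (B, T, d): prompt + retrieved context
        encoder_inputs = torch.cat([img_tokens, txt_tokens], dim=1)      # unified image-text sequence
        decoded = self.transformer(encoder_inputs, self.token_emb(decoder_ids))
        return self.lm_head(decoded)                                     # per-position vocabulary logits

model = TinyMultimodalSeq2Seq()
logits = model(torch.randn(1, 3, 224, 224),
               torch.randint(0, 50_000, (1, 16)),
               torch.randint(0, 50_000, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 50000])
```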
Retrieval mechanisms within RAVEN significantly enhance vision-language task performance by providing crucial contextual information and mitigating biases from the training data. The framework uses OFA as its backbone owing to its multitask integration, open-source adaptability, and manageable parameter count of 182 million (182M).
Rather than adding trainable parameters as recent multimodal large language models (MLLMs) do, RAVEN integrates retrieval capabilities directly within the encoder-decoder framework, underscoring its versatility and applicability across various multimodal tasks.
Retrieval-Augmented Fine-Tuning
The approach's performance was thoroughly evaluated by fine-tuning on diverse image captioning and VQA benchmarks. The primary focus was to highlight the advantages of integrating retrieval augmentation, where relevant knowledge is retrieved from a large external database and used during fine-tuning. This retrieval process involved mapping down from LAION-5B to the LAION-COCO 600M subset, ensuring that retrieved samples were diverse and aligned with the style of the target datasets.
For the training setup, datasets like the MSCOCO 2014 Karpathy splits for image captioning and VQA v2 augmented with VG-QA questions were used. Notably, the team maintained a strict non-overlapping policy between the fine-tuning datasets and the external memory to assess the impact of retrieval augmentation in practical scenarios. Handling missing samples caused by retrieval mismatches or download failures was also crucial, ensuring robustness in inference scenarios where retrieved context might be absent.
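A simple way to picture how retrieved context is combined with a training sample, and how missing retrievals can be handled gracefully, is the hypothetical helper below; the prompt format, field names, and context limit are assumptions for illustration, not the paper's exact template.

```python
# Illustrative sketch: concatenating retrieved captions with the original
# prompt, falling back to the plain prompt when retrieval is missing.
from typing import Optional

def build_augmented_input(task_prompt: str,
                          retrieved_captions: Optional[list[str]] = None,
                          max_context: int = 3) -> str:
    """Prepend up to `max_context` retrieved captions to the task prompt."""
    if not retrieved_captions:            # retrieval mismatch or download failure
        return task_prompt                # the model must still work without context
    context = " ".join(retrieved_captions[:max_context])
    return f"context: {context} {task_prompt}"

print(build_augmented_input("what does the image describe?",
                            ["a dog catching a frisbee", "a dog playing in a park"]))
print(build_augmented_input("what does the image describe?", None))
```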
Implementation-wise, the approach leveraged contrastive language-image pre-training (CLIP) for image encoding and FAISS for efficient MIPS-based top-50 retrieval from LAION-5B. The retrieved samples, including captions and alt text, were concatenated with the original samples during fine-tuning. A lightweight OFA-based model with 182M parameters was employed, optimized with cross-entropy loss, and beam search was used during inference to enhance generation quality.
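As a hedged example of the CLIP-based query step, the sketch below extracts a normalized image embedding with Hugging Face Transformers; the specific checkpoint ("openai/clip-vit-base-patch32") and preprocessing are assumptions, and the resulting vector would serve as the MIPS query against an index like the one sketched earlier.

```python
# Illustrative sketch: CLIP image embedding as a retrieval query vector.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), color="white")   # placeholder for a real query image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    query = model.get_image_features(**inputs)        # (1, 512) image embedding
query = query / query.norm(dim=-1, keepdim=True)      # normalize for cosine / MIPS search

# `query.numpy()` would then be passed to a FAISS index for top-50 retrieval,
# and the retrieved captions concatenated with the original sample before
# fine-tuning with cross-entropy loss.
print(query.shape)
```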
In the evaluation setup, several baselines were established to benchmark RAVEN against retrieval-only scenarios and models fine-tuned without retrieval augmentation. For both captioning and VQA tasks, significant improvements over non-retrieval baselines were observed, validating the efficacy of the approach. In particular, metrics such as CIDEr for captioning and accuracy for VQA showed the model's robustness and its advantage over similarly sized models.
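For reference, VQA accuracy is typically computed as acc = min(number of annotators giving the answer / 3, 1). The snippet below implements a simplified version of that metric; the official evaluation additionally normalizes answers and averages over annotator subsets, and the paper's exact harness is not shown here.

```python
# Simplified sketch of the standard VQA accuracy metric.
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Score one prediction against the ten human answers of a VQA v2 question."""
    matches = sum(a.strip().lower() == predicted.strip().lower() for a in human_answers)
    return min(matches / 3.0, 1.0)

answers = ["red"] * 7 + ["dark red"] * 3
print(vqa_accuracy("red", answers))       # 1.0 (seven annotators agree)
print(vqa_accuracy("dark red", answers))  # 1.0 (three annotators agree)
```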
Finally, the qualitative analysis provided insights into RAVEN's strengths and limitations. Successful examples illustrated how the model effectively leveraged retrieved knowledge to generate accurate captions and answer complex questions. However, challenges were noted, particularly in cases where the retrieved context was insufficient or noisy, impacting performance in specific VQA scenarios. These findings underscored the importance of carefully handling and selecting retrieved data to maximize model performance across varied tasks and datasets.
Conclusion
To sum up, the proposed retrieval augmentation framework addresses challenges posed by escalating model size and computational demands. A multitask, multimodal retrieval-augmented VLM demonstrated adaptability across various tasks through efficient task-specific fine-tuning.
Leveraging concatenated multimodal retrieval-augmented samples from external non-overlapping memory, a single model acquired robust retrieval properties without adding trainable parameters. This unified approach showcased significant benefits in both captioning and VQA tasks. Extensive ablations across text, image, and image-text modalities systematically compared against non-retrieved baselines provided valuable insights.
Journal reference:
- Preliminary scientific report.
Rao, V. N., et al. (2024). RAVEN: Multitask Retrieval Augmented Vision-Language Learning. arXiv. DOI: 10.48550/arXiv.2406.19150, https://arxiv.org/abs/2406.19150