RAVEN Boosts Vision-Language Tasks with Retrieval Augmentation

In an article recently submitted to the arXiv* server, researchers introduced RAVEN, a novel multitask retrieval-augmented vision-language framework. Unlike existing methods, RAVEN enhances base vision-language models (VLMs) through efficient, task-specific fine-tuning without requiring additional retrieval-specific parameters.

Study: RAVEN Boosts Vision-Language Tasks with Retrieval Augmentation. Image Credit: Thanadon88/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

The model achieved significant performance gains by integrating retrieval-augmented samples: for image captioning, it improved the Consensus-based Image Description Evaluation (CIDEr) score by +1 on Microsoft Common Objects in Context (MSCOCO) and by +4 on NoCaps, and it boosted accuracy by nearly 3% on specific visual question answering (VQA) question types. These results underscored the efficacy of applying retrieval-augmented generation (RAG) approaches to VLMs, marking a stride toward more efficient and accessible multimodal learning.

Related Work

Past work on VLMs has integrated visual and textual data for tasks like image captioning and classification. Early frameworks such as One For All (OFA) and the Generative Image-to-text Transformer (GIT) were followed by models that augment large language models with visual encoders. In natural language processing (NLP), retrieval augmentation, starting with the k-nearest neighbors language model (kNN-LM), has expanded to draw on large corpora like Wikipedia, boosting performance on knowledge-intensive tasks such as question answering.

Multimodal Retrieval Framework

The proposed RAVEN framework integrates multimodal image and text inputs through a process that begins with a retriever accessing relevant image-text pairs from external memory. The system uses the Facebook AI Similarity Search (FAISS) library for efficient semantic search, leveraging an image encoder to score query-image similarities and retrieve the top-k image-text pairs via maximum inner product search (MIPS) from the LAION-5B dataset. Near duplicates are excluded to ensure relevance and diversity, and retrieved samples are mapped to captions for consistency.
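This retrieval step can be sketched in a few lines of Python. The snippet below is an illustrative approximation rather than the authors' implementation: it assumes precomputed, L2-normalized CLIP image embeddings and paired captions for the external memory (the file names are hypothetical), builds a FAISS inner-product index so that search corresponds to MIPS, and filters near-duplicate hits by similarity score.

```python
import numpy as np
import faiss  # pip install faiss-cpu

# Illustrative inputs (not the authors' data layout): L2-normalized CLIP image
# embeddings for the external memory, plus their paired captions.
memory_embeddings = np.load("memory_clip_embeddings.npy").astype("float32")
with open("memory_captions.txt", encoding="utf-8") as f:
    memory_captions = f.read().splitlines()

dim = memory_embeddings.shape[1]
index = faiss.IndexFlatIP(dim)   # inner product on normalized vectors = MIPS
index.add(memory_embeddings)

def retrieve(query_embedding, k=50, dup_threshold=0.95):
    """Return up to k (caption, score) pairs, skipping near-duplicate hits."""
    query = query_embedding.astype("float32").reshape(1, -1)
    scores, ids = index.search(query, k)
    hits = []
    for score, idx in zip(scores[0], ids[0]):
        if idx < 0 or score >= dup_threshold:  # -1 means no hit; very high score ~ near-duplicate
            continue
        hits.append((memory_captions[idx], float(score)))
    return hits
```

The duplicate threshold and top-k value here are placeholders; the key point is that an exact inner-product index over normalized embeddings implements the MIPS retrieval the paper describes.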

RAVEN employs a multitask encoder-decoder VLM architecture, combining a ResNet image encoder with byte-pair encoding (BPE) for text tokenization. The framework uses a unified vocabulary encompassing linguistic and visual tokens within a transformer backbone. Enhanced with head scaling, layer normalization, and separate position embeddings for text and images, the model supports tasks like image captioning and VQA through sequence-to-sequence (Seq2Seq) generation.
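To make the unified sequence-to-sequence idea concrete, the sketch below shows how captioning and VQA can both be cast as (image, source text) to target-text pairs for a single encoder-decoder model. The prompt wording and field names are illustrative assumptions, not taken verbatim from the paper.

```python
from dataclasses import dataclass

@dataclass
class Seq2SeqExample:
    """One training example for a unified multitask encoder-decoder VLM."""
    image_path: str   # image features come from the vision encoder (e.g., ResNet)
    source_text: str  # BPE-tokenized instruction or question fed to the encoder
    target_text: str  # text the decoder is trained to generate

def caption_example(image_path, caption):
    # Captioning: a fixed instruction as source, the caption as target.
    return Seq2SeqExample(image_path, "what does the image describe?", caption)

def vqa_example(image_path, question, answer):
    # VQA: the question is the source, the answer is the target.
    return Seq2SeqExample(image_path, question, answer)

batch = [
    caption_example("images/0001.jpg", "a dog catching a frisbee in the park"),
    vqa_example("images/0002.jpg", "what color is the bus?", "yellow"),
]
```

Because both tasks share the same example structure and vocabulary, one model can be fine-tuned on them jointly without task-specific heads.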

Retrieval mechanisms within RAVEN significantly enhance vision-language task performance by providing crucial contextual information and mitigating biases in the training data. RAVEN uses OFA as its backbone because of its multitask integration, open-source adaptability, and manageable size of 182 million parameters.

By avoiding the additional trainable parameters introduced by recent multimodal large language models (MLLMs), RAVEN integrates retrieval capabilities directly within the encoder-decoder framework, underscoring its versatility and applicability across various multimodal tasks.

Retrieval-Augmented Fine-Tuning

The approach's performance was thoroughly evaluated through fine-tuning on diverse image captioning and VQA benchmarks. The primary focus was to highlight the advantages of integrating retrieval augmentation, where relevant knowledge is retrieved from a large external database and utilized during fine-tuning. This retrieval process involved mapping retrieved samples from LAION-5B down to the LAION-COCO (600M) subset, ensuring that they were diverse and aligned with the style of the target datasets.
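One way to picture this mapping step, under assumed data structures: each retrieved LAION-5B hit is joined by its sample URL to LAION-COCO's COCO-style captions, and hits without a match are dropped so the retrieved text stays stylistically aligned with the target datasets.

```python
def map_to_laion_coco(retrieved_urls, laion_coco_captions):
    """Map retrieved LAION-5B sample URLs to LAION-COCO captions.

    `laion_coco_captions` is assumed to be a dict {sample_url: coco_style_caption};
    retrieved hits with no LAION-COCO entry are discarded.
    """
    mapped = []
    for url in retrieved_urls:
        caption = laion_coco_captions.get(url)
        if caption is not None:
            mapped.append(caption)
    return mapped

# Toy example:
captions = {"https://example.com/a.jpg": "a red bus parked on a city street"}
print(map_to_laion_coco(["https://example.com/a.jpg", "https://example.com/b.jpg"], captions))
```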

For the training setup, datasets such as the MSCOCO 2014 Karpathy splits for image captioning and VQA v2 augmented with VG-QA questions were utilized. Notably, the team maintained a strict non-overlap policy between the fine-tuning datasets and the external memory to assess the impact of retrieval augmentation in practical scenarios. Handling samples missing due to retrieval mismatches or download failures was also crucial, ensuring robustness in inference scenarios where retrieved context might be absent.
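A minimal sketch of these two bookkeeping steps, using hypothetical identifiers and a hypothetical retrieval cache rather than the authors' actual pipeline: filtering the external memory so it never overlaps the fine-tuning splits, and degrading gracefully when a sample's retrieval results are missing.

```python
def filter_memory(memory_ids, finetune_ids):
    """Drop external-memory entries whose IDs also appear in the fine-tuning
    datasets, enforcing the strict non-overlap policy."""
    banned = set(finetune_ids)
    return [mid for mid in memory_ids if mid not in banned]

def get_retrieved_context(retrieval_cache, sample_id, max_items=3):
    """Look up cached retrieval results for a sample; return an empty list when
    the lookup fails (retrieval mismatch or download failure) so training and
    inference still work without retrieved context."""
    return retrieval_cache.get(sample_id, [])[:max_items]
```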

Implementation-wise, the approach leveraged Contrastive Language-Image Pre-training (CLIP) for image encoding and FAISS for efficient, MIPS-based top-50 retrieval from LAION-5B. The retrieved samples, including captions and alt text, were concatenated with the original samples during fine-tuning. A lightweight OFA-based model with 182M parameters was employed, optimized with cross-entropy loss, and beam search was used at inference to enhance generation quality.
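The concatenation itself is simple string construction. The sketch below, with hypothetical function and field names, shows how retrieved captions and alt text might be prepended to the original instruction or question before BPE tokenization; it is an illustration of the idea, not the authors' exact formatting.

```python
def build_augmented_source(original_text, retrieved_texts, max_retrieved=3):
    """Prepend retrieved captions/alt text to the original source text.

    Falls back to the plain source text when nothing was retrieved, mirroring
    the need for robustness when retrieved context is absent.
    """
    kept = [t.strip() for t in retrieved_texts[:max_retrieved] if t and t.strip()]
    if not kept:
        return original_text
    return " ".join(kept) + " " + original_text

# Example:
print(build_augmented_source(
    "what color is the bus?",
    ["a yellow school bus on a street", "school bus photo"],
))
```

During fine-tuning, this augmented string, together with the image features, forms the encoder input; the decoder is trained with cross-entropy against the target text, and beam search generates outputs at inference.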

In the evaluation setup, several baselines were established to benchmark RAVEN against retrieval-only scenarios and models fine-tuned without retrieval augmentation. For both captioning and VQA tasks, significant improvements over non-retrieval baselines were observed, validating the efficacy of the approach. On metrics such as CIDEr for captioning and accuracy for VQA, the model proved robust and outperformed similarly sized models.

Finally, the qualitative analysis provided insights into RAVEN's strengths and limitations. Successful examples illustrated how the model effectively leveraged retrieved knowledge to generate accurate captions and answer complex questions. However, challenges were noted, particularly in cases where the retrieved context was insufficient or noisy, impacting performance in specific VQA scenarios. These findings underscored the importance of carefully handling and selecting retrieved data to maximize model performance across varied tasks and datasets.

Conclusion

To sum up, the proposed retrieval augmentation framework addresses challenges posed by escalating model size and computational demands. A multitask, multimodal retrieval-augmented VLM demonstrated adaptability across various tasks through efficient task-specific fine-tuning.

Leveraging concatenated multimodal retrieval-augmented samples from an external, non-overlapping memory, a single model acquired robust retrieval properties without adding trainable parameters. This unified approach showed significant benefits in both captioning and VQA tasks. Extensive ablations across text, image, and image-text retrieval modalities, systematically compared against non-retrieval baselines, provided valuable insights.


Journal reference:
  • Preliminary scientific report. Rao, V. N., et al. (2024). RAVEN: Multitask Retrieval Augmented Vision-Language Learning. ArXiv. DOI: 10.48550/arXiv.2406.19150, https://arxiv.org/abs/2406.19150

Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.

