In an article posted on the arXiv* preprint server, researchers explored an innovative method for improving the performance of multi-modal large language models (MLLMs), which combine textual and visual interpretation capabilities. They proposed a preference alignment framework that mitigates the degradation of language instruction-following skills caused by visual instruction tuning. Moreover, they demonstrated that their method can lift the MLLM's performance beyond that of its underlying language model on various benchmarks.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
MLLMs are artificial intelligence (AI) systems that can process and generate both text and images, enabling a wide range of applications such as visual question answering, image captioning, and dialogue generation. However, blending modalities in one system is non-trivial: integrating different data forms can create conflicts between internal representations and raises an issue known as catastrophic forgetting. This phenomenon refers to the loss of previously learned knowledge or skills due to interference from new information or tasks.
A common way to train MLLMs is to fine-tune them on visual instruction data consisting of image-text pairs that describe tasks or queries. However, this approach can diminish the MLLM's capacity to follow text-only instructions, because visual instruction data often lack the diversity and complexity of the original text instruction data used to train the underlying language model. This disparity in data quality and quantity contributes significantly to performance degradation in MLLMs.
About the Research
In the present paper, the authors introduced a new modality-alignment method that leverages preference data, that is, pairs of preferred and rejected responses for a given prompt. They collected this data from a lightweight visual question answering (VQA) dataset of about 6,000 entries, in which responses were scored in a granular fashion by Gemini, a state-of-the-art multi-modal AI model, on five quality metrics: helpfulness, correctness, coherence, complexity, and verbosity. The study then investigated the following four alignment methods for using the preference data to improve the MLLM's instruction-following capabilities.
- Standard supervised fine-tuning (SFT): It directly uses the answers provided by Gemini as the ground truth labels for the MLLM.
- Rejection sampling: It selects the best answer among four candidates generated by the MLLM based on Gemini’s scores and uses it as the ground truth label for SFT.
- Steerable language model (SteerLM): It augments the prompts with a description of the desired response quality based on Gemini’s scores and applies conditional SFT to steer the MLLM toward the preferred output.
- Direct preference optimization (DPO): It converts the preference data into pairs of preferred and rejected answers and optimizes the MLLM to assign higher probabilities to the preferred answers than to the rejected ones (a minimal loss sketch follows this list).
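To make the DPO idea above concrete, the following is a minimal, illustrative Python sketch. The example preference record, the tensor values, and the `beta` coefficient are hypothetical stand-ins chosen for illustration; they are not taken from the paper and do not reproduce the authors' implementation.

```python
# Minimal sketch of DPO-style preference optimization (illustrative only).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Push the policy to assign a higher (reference-adjusted) log-probability
    to the preferred answer than to the rejected one."""
    # Log-ratio of the trained policy vs. a frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the margin between preferred and rejected answers.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Hypothetical preference record of the kind described above, where the
# higher-rated candidate becomes "chosen" and the lower-rated one "rejected".
example_pair = {
    "prompt": "<image> Describe what the person in the photo is doing.",
    "chosen": "The person is riding a bicycle along a tree-lined path.",
    "rejected": "A bike.",
}

# Dummy per-sequence log-probabilities standing in for model outputs.
policy_chosen = torch.tensor([-12.3])
policy_rejected = torch.tensor([-15.8])
ref_chosen = torch.tensor([-13.1])
ref_rejected = torch.tensor([-14.9])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
print(f"DPO loss for the example pair: {loss.item():.4f}")
```

In practice, the per-sequence log-probabilities would be computed over the answer tokens of the fine-tuned MLLM (the policy) and a frozen copy of it (the reference), conditioned on the image and prompt; the sketch only shows the shape of the objective.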
Furthermore, the authors used the Large Language and Vision Assistant (LLaVA), a state-of-the-art MLLM built on the Vicuna large language model, as the base model for their experiments. They compared the performance of the four alignment methods on benchmarks that measured the MLLM's skills in visual instruction following, visual multiple-choice answering, and language instruction following.
Research Findings
The outcomes showed that DPO was the most effective alignment method, delivering significant gains on both visual and language benchmarks. It not only mitigated the degradation of language instruction skills caused by visual instruction tuning but also surpassed the performance of Vicuna, the original language model, on MT-Bench (a multi-turn benchmark) and AlpacaEval (an automatic evaluator for instruction-following language models), two benchmarks that evaluate the helpfulness and accuracy of text-only responses.
Moreover, DPO also boosted LLaVA's performance on MM-Vet and LLaVA-Bench, two benchmarks that assess an MLLM's ability to answer open-ended questions about image content. These results were achieved with only 5,000 preference examples, demonstrating DPO's data efficiency and scalability.
The other alignment methods showed mixed results: rejection sampling performed well on visual multiple-choice benchmarks, SteerLM yielded moderate improvements on language instruction benchmarks, and standard SFT caused severe degradation on both visual and language benchmarks. The authors attributed DPO's superiority to its direct optimization of preferences, which avoids the errors and biases that can be introduced by using Gemini's answers as ground-truth labels or conditional prompts.
The study contributes to the advancement of multi-modal AI by proposing an efficient and effective way to align MLLMs with human-like preferences and expectations. The preference-based alignment framework could enhance an MLLM's performance on tasks that require multi-modal reasoning and communication, such as image captioning, visual dialogue, and visual storytelling. It could also be applied to other domains and modalities, such as audio, video, and speech, to improve the MLLM's generalization and robustness.
Conclusion
In summary, the novel preference alignment method can efficiently correct the regression in language capability caused by visual instruction tuning, restoring and even enhancing the MLLM's language skills. The authors highlighted DPO as an effective method that leverages a small preference dataset to significantly improve performance on both visual and language instruction tasks.
The researchers acknowledged limitations and challenges, such as the scalability of the data collection, the accuracy of the preference annotations, the ethical implications of model alignment, and the potential risks of safety and bias propagation. They suggested future work to explore the scalability of DPO beyond 6000 preference examples, investigate alternative alignment strategies leveraging more diverse and complex data sources, and develop methods to align MLLMs with human values and societal norms.