By addressing safety risks in multi-modal inputs, researchers have developed a new approach that restores the safety alignment of vision-language models, cutting unsafe outputs by over 90% without retraining.
Research: Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
In an article recently submitted to the arXiv preprint* server, researchers from the University of California, Davis, and Amazon Web Services Artificial Intelligence (AWS AI) Labs investigated the safety alignment ability of vision-language models (VLMs). The article noted that this ability degrades once the vision module is integrated, relative to the large language model (LLM) backbone alone.
They identified this issue as “safety alignment degradation,” stemming from a representation gap introduced by multi-modal inputs. The paper presented a mathematical formalization of this representation shift, modeling the hidden state as a combination of an ideal representation and a shift caused by the vision module. To address it, they introduced cross-modality representation manipulation (CMRM), which restores the alignment ability without additional training. Their results indicated a significant reduction in the unsafe rate of the 7-billion-parameter Large Language and Vision Assistant (LLaVA-7B) on multi-modal inputs.
Related Work
Previous work on safety alignment for VLMs focused on fine-tuning pre-trained models using human preference annotations to ensure they are helpful, honest, and harmless (the 3H principle). Approaches such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) were effective for aligning LLMs but required adaptation for multi-modal scenarios. While some researchers aligned VLMs with red-teaming data, that approach is labor-intensive and less efficient.
Enhancing VLM Representation Alignment
The authors formalized how the hidden states of current VLMs are affected once vision input is incorporated and proposed two variations of CMRM to intervene in model representations during inference. The first step modeled the hidden states of multi-modal inputs as shifted away from an ideal representation that lies within the distribution of the LLM backbone. This shifting assumption posits that VLM representations can be expressed as an interpolation between two scenarios: one where only text is input and another where the model benefits from visual information without the adverse effects of visual-modality interference.
To formalize this concept, the team introduced a mathematical representation where the hidden state is expressed as a combination of an ideal representation and a shift determined by a mixing coefficient. This coefficient indicates the degree of representation shift, with lower values resulting in mild shifts and higher values leading to more significant shifts away from the LLM backbone's representation distribution. The aim was to estimate an ideal representation that solely benefits from visual information, leading to the introduction of a calibration term to correct for any misalignment in multi-modal input representations.
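Schematically, the formalization described above can be written as follows; the notation here is illustrative and may differ from the paper's own symbols.

```latex
% Illustrative notation (may differ from the paper's own symbols):
%   h_v      -- hidden state produced for a multi-modal (text + image) input
%   h^{*}    -- ideal representation that benefits from the image while staying
%               within the LLM backbone's distribution
%   \delta   -- shifting vector introduced by incorporating the vision module
%   \lambda  -- mixing coefficient controlling the severity of the shift
h_v = h^{*} + \lambda \, \delta
% CMRM estimates the shift and applies a calibration term at inference time:
\hat{h}^{*} = h_v - \lambda \, \delta
```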
The researchers proposed CMRM to mitigate alignment degradation by eliminating the representation shift that arises when images are incorporated into the input. CMRM pulls multi-modality representations back toward the distribution optimized for the LLM backbone, where the safety alignment capability was initially developed.
The authors also emphasized that manipulating all layers of the model is crucial to achieving optimal safety performance. The process starts with extracting the shifting vectors that arise from incorporating visual input, which are believed to correlate with alignment degradation. The authors introduced two methods for extracting these shifting vectors: dataset-level extraction, which captures overall trends across an anchor dataset, and sample-level extraction, which focuses on individual cases. In terms of implementation, CMRM adjusts the model's hidden states by applying the extracted shifting vectors to the last-token representations of all layers, yielding better alignment across modalities. By performing inference on these calibrated representations, the model can leverage the additional visual information in the input while avoiding the detrimental representation shift caused by the visual modality. The paper also examined which specific layers of the model are most suitable for representation manipulation.
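As a concrete illustration of this procedure, the sketch below implements dataset-level extraction and last-token calibration with PyTorch forward hooks. It is a minimal sketch under stated assumptions, not the authors' released code: the model interface (model.hidden_states, model.decoder_layers) and the anchor-pair format are hypothetical placeholders.

```python
# Minimal sketch of dataset-level CMRM, assuming a generic PyTorch VLM wrapper.
# The interface (model.hidden_states, model.decoder_layers) and the anchor-pair
# format are hypothetical placeholders, not the authors' released code.
import torch


@torch.no_grad()
def extract_dataset_shift(model, anchor_pairs):
    """Average, per layer, how far the last-token hidden state moves when an
    image is added to an otherwise identical prompt (the shifting vectors)."""
    accumulated = None
    for text, image in anchor_pairs:
        h_text = model.hidden_states(text, image=None)    # one tensor per layer, (seq, dim)
        h_multi = model.hidden_states(text, image=image)
        shifts = [hm[-1] - ht[-1] for hm, ht in zip(h_multi, h_text)]  # last token only
        accumulated = shifts if accumulated is None else [
            a + s for a, s in zip(accumulated, shifts)
        ]
    return [s / len(anchor_pairs) for s in accumulated]


def register_cmrm_hooks(model, shift_vectors, alpha=1.0):
    """Subtract the (alpha-scaled) shifting vector from the last-token
    representation at every decoder layer during inference."""
    handles = []
    for layer, shift in zip(model.decoder_layers, shift_vectors):
        def hook(module, inputs, output, shift=shift):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden[:, -1, :] = hidden[:, -1, :] - alpha * shift  # calibration term
            return output
        handles.append(layer.register_forward_hook(hook))
    return handles  # call handle.remove() on each to restore the unmodified model
```

A sample-level variant would instead compute the shift from the current input's own text-only forward pass rather than averaging over an anchor dataset, trading extra inference cost for a per-sample correction.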
Robust Safety Enhancement for VLMs
The proposed CMRM was empirically evaluated from several perspectives to assess its impact on VLMs. The analysis covered CMRM's effectiveness in recovering the safety alignment of the LLM backbone, the general performance of VLMs, the influence of hyperparameters, the generalizability of extracted shifting directions across datasets, and the effects of CMRM on the models' hidden states.
The results revealed that both dataset- and sample-level CMRM significantly improved safety, reducing the unsafe rate of LLaVA-7B from 61.53% to as low as 5.41% on the VLSafe and JailbreakLLMs datasets, closely approximating the behavior on pure text inputs. Dataset-level extraction captured overall trends across multiple samples, while sample-level extraction provided fine-grained adjustments for individual cases. This demonstrated CMRM's efficacy in mitigating safety risks associated with multi-modal inputs. Additionally, CMRM maintained or even enhanced the general utility of the models, with only modest computational overhead.
Comparatively, CMRM outperformed the training-time baseline method, VLGuard, in several scenarios, highlighting its potential to enhance safety alignment while avoiding extensive fine-tuning. The analysis showed that an alpha value of 1.0 optimized performance across models, while higher values risked degrading general utility. CMRM effectively aligned multi-modal inputs with text representations and improved safety across various datasets, proving to be a robust method for enhancing VLM safety without sacrificing performance.
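As a toy illustration of this hyperparameter (assumed behavior for illustration, not an experiment from the paper), the snippet below shows that the strength of the pull-back scales linearly with alpha, so overly large values move the representation far beyond the estimated correction.

```python
# Toy illustration of the calibration strength alpha (not from the paper's code):
# larger alpha pulls the last-token state further from its original multi-modal
# position; the article reports alpha = 1.0 as the sweet spot, with larger
# values risking general utility.
import torch

torch.manual_seed(0)
hidden = torch.randn(4096)   # last-token hidden state with image input (illustrative size)
shift = torch.randn(4096)    # extracted shifting vector for the same layer

for alpha in (0.5, 1.0, 2.0):
    calibrated = hidden - alpha * shift
    print(f"alpha={alpha}: displacement {torch.linalg.norm(calibrated - hidden).item():.2f}")
```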
Conclusion
To sum up, the investigation revealed that incorporating visual input degraded the safety alignment of the LLM backbone in VLMs, shifting hidden states away from their optimized distribution. The proposed CMRM method intervened during inference, effectively moving hidden states closer to the trained distribution, thus enhancing safety alignment without extensive retraining.
While CMRM slightly increased computational overhead, it significantly improved safety alignment without compromising overall model performance. Furthermore, the paper identified the importance of manipulating all layers of the model to achieve optimal safety performance. Future work aimed to address degradation in other aspects of VLMs and to explore optimal anchor dataset features for shifting vector extraction.
Journal reference:
- Preliminary scientific report. Qin Liu et al. (2024). Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models. arXiv. DOI: 10.48550/arXiv.2410.09047, https://arxiv.org/abs/2410.09047