Visual metaphors are powerful tools of communication that use imagery to convey complex ideas and emotions. However, generating high-quality visual metaphors is challenging and often benefits from collaboration between human artists and AI systems. In a recent study posted to the arXiv* preprint server, researchers explored the use of Chain-of-Thought (CoT) prompting to enhance the generation of visual metaphors by diffusion-based text-to-image models. The results demonstrated the potential of Human-AI collaboration in improving the quality and compositionality of visual metaphors.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
The limitations of current models
Existing diffusion-based text-to-image models excel at handling literal, content-based language but struggle to capture the abstraction and implicit meaning of figurative language. Because visual metaphors are inherently abstract, these models have difficulty accurately depicting the intended meaning and symbolism. Novel approaches are therefore required to bridge the gap between linguistic metaphors and their visual representations.
Advancements in language and text-to-image models
Recent advancements in large language models and text-to-image models have shown promise in facilitating creative processes across various domains. These models have demonstrated an ability to understand and generate human-like text, making them a valuable resource for creative endeavors. For instance, PopBlends, developed by Wang et al. (2023), leverages large language models to automatically generate conceptual blends for pop culture references, opening up new avenues for creative expression. Liu et al. (2023) introduced Generative Disco, an AI system that generates music visualizations using language and text-to-image models, offering a practical tool for creative professionals. These developments highlight the potential of AI systems in augmenting human creativity.
Creating the HAIVMet dataset
To address the challenges in generating visual metaphors, the present study proposed a collaborative approach combining the strengths of large language and diffusion-based models. The process involves three key steps.
First, visually grounded linguistic metaphors are selected from various sources to ensure their potential for visualization; metaphors with strong visual imagery and implicit meaning are preferred for the dataset. Second, a large language model, specifically Instruct GPT-3 with Chain-of-Thought (CoT) prompting, generates visual elaborations of the linguistic metaphors. CoT prompting guides the model to reason step by step, yielding a more precise and faithful account of the implicit meanings and visual elements; the resulting elaborations capture the essential objects and implicit meanings of each metaphor. Finally, diffusion-based models such as DALL·E 2 and Stable Diffusion take these visual elaborations as input to generate high-quality visual metaphors, which human experts then validate and refine to ensure accuracy and artistic quality.
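The second step above can be sketched in code. The prompt wording and the few-shot example below are hypothetical illustrations of how a CoT prompt for visual elaboration might be composed, not the exact prompts used in the study.

```python
# Illustrative sketch of Chain-of-Thought prompting for visual elaboration.
# The instructions and few-shot example are hypothetical, not taken from
# the study's actual prompts.

FEW_SHOT_EXAMPLE = """\
Metaphor: "My lawyer is a shark."
Step 1 - Objects: a lawyer, a shark.
Step 2 - Implicit meaning: the lawyer is aggressive and relentless.
Step 3 - Visual elaboration: A lawyer in a suit with the head of a shark,
baring its teeth across a courtroom table."""

def build_cot_prompt(metaphor: str) -> str:
    """Compose a few-shot CoT prompt that asks the language model to reason
    step by step from a linguistic metaphor to a concrete visual elaboration,
    which is then passed to a text-to-image model."""
    return (
        "Turn each linguistic metaphor into a visual elaboration by "
        "reasoning step by step.\n\n"
        f"{FEW_SHOT_EXAMPLE}\n\n"
        f'Metaphor: "{metaphor}"\n'
        "Step 1 - Objects:"
    )

prompt = build_cot_prompt("Time is a thief.")
print(prompt)
```

The trailing "Step 1 - Objects:" cue invites the model to continue the same step-by-step pattern shown in the example, so its final step emerges as a concrete, paintable scene description.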
Evaluation and results
Professional artists and designers evaluated the collaborative approach to assess the effectiveness of the generated visual metaphors. The evaluation compared the output of diffusion-based models with and without visual elaborations from large language models as input. The results showed that the collaborative approach significantly improved the quality of the generated visual metaphors: with visual elaborations as input, the models better captured the implicit meanings and more faithfully depicted the objects and relationships involved in the linguistic metaphors. LLM-DALL·E 2 emerged as the most successful model, demonstrating the effectiveness of Human-AI collaboration in enhancing visual metaphor generation.
The HAIVMet dataset
The collaborative approach resulted in the HAIVMet (Human-AI Visual Metaphor) dataset, comprising 6,476 visually metaphoric images that cover 1,540 unique linguistic metaphors, ensuring a diverse and comprehensive representation of visual metaphors. The HAIVMet dataset serves as a valuable resource for further research and development in the field of visual metaphor generation. It provides a benchmark for evaluating the performance of different models and techniques, allowing for the exploration of new approaches and improvements in the generation of visual metaphors.
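The two figures above imply that each linguistic metaphor is paired with several image variants, which a quick calculation makes explicit:

```python
# Reported HAIVMet dataset sizes.
total_images = 6476
unique_metaphors = 1540

# On average, each linguistic metaphor has roughly four image variants.
avg_images_per_metaphor = total_images / unique_metaphors
print(f"{avg_images_per_metaphor:.1f}")  # → 4.2
```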
Compositionality in visual metaphors
One of the significant findings of this research is the compositional nature of visual metaphors. Visual metaphors often require the combination of multiple elements to capture the metaphorical meaning effectively. The HAIVMet dataset showcases numerous examples of compositional visual metaphors, where the models successfully combine different objects, properties, and relationships to convey the intended metaphorical meaning. This highlights the importance of considering the compositionality of visual metaphors and the need for collaboration between human artists and AI systems to achieve these complex metaphorical representations.
Utilizing visual metaphors in downstream applications
Visual metaphors not only hold artistic and aesthetic value but also have practical implications in various downstream applications. The HAIVMet dataset, with its diverse collection of visual metaphors, was utilized in a Visual Entailment (VE) task to demonstrate its usefulness. The dataset was used to enhance a state-of-the-art VE model, resulting in a substantial improvement in accuracy compared to the model trained solely on real-world images. This showcases the practical utility and meaningful impact of visual metaphors in advancing vision-language models and their ability to capture metaphoric meanings.
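To make the VE setup concrete, the sketch below shows one plausible way to structure entailment examples from metaphoric images. The schema, field names, file path, and helper function are hypothetical illustrations, not the dataset's actual format.

```python
from dataclasses import dataclass

# Hypothetical schema for a visual-entailment example built from HAIVMet
# images; the field names and labels are illustrative, not the dataset's
# actual format.

@dataclass
class VEExample:
    premise_image: str   # path to a generated visual-metaphor image
    hypothesis: str      # textual claim to verify against the image
    label: str           # "entailment", "neutral", or "contradiction"

def make_ve_pair(image_path: str, metaphor_meaning: str, distractor: str):
    """Pair one metaphoric image with an entailed hypothesis (its implicit
    meaning) and a contradictory distractor hypothesis."""
    return [
        VEExample(image_path, metaphor_meaning, "entailment"),
        VEExample(image_path, distractor, "contradiction"),
    ]

# Hypothetical example built around the metaphor "Time is a thief".
pairs = make_ve_pair(
    "haivmet/time_is_a_thief_01.png",
    "Time steals moments from our lives.",
    "Time gives back everything it takes.",
)
print(len(pairs), pairs[0].label)  # → 2 entailment
```

Training on such pairs forces a VE model to connect an image's metaphoric content, not just its literal objects, to the hypothesis text, which is one plausible reason the metaphor-augmented model outperforms one trained solely on real-world images.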
Conclusion and future directions
In conclusion, this research demonstrates the potential of Human-AI collaboration in improving the generation of visual metaphors. By leveraging the strengths of large language models and diffusion-based models, researchers have paved the way for improved quality and compositionality in visual metaphors. The collaborative approach and the creation of the HAIVMet dataset provide valuable resources for further research and development.
Future investigations can build upon these findings to advance AI systems' understanding and generation of visual metaphors, opening up new possibilities for creative expression and communication. It is crucial to continue exploring the impact of prompt phrasing and model variations, and to expand the research to other languages to ensure broader representation and inclusivity in visual metaphor generation. With continued research and collaboration, the future holds even greater possibilities for generating visually compelling metaphors.