A new pipeline leveraging AI and vision-language models creates richly detailed captions for comic panels, helping readers with visual impairments experience the full depth of comic stories while sharpening computational comic analysis.
Research: ComiCap: A VLMs pipeline for dense captioning of Comic Panels
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
In an article recently submitted to the arXiv preprint server, researchers focused on advancing comic analysis by proposing a pipeline that utilized vision-language models (VLMs) to generate dense, contextually grounded captions for comic panels. They introduced an attribute-retaining metric to evaluate the comprehensiveness of captions and created a densely annotated test set to assess open-source VLMs. The pipeline aimed to enhance understanding of comic storylines by capturing relationships between elements, such as objects and characters. The researchers successfully annotated over two million panels across 13,000 books, creating a valuable resource for further research.
Background
Comics are a complex medium for computational analysis and remain largely inaccessible to individuals with visual impairments. Recent research has focused on dialog generation tasks that assist these readers by transcribing spoken text and linking it to the characters who speak it. While various benchmarks and methods for detection and dialog generation have emerged, they often lack the contextual information needed to describe the visual aspects of panels, such as character positioning and background objects, which is vital for fully understanding comics. Previous studies have emphasized dialog transcription, yet none have effectively addressed the need for panel and character descriptions that provide this context through grounded approaches.
The researchers filled this gap by leveraging VLMs to generate dense captions alongside dialog transcriptions for comic panels. They introduced a custom two-stage metric to assess caption quality and benchmarked existing open-source VLMs, achieving superior performance without additional training. Their approach first extracted attributes from captions and then compared them to ground truth data using a Jaccard similarity-based method. The study annotated over two million comic panels, significantly enhancing the accessibility of comic content for individuals with visual impairments and advancing the understanding of comic storytelling through detailed, context-rich descriptions.
Captioning and Attribute Extraction Techniques
The authors aimed to generate dense captions for comic book panels, incorporating essential attributes related to the scenes and characters along with their bounding boxes. This was achieved using VLMs in a zero-shot setting, allowing attributes to be extracted, grounded, and later evaluated against a purpose-built test set. Unlike traditional captioning methods that prioritize fluency, this approach focused on retaining key visual details, even at the expense of caption conciseness.
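To make the zero-shot setup concrete, the sketch below shows how a single panel could be captioned with one of the evaluated open-source VLMs (Idefics2) through the Hugging Face transformers library. The prompt wording, generation settings, and file name are illustrative assumptions rather than the authors' exact configuration.

```python
# A minimal sketch of zero-shot panel captioning with an open-source VLM
# (Idefics2 via Hugging Face transformers); prompt and settings are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b", torch_dtype=torch.float16, device_map="auto"
)

panel = Image.open("panel_001.png")  # a cropped comic panel (hypothetical file)

# Ask for a dense, attribute-rich description rather than a fluent one-liner.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": (
            "Describe this comic panel in detail: list every character, "
            "their appearance, pose and position, plus background objects."
        )},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[panel], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
caption = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(caption)
```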
Among the recent VLMs, Idefics2 and MiniCPM-llama3-V-2.5 stood out for their advanced visual processing and language understanding capabilities. Idefics2 employed an autoregressive architecture with a robust vision encoder, while MiniCPM integrated elements from other models to handle high-resolution images effectively. The resampling technique used by MiniCPM was especially crucial for processing complex comic panels without data loss, which helped maintain attribute accuracy. Both models utilized techniques like reinforcement learning from artificial intelligence (AI) feedback to enhance performance.
Grounding models, such as PaliGemma and Florence2, were also notable for their ability to generate bounding boxes and segmentation masks. PaliGemma employed a prefix-LM masking paradigm, while Florence2 used a unified approach to manage various vision-language tasks, facilitating dense captioning through a two-step process. Florence2, for example, was trained on a massive dataset of over 5.4 billion annotations, making it particularly adept at handling the diverse visual elements found in comics.
The extraction of attributes focused on retaining essential details about characters and scenes, utilizing a custom procedure based on large language models (LLMs). This method not only ensured high alignment with the ground truth but also accounted for synonyms and visual context, for example by generating synonymous terms for objects that GroundingDINO initially failed to detect in complex scenes. The ultimate goal was to ensure that the generated captions comprehensively reflected the relevant attributes and elements of the comic panels.
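As a rough illustration of the attribute-extraction step, the sketch below prompts an LLM to turn a generated caption into a list of attribute phrases. The prompt text and the `llm_generate` callable are hypothetical placeholders, not the paper's actual procedure.

```python
# A minimal sketch of LLM-based attribute extraction from a panel caption.
# The prompt and the `llm_generate` callable are hypothetical placeholders.
import json
from typing import Callable, List

EXTRACTION_PROMPT = (
    "Extract the visual attributes (characters, objects, colors, poses, "
    "background elements) mentioned in the caption below. "
    "Answer with a JSON list of short noun phrases.\n\nCaption: {caption}"
)

def extract_attributes(caption: str, llm_generate: Callable[[str], str]) -> List[str]:
    """Return the list of attribute phrases found in a panel caption."""
    raw = llm_generate(EXTRACTION_PROMPT.format(caption=caption))
    try:
        attrs = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to a simple comma split if the model does not return valid JSON.
        attrs = [a.strip() for a in raw.strip("[]").split(",") if a.strip()]
    return [a.lower() for a in attrs]
```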
Attribute Retaining Metric and Pipeline
The attribute retaining metric (ARM) was designed to compare sets of attributes generated by a model with ground truth elements using a two-step process. First, the metric calculated a pairwise bidirectional encoder representations from transformers (BERT) score for all potential pairs to establish associations between predicted attributes and ground truth elements, avoiding misleading matches by applying a chosen threshold (τ) between 0.5 and 0.99.
The resulting associations replaced predicted attributes with corresponding ground truth elements, leading to a modified attribute set. In the second step, the Jaccard similarity, an intersection-over-union metric, was applied to assess the sets' similarity, producing the ARM score as an average similarity measure.
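A minimal sketch of this two-step computation, assuming predicted attributes are matched greedily to their most similar ground truth element, might look as follows (the `similarity` callable stands in for a BERTScore-style comparison between two phrases):

```python
# A minimal sketch of the two-step attribute-retaining metric (ARM) as described:
# (1) match predicted attributes to ground truth elements whose similarity clears
# a threshold tau, (2) take the Jaccard similarity of the resulting sets.
# The greedy matching shown here is an assumption about the exact pairing procedure.
from typing import Callable, List, Set

def arm_score(
    predicted: List[str],
    ground_truth: List[str],
    similarity: Callable[[str, str], float],  # e.g. a BERTScore F1 between two phrases
    tau: float = 0.7,                         # threshold chosen in [0.5, 0.99]
) -> float:
    # Step 1: replace each predicted attribute with its best-matching
    # ground-truth element, provided the similarity clears the threshold.
    matched: Set[str] = set()
    for pred in predicted:
        best_gt, best_sim = None, 0.0
        for gt in ground_truth:
            sim = similarity(pred, gt)
            if sim > best_sim:
                best_gt, best_sim = gt, sim
        matched.add(best_gt if best_sim >= tau else pred)

    # Step 2: Jaccard similarity (intersection over union) between the
    # modified prediction set and the ground-truth set.
    gt_set = set(ground_truth)
    union = matched | gt_set
    return len(matched & gt_set) / len(union) if union else 0.0
```

In practice, these per-panel scores would then be averaged over the test set to give the final ARM value.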
In evaluating various VLMs for comic book panel captioning, MiniCPM achieved the highest ARM score. ARM emphasizes the retention of important attributes, unlike traditional metrics such as recall-oriented understudy for gisting evaluation (ROUGE) and bilingual evaluation understudy (BLEU), which reward surface-level syntactic similarity rather than attribute coverage.
Additionally, the pipeline included techniques such as removing optical character recognition (OCR)-detected text and separating panel and character captions for improved clarity. The GroundingDINO model was utilized to detect objects in the attribute set, with an added feature of generating synonymous terms for objects missed during initial detection, significantly enhancing the coverage of captioned elements. A comprehensive dataset of over 13,000 comics was developed, with more than 1.5 million panels and 2.06 million characters detected, creating a richly captioned comics dataset for research purposes. This methodology demonstrated significant advancements in dense captioning and attribute extraction in comics.
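The detect-then-retry-with-synonyms behavior could be approximated along the following lines using GroundingDINO's Hugging Face integration; the checkpoint, thresholds, and `generate_synonyms` helper are assumptions for illustration, and exact argument names may vary between transformers versions.

```python
# A minimal sketch of grounding extracted attributes with GroundingDINO and
# retrying undetected attributes with LLM-proposed synonyms. Checkpoint,
# thresholds, and the `generate_synonyms` callable are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
model = AutoModelForZeroShotObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny")

def ground(panel: Image.Image, phrase: str, threshold: float = 0.35):
    """Return bounding boxes for `phrase` in the panel, or an empty list."""
    inputs = processor(images=panel, text=f"{phrase.lower()}.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    results = processor.post_process_grounded_object_detection(
        outputs, inputs.input_ids,
        box_threshold=threshold, text_threshold=threshold,
        target_sizes=[panel.size[::-1]],
    )[0]
    return results["boxes"].tolist()

def ground_with_synonyms(panel, attribute, generate_synonyms):
    """Try the attribute itself first, then LLM-proposed synonyms if nothing is found."""
    boxes = ground(panel, attribute)
    for synonym in ([] if boxes else generate_synonyms(attribute)):
        boxes = ground(panel, synonym)
        if boxes:
            break
    return attribute, boxes
```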
Conclusion
In conclusion, the researchers successfully advanced comic analysis through a novel pipeline that generated dense captions for comic panels using VLMs. By introducing an attribute-retaining metric and densely annotating over two million panels, the research addressed the accessibility challenges faced by individuals with visual impairments.
The study highlights the importance of grounding visual elements within text, providing essential context for users relying on screen readers and other assistive technologies. The use of advanced VLMs enabled the extraction of rich, context-aware descriptions that enhanced understanding of comic storylines. The findings not only contributed valuable resources to the research community, including the ComiCap dataset, but also set the stage for real-time applications in educational and recreational settings, particularly for the visually impaired community. Future work will focus on expanding the dataset to a wider variety of comic styles, exploring real-time captioning systems, and further improving grounding techniques, for example through more accurate attribute extraction by integrating multiple VLMs.
Journal reference:
- Preliminary scientific report. Vivoli, E., Biondi, N., Bertini, M., & Karatzas, D. (2024). ComiCap: A VLMs pipeline for dense captioning of Comic Panels. arXiv. DOI: 10.48550/arXiv.2409.16159, https://arxiv.org/abs/2409.16159v1