Language as the Medium: Text-Based Multimodal Video Classification

In a recent submission to the arXiv* server, researchers introduced a novel model-agnostic approach for generating detailed textual descriptions that capture the rich multimodal information present in videos. Leveraging the reasoning abilities of large language models (LLMs) such as Generative Pre-trained Transformer (GPT)-3.5 and Llama 2, the method classifies videos from textual descriptions of their visual and auditory content produced by perception models such as BLIP-2 (Bootstrapping Language-Image Pre-training with frozen image encoders), Whisper, and ImageBind.

Study: Bridging Modalities: A Novel Multimodal Classification Approach Using Textual Cues. Image credit: metamorworks /Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

Background

Recent years have witnessed remarkable advancements in LLMs for text, showcasing unprecedented performance across a wide range of tasks. This progress has inspired efforts to bridge the gap between the visual and textual domains. Notable approaches include Contrastive Language-Image Pre-training (CLIP), Perceiver IO, Kosmos, and GPT-4, which leverage multimodal information to varying degrees. In contrast, the current study reveals the potential of using text itself as the medium for conveying multimodal information to downstream LLMs. This approach offers several advantages, including a straightforward interface for model chaining, transparent inter-model communication in natural language, and a clean separation of multimodal video classification into "perception" and "reasoning" phases.

While similar methods, such as LENS and Video ChatCaptioner, explore textual interactions between models, this research demonstrates that LLMs can effectively classify video actions based solely on textual representations of auditory and visual cues.

Perception and Reasoning Models

The process involves two stages: perception and reasoning. For perception, BLIP-2 extracts visual captions from five frames sampled from each video. Whisper provides audio transcripts via the Faster Whisper implementation, run with a temperature of zero and a voice activity detection (VAD) filter. Audio tags are generated by using ImageBind to compute the similarity between the audio and the textual AudioSet labels.
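To make the perception step concrete, the following is a minimal sketch of frame captioning and audio transcription, assuming the Hugging Face transformers BLIP-2 checkpoint, OpenCV for frame extraction, and the faster-whisper package. The model names and the uniform frame-sampling strategy are illustrative assumptions rather than the authors' exact pipeline, and the ImageBind audio-tagging step is omitted.

```python
# Sketch of the "perception" step: caption sampled frames with BLIP-2 and
# transcribe the audio track with Faster Whisper. Model choices and the
# uniform frame-sampling strategy are assumptions for illustration only.
import cv2
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from faster_whisper import WhisperModel


def sample_frames(video_path: str, num_frames: int = 5):
    """Uniformly sample num_frames RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames


# Visual captions with BLIP-2 (public Hugging Face checkpoint assumed).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")


def caption_frames(frames):
    captions = []
    for frame in frames:
        inputs = processor(images=frame, return_tensors="pt").to("cuda", torch.float16)
        out = blip2.generate(**inputs, max_new_tokens=30)
        captions.append(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
    return captions


# Audio transcript with Faster Whisper: temperature zero and a VAD filter,
# matching the settings described above.
whisper = WhisperModel("large-v2", device="cuda")


def transcribe(media_path: str) -> str:
    segments, _ = whisper.transcribe(media_path, temperature=0.0, vad_filter=True)
    return " ".join(segment.text.strip() for segment in segments)
```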

In the reasoning phase, three state-of-the-art LLMs are evaluated: GPT-3.5-turbo, Claude-instant-1, and Llama-2-13b-chat. All are run with a low temperature for consistent outputs. Prompts follow a template that asks the model to classify the action from the multimodal clues and a list of candidate labels, returning the top five predictions in JSON format.
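As an illustration of the reasoning step, the sketch below sends such a prompt to GPT-3.5-turbo through the OpenAI Python client; the prompt template and field names are assumptions for demonstration, not the paper's exact wording.

```python
# Illustrative "reasoning" call. The prompt template is an assumption, not the
# authors' exact prompt. Requires the openai package and an OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

PROMPT_TEMPLATE = """You are given textual clues extracted from a video.
Frame captions: {captions}
Audio transcript: {transcript}
Audio tags: {audio_tags}

From the following action labels: {labels}
Return the five most likely labels as a JSON list, most likely first."""


def classify(captions, transcript, audio_tags, labels):
    prompt = PROMPT_TEMPLATE.format(
        captions="; ".join(captions),
        transcript=transcript or "none",
        audio_tags=", ".join(audio_tags) or "none",
        labels=", ".join(labels),
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.0,  # low temperature for consistent outputs
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```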

Structured output is obtained differently for each model, using JSON schemas, prompt instructions, or parsing of numbered lists; any output that cannot be parsed is counted as an incorrect prediction.
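One hedged way to apply the rule that unparseable outputs count as incorrect is sketched below; it tries JSON first and falls back to a numbered list, returning an empty prediction on failure. The per-model details (schemas, prompt wording) are only summarized here.

```python
import json
import re


def parse_top5(raw: str):
    """Read a JSON list of labels, or fall back to a numbered list.
    An empty return value marks the prediction as incorrect."""
    try:
        labels = json.loads(raw)
        if isinstance(labels, list):
            return [str(label) for label in labels][:5]
    except json.JSONDecodeError:
        pass
    # Fallback for outputs such as "1. PlayingGuitar\n2. PlayingViolin"
    numbered = re.findall(r"^\s*\d+[.)]\s*(.+)$", raw, flags=re.MULTILINE)
    return [item.strip() for item in numbered[:5]]
```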

Experiments and Analysis

Two datasets were employed. UCF-101, a collection of 101 human action classes from videos in the wild, comprises 13,320 YouTube video clips ranging from playing instruments to sports activities. Kinetics400 contains 10-second YouTube clips covering 400 human action classes; to manage API costs, a smaller subset of 2,000 videos, five per category, was created from the original 38,685 video clips.
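Building such a balanced subset is straightforward; the sketch below assumes a plain mapping from class names to lists of video paths and is not the authors' exact sampling code.

```python
import random


def balanced_subset(videos_by_class: dict, per_class: int = 5, seed: int = 0):
    """Sample per_class clips from each action class, e.g., 5 x 400 = 2,000
    clips for Kinetics400. videos_by_class maps class name -> list of paths."""
    rng = random.Random(seed)
    subset = []
    for label, paths in sorted(videos_by_class.items()):
        for path in rng.sample(paths, min(per_class, len(paths))):
            subset.append((label, path))
    return subset
```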

The experiments aim to quantify each modality's contribution to video classification, and the results demonstrate that incorporating audio information improves the language models' performance. The study also compares the three LLMs, GPT-3.5-turbo, Claude-instant-1, and Llama 2, on interpreting the visual and auditory cues, with Claude-instant-1 consistently achieving the highest accuracy.
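For reference, top-1 accuracy for any modality combination can be computed with a few lines like the following; the record format is a hypothetical assumption for illustration.

```python
def top1_accuracy(records):
    """records: iterable of (true_label, predicted_top5) pairs, where a failed
    parse yields an empty list and therefore counts as incorrect."""
    total = correct = 0
    for true_label, top5 in records:
        total += 1
        if top5 and top5[0].lower() == true_label.lower():
            correct += 1
    return correct / total if total else 0.0


# Example: compare runs with and without audio clues (hypothetical results).
# print(top1_accuracy(results_vision_only), top1_accuracy(results_with_audio))
```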

The impact of including more or fewer frame captions is also investigated. Interestingly, while Claude-instant-1 and GPT-3.5-turbo benefit from additional captions, Llama 2's performance deteriorates. This drop is attributed to potential information overload, which causes Llama 2 to favor words from the captions over the label list.

Challenges and Opportunities in Multimodal Understanding

Using separate models to translate visual and speech data into text can limit the ability to capture intermodal interactions, leading to incomplete context understanding. Analyzing frames individually also lacks the temporal modeling needed to track identities and relationships consistently over time. Although generative models are powerful, they may produce unreliable outputs and hallucinations, making them less dependable for tasks requiring consistency. Additionally, relying solely on class names as model input assumes the names are descriptive, which may not always hold, especially for specific categories such as musical instruments. The models also struggle with actions that demand more context, such as head massage, and with obscure objects.

While not matching state-of-the-art (SOTA) zero-shot performance, the proposed method demonstrates greater generalizability in video-understanding scenarios requiring complex contextual reasoning. Future work could incorporate additional context, such as video comments, or adopt a chat-based approach in which the "reasoning" module queries the "perception" module for further information.

Conclusions

In summary, the current study introduces a novel two-phase multimodal classification approach, showcasing the power of text in interpreting complex data and highlighting the effectiveness of textual cues for video classification.

A groundbreaking achievement of this work is the establishment of a zero-shot video classification system by linking perception models for vision, speech, and audio with LLMs, relying solely on textual representations of multimodal signals. This research underscores the promising prospects of employing natural language as a versatile interface for seamlessly integrating signals across diverse modalities.



Written by

Dr. Sampath Lonka

Dr. Sampath Lonka is a scientific writer based in Bangalore, India, with a strong academic background in Mathematics and extensive experience in content writing. He has a Ph.D. in Mathematics from the University of Hyderabad and is deeply passionate about teaching, writing, and research. Sampath enjoys teaching Mathematics, Statistics, and AI to both undergraduate and postgraduate students. What sets him apart is his unique approach to teaching Mathematics through programming, making the subject more engaging and practical for students.

