In a recent paper submitted to the arXiv* server, researchers introduced a novel model called the Multimodal Audio-Image to Video Action Recognition Transformer (MAiVAR-T). This innovative approach enhances multimodal human action recognition (MHAR) by fusing audio-image representations with video, capitalizing on contextual richness in both modalities. Unlike existing strategies, MAiVAR-T excels by combining video and audio modalities, as validated through comprehensive benchmark evaluations.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Human action recognition holds immense importance in domains such as surveillance, interactive gaming, robotics, and healthcare. While visual cues have historically taken precedence, the significance of auditory elements in human actions cannot be ignored. The fusion of audio and visual cues in MHAR offers a more comprehensive understanding.
However, present MHAR models encounter challenges in effectively merging multimodal data. Convolutional neural network (CNN)-based video models demand greater computational resources than their image counterparts, in some cases performing convolutional operations across the full spatiotemporal dimensions. Meanwhile, long short-term memory (LSTM) networks and other recurrent neural networks (RNNs) struggle with extended sequences due to limitations in memory efficiency.
Advancing MHAR through deep learning and transformers
Deep learning has made remarkable strides in advancing MHAR, facilitating the extraction of vital features from multimodal data to enhance action recognition. CNNs extract spatial features, while LSTMs model temporal dynamics. However, the traditional CNN-LSTM approach faces challenges in multimodal fusion and managing extended temporal sequences.
Transformers excel in domains such as image classification, natural language processing, and video understanding. Their self-attention mechanism holds promise for enhancing multimodal fusion and feature extraction in MHAR tasks, yet its application in MHAR remains relatively unexplored.
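For readers unfamiliar with the mechanism, the sketch below shows generic scaled dot-product self-attention in PyTorch. It is not code from the MAiVAR-T paper; the tensor shapes and projection matrices are illustrative assumptions only.

```python
# Minimal sketch of scaled dot-product self-attention, the mechanism the
# article credits with transformers' promise for multimodal fusion.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, tokens, dim); w_*: (dim, dim) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                       # project tokens to queries/keys/values
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # scaled pairwise similarity
    weights = F.softmax(scores, dim=-1)                       # attention weights over all tokens
    return weights @ v                                        # each token becomes a weighted mix of values

# Example: 2 clips, 8 tokens each, 64-dimensional embeddings (shapes are assumptions)
dim = 64
x = torch.randn(2, 8, dim)
w = [torch.randn(dim, dim) * dim ** -0.5 for _ in range(3)]
out = self_attention(x, *w)   # -> (2, 8, 64)
```

Because every token attends to every other token, the same operation can in principle mix tokens drawn from different modalities, which is what makes it attractive for MHAR.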
The history of audiovisual multimodal learning spans the pre-deep-learning era to the present. Early methods relied on simpler techniques because of data and computational limitations. Deep learning introduced more sophisticated strategies that implicitly learn joint or modality-specific latent representations for fusion, improving supervised audiovisual tasks.
Joint training of modality-specific CNNs often involves combining activations through summation. Vision Transformers (ViT) and Video Vision Transformers (ViViT) have transformed MHAR: ViT segments images into interpretable patches, improving action recognition in still images, while ViViT extends this idea to video, decoding the spatiotemporal dynamics of human movement. Together, they enable accurate classification and a more nuanced understanding across the visual and auditory domains.
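The patch-based tokenization at the heart of ViT can be summarized in a short sketch. The patch size, channel count, and embedding dimension below are common ViT defaults used purely for illustration, not values reported for MAiVAR-T.

```python
# Illustrative ViT-style patch tokenization: an image (or an audio-image) is
# cut into fixed-size patches, each flattened and linearly projected into a
# token, with a learnable positional embedding added for spatial context.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is the standard trick for "flatten + project"
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):                     # x: (batch, 3, 224, 224)
        x = self.proj(x)                      # (batch, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (batch, 196, dim) token sequence
        return x + self.pos

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))   # -> (2, 196, 768)
```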
Enhanced Transformer Model for MHAR
Human actions were drawn from the University of Central Florida (UCF)-101 benchmark dataset, with each instance comprising a video clip and its corresponding audio stream. Given the focus on audio, videos without an audio track were excluded, leaving a total of 6837 videos spanning 51 categories. Audio and video data underwent distinct preprocessing steps: video data was decomposed into individual frames, whereas audio data was transformed into six uniform, normalized audio-image representations. Notable properties of audio-image representations include dimensionality reduction, resistance to visual alterations, standardization for cross-source comparison, and applicability in privacy-centric domains such as surveillance.
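The article does not spell out which six audio-image representations are used, so the sketch below uses a normalized mel-spectrogram purely as an illustrative stand-in; the resizing and normalization choices are assumptions, not the paper's preprocessing pipeline.

```python
# Sketch: turning a clip's audio track into one fixed-size, normalized
# "audio-image" (a mel-spectrogram stand-in for the paper's representations).
import librosa
import numpy as np

def audio_to_image(wav_path, size=224):
    y, sr = librosa.load(wav_path, sr=None)                    # waveform from the clip's audio track
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=size)
    img = librosa.power_to_db(mel, ref=np.max)                 # log scale for an image-like dynamic range
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)   # normalize to [0, 1]
    idx = np.linspace(0, img.shape[1] - 1, size).astype(int)   # resample time axis to a fixed width
    return img[:, idx]                                         # (224, 224) audio-image
```

A fixed-size image like this can then be tokenized exactly as a visual frame would be, which is what allows a ViT-style encoder to handle the audio modality.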
The current study introduces a pioneering transformer-based model, MAiVAR-T. It integrates an audio transformer, a cross-modal attention layer, and a video transformer. Audio processing employs ViT, preserving positional information. Video mapping involves tokenization and positional embedding, aligned with transformer-based ViT principles.
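The three-part design described above can be approximated in a highly simplified sketch. The layer counts, embedding dimensions, and the choice of letting video tokens query audio tokens are assumptions for illustration, not the paper's configuration; only the 51-class output follows from the dataset described earlier.

```python
# Simplified sketch of the described design: an audio transformer, a video
# transformer, and a cross-modal attention layer joining the two streams.
import torch
import torch.nn as nn

class CrossModalActionModel(nn.Module):
    def __init__(self, dim=256, heads=4, num_classes=51):
        super().__init__()
        enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2)
        self.audio_encoder = enc()                 # stands in for the audio ViT branch
        self.video_encoder = enc()                 # stands in for the video branch
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio_tokens, video_tokens):
        a = self.audio_encoder(audio_tokens)       # (B, Na, dim)
        v = self.video_encoder(video_tokens)       # (B, Nv, dim)
        fused, _ = self.cross_attn(query=v, key=a, value=a)   # video queries attend to audio
        return self.classifier(fused.mean(dim=1))  # pooled fused tokens -> 51 action logits

logits = CrossModalActionModel()(torch.randn(2, 196, 256), torch.randn(2, 64, 256))
```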
Audio-visual fusion strategies and results
Audio-image representations were divided into patches, and positional embeddings were incorporated to provide spatial context. Training data was batched in sets of 16 samples and augmented using techniques such as time-stretching and random cropping for greater robustness. Features extracted from the video were fed into the AV-Fusion multi-layer perceptron (MLP) multimodal fusion module, enabling action class classification. During training, a multimodal cross-entropy loss function was employed to maintain a balance between the video and audio modalities. The transformer-based model was trained for 100 epochs with a learning-rate schedule that reduced the rate by 10% every four epochs. The Adam optimizer was used, and dropout was applied to avoid overfitting.
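The reported recipe (batches of 16, Adam, cross-entropy, 100 epochs, a 10% learning-rate cut every four epochs) maps onto a conventional training loop, sketched below with the CrossModalActionModel stand-in from the previous sketch. The initial learning rate, dummy data shapes, and the use of a single fused cross-entropy head are assumptions; the exact form of the paper's multimodal loss is not given in the article.

```python
# Sketch of the reported training recipe, on dummy data standing in for the
# preprocessed UCF clips (all shapes are assumptions).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(64, 196, 256),        # audio tokens
                     torch.randn(64, 64, 256),         # video tokens
                     torch.randint(0, 51, (64,)))      # action labels
train_loader = DataLoader(data, batch_size=16, shuffle=True)

model = CrossModalActionModel()                        # dropout is built into its transformer layers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.9)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):
    for audio_tokens, video_tokens, labels in train_loader:
        optimizer.zero_grad()
        logits = model(audio_tokens, video_tokens)
        loss = criterion(logits, labels)               # single fused head in this sketch
        loss.backward()
        optimizer.step()
    scheduler.step()                                   # 10% learning-rate decay every 4 epochs
```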
The results show that the audio and video transformers, together with the cross-modal attention layer, each contribute substantially to final action recognition performance. The accuracy metric measures the proportion of the model's correct predictions. When features extracted by the transformers are compared with their CNN counterparts, experiments reveal MAiVAR-T's superiority, surpassing previous techniques by a margin of three percent.
Conclusion
In summary, CNNs operating on the video modality have dominated action video classification. The current study challenges this video-centric paradigm and introduces a multimodal, transformer-based framework: MAiVAR-T. Driven by fusion and designed as an end-to-end solution, the model streamlines the process while simultaneously improving performance.
Experimental results indicate that the proposed transformer-based fusion of audio-image representations with video performs comparably to conventional image-only methods, corroborating previous research findings. Pre-training on extensive video datasets offers further performance potential. Future directions include validating text modality integration, exploring MAiVAR-T's scalability on larger datasets, and combining generative AI-based transformer architectures for deeper insight into the transformer's impact on MHAR.