Video recognition is critical in various domains, including surveillance systems, autonomous vehicles, and human-computer interaction. Recent advances in video recognition models have been driven by Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). To address the limitations of these approaches, a recent paper posted to the arXiv* preprint server introduces an architecture called Video-FocalNet, which combines the strengths of CNNs and ViTs to achieve efficient and accurate video action recognition.
Efficient video action recognition
CNN-based methods have significantly improved image recognition and have been extended to video recognition with remarkable success. However, modeling spatiotemporal relations in videos with 3D convolutions substantially increases the computational cost. This challenge led researchers to develop variants of 3D CNNs that reduce computational complexity while maintaining or improving performance. These approaches have shown promise, but the limitations of CNNs in modeling long-range dependencies still need to be overcome.
In contrast, Vision Transformers (ViTs) have emerged as a powerful alternative, leveraging self-attention mechanisms to encode both short- and long-range dependencies. ViTs have achieved impressive results in large-scale video recognition benchmarks, surpassing their CNN counterparts. However, their practical applicability is limited due to higher computational and parameter costs, hindering real-time video recognition tasks.
The architecture of Video-FocalNets
Video-FocalNets introduce a novel architecture that combines the strengths of CNNs and ViTs while addressing their limitations. At the core of Video-FocalNets is the spatiotemporal focal modulation technique, which efficiently captures contextual information in videos. This technique employs a hierarchical contextualization process that uses depthwise and pointwise convolutions to capture spatial and temporal dependencies.
To implement the spatiotemporal focal modulation, the input spatiotemporal feature map is projected using linear layers to obtain queries, spatial and temporal feature maps, and spatial and temporal gates. These components are then utilized to generate spatial and temporal modulators, which encode the surrounding context for each query. The modulators are fused with the query tokens through element-wise multiplication, resulting in a final spatiotemporal feature map encapsulating local and global contexts.
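To make the mechanism more concrete, the following is a minimal PyTorch sketch of a spatiotemporal focal modulation block along these lines. The module name, the number of focal levels, the exact gating and pooling details, and the way the spatial and temporal modulators are fused with the query are illustrative assumptions, not the authors' reference implementation.

```python
import torch.nn as nn
import torch.nn.functional as F


class SpatioTemporalFocalModulation(nn.Module):
    """Illustrative spatiotemporal focal modulation block (not the official code)."""

    def __init__(self, dim, focal_levels=2):
        super().__init__()
        # Linear projections for the query, the spatial/temporal context maps,
        # and the spatial/temporal gates (one gate per focal level plus a global one).
        self.to_query = nn.Linear(dim, dim)
        self.to_spatial = nn.Linear(dim, dim + focal_levels + 1)
        self.to_temporal = nn.Linear(dim, dim + focal_levels + 1)
        # Hierarchical contextualization: depthwise convolutions with growing
        # kernel sizes capture progressively larger spatial/temporal context.
        self.spatial_convs = nn.ModuleList(
            nn.Conv2d(dim, dim, kernel_size=3 + 2 * l, padding=1 + l, groups=dim)
            for l in range(focal_levels)
        )
        self.temporal_convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3 + 2 * l, padding=1 + l, groups=dim)
            for l in range(focal_levels)
        )
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, time, height, width, channels)
        B, T, H, W, C = x.shape
        q = self.to_query(x)

        # Spatial branch: per-frame hierarchical contextualization with gated aggregation.
        s = self.to_spatial(x)
        s_ctx = s[..., :C].permute(0, 1, 4, 2, 3).reshape(B * T, C, H, W)
        s_gates = s[..., C:]
        s_mod = 0.0
        for l, conv in enumerate(self.spatial_convs):
            s_ctx = F.gelu(conv(s_ctx))
            s_mod = s_mod + s_ctx * s_gates[..., l].reshape(B * T, 1, H, W)
        # Global (frame-level) context, gated separately.
        s_mod = s_mod + s_ctx.mean((2, 3), keepdim=True) * s_gates[..., -1].reshape(B * T, 1, H, W)
        s_mod = s_mod.reshape(B, T, C, H, W).permute(0, 1, 3, 4, 2)

        # Temporal branch: per-location hierarchical contextualization over time.
        t = self.to_temporal(x)
        t_ctx = t[..., :C].permute(0, 2, 3, 4, 1).reshape(B * H * W, C, T)
        t_gates = t[..., C:].permute(0, 2, 3, 4, 1).reshape(B * H * W, -1, T)
        t_mod = 0.0
        for l, conv in enumerate(self.temporal_convs):
            t_ctx = F.gelu(conv(t_ctx))
            t_mod = t_mod + t_ctx * t_gates[:, l:l + 1, :]
        # Global (clip-level) context, gated separately.
        t_mod = t_mod + t_ctx.mean(2, keepdim=True) * t_gates[:, -1:, :]
        t_mod = t_mod.reshape(B, H, W, C, T).permute(0, 4, 1, 2, 3)

        # Fuse the modulators with the query tokens by element-wise multiplication.
        return self.proj_out(q * s_mod + q * t_mod)
```

In a FocalNet-style network, a block of this kind occupies the position that self-attention holds in a standard transformer block, with the usual normalization, MLP, and residual connections around it.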
Design exploration and optimization
To achieve optimal spatiotemporal context modeling, the authors extensively explored various design configurations and compared them to identify the most effective approach. One variation extends spatial focal modulation directly to videos, which proved to be a promising choice. Another design exploration involves factorized encoders, in which spatial and temporal information is processed separately, allowing both short-range and long-range dependencies to be captured.
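As a rough illustration of the factorized-encoder idea, the sketch below first encodes each frame spatially and then applies a separate temporal encoder over the resulting per-frame embeddings. The specific modules, depths, and mean-pooling step are assumptions made for illustration rather than the configuration evaluated by the authors.

```python
import torch.nn as nn


class FactorizedEncoder(nn.Module):
    """Illustrative factorized (spatial-then-temporal) encoder, not the paper's code."""

    def __init__(self, dim, num_heads=8, spatial_depth=2, temporal_depth=2):
        super().__init__()
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True),
            num_layers=spatial_depth,
        )
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True),
            num_layers=temporal_depth,
        )

    def forward(self, x):
        # x: (batch, time, tokens_per_frame, channels)
        B, T, N, C = x.shape
        # Spatial encoder: attend among tokens within each frame independently.
        xs = self.spatial(x.reshape(B * T, N, C))
        # Pool each frame's tokens into one embedding, then model relations across time.
        frame_embeddings = xs.mean(dim=1).reshape(B, T, C)
        return self.temporal(frame_embeddings)  # (batch, time, channels)
```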
Additionally, the authors explored divided space-time attention, which decouples the spatial and temporal branches to independently extract and aggregate spatial and temporal context for each query token. Through experiments on large-scale video recognition datasets such as Kinetics-400, Kinetics-600, and Something-Something-v2, the researchers demonstrated that the proposed spatiotemporal focal modulation design consistently outperforms these alternatives in both accuracy and computational efficiency.
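For comparison, here is a minimal sketch of divided space-time attention, in which temporal self-attention across frames and spatial self-attention within each frame are applied as separate, decoupled steps. It is an illustrative rendering of this general design, not the exact variant tested in the paper.

```python
import torch.nn as nn


class DividedSpaceTimeAttention(nn.Module):
    """Illustrative divided space-time attention block, not the paper's code."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, time, tokens_per_frame, channels)
        B, T, N, C = x.shape

        # Temporal attention: each spatial token attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
        xt = xt + self.temporal_attn(xt, xt, xt, need_weights=False)[0]
        x = xt.reshape(B, N, T, C).permute(0, 2, 1, 3)

        # Spatial attention: tokens within each frame attend to one another.
        xs = x.reshape(B * T, N, C)
        xs = xs + self.spatial_attn(xs, xs, xs, need_weights=False)[0]
        return xs.reshape(B, T, N, C)
```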
Comparison with previous methods
Video-FocalNets are rigorously evaluated against various previous methods, including CNN-based and transformer-based approaches, to assess their effectiveness in video recognition tasks. The results consistently show that Video-FocalNets outperform these methods across multiple video recognition benchmarks. By leveraging the efficient focal modulation technique, Video-FocalNets achieve higher accuracy while reducing computational costs. This optimal balance between effectiveness and efficiency makes Video-FocalNets a compelling solution for video action recognition.
The success of Video-FocalNets is attributed to their ability to capture local and global contexts efficiently. This is accomplished by fusing the spatial and temporal modulators, which encode the surrounding context for each query, with the query tokens. The resulting spatiotemporal feature map effectively captures the rich contextual information necessary for accurate video action recognition.
Potential applications and future directions
The development of Video-FocalNets opens up new possibilities for video recognition applications. Their efficient and accurate modeling of local and global contexts can significantly impact various fields, including surveillance systems, autonomous vehicles, and human-computer interaction. By improving the accuracy and efficiency of video action recognition, Video-FocalNets have the potential to enhance the capabilities of these systems, enabling better decision-making in real-time scenarios.
Looking ahead, further research and development can focus on exploring different architectural variations and evaluating their performance on additional video recognition datasets. Fine-tuning the design of Video-FocalNets may lead to even higher accuracy while maintaining computational efficiency. Additionally, applying transfer learning techniques and adapting Video-FocalNets to specific domains could further enhance their applicability in specialized video recognition tasks.
Conclusion
In conclusion, Video-FocalNets present an efficient and accurate architecture for video action recognition by combining the strengths of CNNs and ViTs. The spatiotemporal focal modulation technique enables precise modeling of local and global contexts, improving accuracy and computational efficiency. Through extensive experiments and comparisons, Video-FocalNets demonstrate superior performance over previous methods.
The development of Video-FocalNets marks a significant advancement in video recognition and opens up new possibilities for applications in various fields. With further research and development, Video-FocalNets have the potential to revolutionize video analysis systems, providing enhanced capabilities for understanding and interpreting visual data.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.