In the film content structure, a shot serves as a fundamental unit, offering insights into a director's vision. In a recent publication in the journal Scientific Reports, researchers explored shot-type classification, advocating for the use of multimodal video inputs to enhance accuracy.
Background
Video classification encompasses various tasks in computer vision, including action recognition, micro-video classification, and video emotion classification, all reliant on spatio-temporal information extraction. Movie analysis faces unique challenges due to lengthy durations, prompting the use of shot or scene segmentation. The current study centers on shot type classification, examining intrinsic attributes such as shot movement and scale.
Related work
In film analysis, previous research has encompassed various aspects, from movie dataset construction to shot type classification. Regarding shot movement classification, traditional methods relied on manually designed features, while deep learning approaches such as the camera motion classification model (RO-TextCNN) and subject-guided network (SGNet) use optical flow information. In shot-scale classification, traditional approaches used low-level texture features, while recent methods incorporated convolutional neural networks and multiple input modalities such as segmentation and saliency maps.
The authors introduce the FullShots dataset, comprising 27K shots from 19 films, annotated with scale and movement labels. They propose the Lightweight Weak Semantic Relevance Network (LWSRNet), emphasizing lightweight, multi-modal input networks for cinematographic shot classification.
LWSRNet Architecture: To handle variable-length shots efficiently, the authors adopt a frame sampling approach inspired by the temporal segment network (TSN). Each shot, consisting of multiple frames, is divided into segments, and one frame is randomly sampled from each segment. Eight segments are used for MovieShots, and 16 for FullShots, whose shots typically have longer durations.
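As a rough illustration of this TSN-style sampling, the sketch below divides a shot into equal segments and draws one random frame index from each. The segment counts (8 or 16) follow the paper, but the helper itself is a hypothetical implementation, not the authors' code.

```python
import random

def sample_frames(num_frames, num_segments):
    """Divide a shot of `num_frames` frames into `num_segments` equal segments
    and randomly pick one frame index from each segment (TSN-style sampling).
    Illustrative sketch; the paper's exact sampling code may differ."""
    indices = []
    seg_len = num_frames / num_segments
    for s in range(num_segments):
        start = int(round(s * seg_len))
        end = max(start + 1, int(round((s + 1) * seg_len)))
        indices.append(random.randrange(start, min(end, num_frames)))
    return indices

# Example: an 80-frame shot sampled with 8 segments (the MovieShots setting)
print(sample_frames(80, 8))   # e.g. [3, 14, 22, 37, 41, 55, 64, 78]
```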
LWSRNet posits that shot attributes such as movement and scale are only weakly correlated with high-level semantic information; they relate more to low-level spatio-temporal features such as texture. To capture these features while keeping parameter count and computational complexity low, the authors employ a shallow 3D convolutional (C3D) network as the backbone, efficiently capturing spatio-temporal features for shot classification. Information supplement strategies are introduced for shot movement and scale classification.
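The following PyTorch sketch shows what a shallow 3D convolutional backbone of this kind might look like; the layer widths, layer count, and pooling choices are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ShallowC3D(nn.Module):
    """A minimal sketch of a shallow 3D-CNN backbone: only a few convolutional
    layers, so mainly low-level spatio-temporal features (texture, motion) are
    captured. Sizes are assumptions, not the paper's exact settings."""
    def __init__(self, in_channels=3, width=64, num_layers=3):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(num_layers):
            layers += [
                nn.Conv3d(c, width, kernel_size=3, padding=1),
                nn.BatchNorm3d(width),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool spatially, keep frames
            ]
            c = width
        self.features = nn.Sequential(*layers)

    def forward(self, x):          # x: (batch, channels, frames, height, width)
        return self.features(x)

# Example: 8 sampled RGB frames at 112x112 resolution
feat = ShallowC3D()(torch.randn(2, 3, 8, 112, 112))
print(feat.shape)  # torch.Size([2, 64, 8, 14, 14])
```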
The Linear Modality Fusion Module (LMF) fuses multiple video modalities, including frames, optical flow maps, segmentation maps, and saliency maps, which are introduced to enhance shot classification. Each modality is processed by a linear 3D convolution layer; the outputs are concatenated, and channel weights are then assigned via adaptive pooling and a squeeze-and-excitation block.
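A minimal sketch of this fusion idea is shown below: one activation-free 3D convolution per modality, channel-wise concatenation, then squeeze-and-excitation style re-weighting. The channel counts and reduction ratio are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LinearModalityFusion(nn.Module):
    """Sketch of linear modality fusion: each modality gets its own linear
    (activation-free) 3D convolution, results are concatenated on the channel
    axis, and a squeeze-and-excitation block re-weights channels."""
    def __init__(self, modality_channels=(3, 2, 1, 1), out_per_modality=16, reduction=4):
        super().__init__()
        self.stems = nn.ModuleList([
            nn.Conv3d(c, out_per_modality, kernel_size=3, padding=1)
            for c in modality_channels
        ])
        fused = out_per_modality * len(modality_channels)
        self.se = nn.Sequential(                       # squeeze-and-excitation
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(fused, fused // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv3d(fused // reduction, fused, 1), nn.Sigmoid(),
        )

    def forward(self, modalities):                     # list of (B, C_i, T, H, W)
        x = torch.cat([stem(m) for stem, m in zip(self.stems, modalities)], dim=1)
        return x * self.se(x)                          # channel-weighted fusion

# Example: frames (3 ch), optical flow (2 ch), segmentation (1 ch), saliency (1 ch)
inputs = [torch.randn(2, c, 8, 112, 112) for c in (3, 2, 1, 1)]
print(LinearModalityFusion()(inputs).shape)  # torch.Size([2, 64, 8, 112, 112])
```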
The Weak Semantic Feature Extraction Module (WSFE) utilizes a shallow 3D-CNN backbone for feature extraction. The movement branch enhances movement classification by adding an extra non-linear 3D convolution layer. The scale branch is introduced for scale classification and utilizes texture information from the original frames.
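The sketch below illustrates how such a two-branch head could sit on top of backbone features: an extra non-linear 3D convolution for the movement branch, and a crude raw-frame texture summary concatenated into the scale branch. The dimensions and the way the texture features are formed are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ShotHeads(nn.Module):
    """Hypothetical two-branch classification head: the movement branch adds
    one extra non-linear 3D convolution before pooling; the scale branch also
    pools a simple texture summary taken from the original frames."""
    def __init__(self, feat_ch=64, num_movement=4, num_scale=5):
        super().__init__()
        self.movement_conv = nn.Sequential(            # extra non-linear conv
            nn.Conv3d(feat_ch, feat_ch, 3, padding=1),
            nn.BatchNorm3d(feat_ch), nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.movement_fc = nn.Linear(feat_ch, num_movement)
        self.scale_fc = nn.Linear(feat_ch + 3, num_scale)   # + raw-frame stats

    def forward(self, feats, frames):   # feats: (B,C,T,H,W), frames: (B,3,T,H,W)
        move = self.pool(self.movement_conv(feats)).flatten(1)
        texture = self.pool(frames).flatten(1)         # crude raw-frame summary
        scale = torch.cat([self.pool(feats).flatten(1), texture], dim=1)
        return self.movement_fc(move), self.scale_fc(scale)

# Example with MovieShots-style label counts (4 movements, 5 scales)
logits_m, logits_s = ShotHeads()(torch.randn(2, 64, 8, 14, 14),
                                 torch.randn(2, 3, 8, 112, 112))
print(logits_m.shape, logits_s.shape)  # torch.Size([2, 4]) torch.Size([2, 5])
```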
The loss functions include cross-entropy loss for scale classification and focal loss for movement classification, addressing imbalances in the dataset. The proposed architecture aims to effectively classify shot types with improved efficiency and performance.
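A compact sketch of this loss setup, using the standard multi-class focal-loss formulation (the focusing parameter gamma=2.0 is a common default, not necessarily the paper's value):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss: down-weights easy (majority-class) samples,
    which is why it suits the imbalanced movement labels."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                      # probability of the true class
    return ((1.0 - pt) ** gamma * ce).mean()

# Scale branch: plain cross-entropy; movement branch: focal loss
scale_logits, movement_logits = torch.randn(4, 5), torch.randn(4, 4)
scale_y, movement_y = torch.randint(0, 5, (4,)), torch.randint(0, 4, (4,))
loss = F.cross_entropy(scale_logits, scale_y) + focal_loss(movement_logits, movement_y)
print(loss.item())
```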
Dataset construction
MovieShots is an existing benchmark dataset for shot type classification, containing 46K shot clips from 7,858 movie trailers annotated with five scale categories and four movement categories. Recognizing that shots in full movies are more diverse than the subject-centric shots of MovieShots, the authors introduce the FullShots dataset, comprising 27K shots extracted from 19 complete movies, each uniformly annotated with shot scale and movement labels. For shot categorization, eight shot movement types and six scale types are defined.
The FullShots dataset construction process involves using the PySceneDetect library to generate approximately 32K candidate shot samples from the 19 movies, followed by manual removal of ineffective shots and re-segmentation where needed. Trained personnel conducted two rounds of annotation, with a group leader making the final determinations. Comparisons are drawn between FullShots and other movie shot classification datasets regarding shot samples, source videos, and shot duration distributions.
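For reference, shot segmentation with PySceneDetect can be done along the following lines (v0.6 API); the detector choice and threshold are assumptions, and the paper's manual cleaning and re-segmentation steps are not reproduced here.

```python
# Minimal sketch of shot segmentation with PySceneDetect; "movie.mp4" is a
# hypothetical input file, and the threshold is the library default, not
# necessarily the value used by the authors.
from scenedetect import detect, ContentDetector, split_video_ffmpeg

video_path = "movie.mp4"
shots = detect(video_path, ContentDetector(threshold=27.0))
for i, (start, end) in enumerate(shots):
    print(f"shot {i}: {start.get_timecode()} -> {end.get_timecode()}")
split_video_ffmpeg(video_path, shots)   # writes one clip per detected shot
```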
Experiments and analysis
The model LWSRNet was evaluated on both the MovieShots and FullShots datasets. A sparse temporal sampling strategy is employed for each shot clip, with eight frames for MovieShots and 16 frames for FullShots. The results on both datasets reveal that traditional methods perform poorly compared to deep learning methods, indicating the inadequacy of hand-designed features for shot classification. Among deep learning methods, I3D-ResNet50 (img) excels in movement accuracy (AccM) but lags in scale accuracy (AccS), suggesting the effectiveness of 3D-CNNs in learning temporal features. SGNet (img+flow) improves AccS and AccM compared to TSN-ResNet50 (img+flow). LWSRNet achieves significant improvements in AccM and a slight boost in AccS, validating its effectiveness. A parameter analysis reveals that LWSRNet is significantly more efficient than SGNet, with 48 percent fewer parameters and 55 percent fewer GFLOPs while achieving better results.
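Parameter comparisons of this kind are straightforward to reproduce for any PyTorch model; the snippet below is a generic illustration (GFLOPs additionally require a profiler such as thop or ptflops) and does not use the authors' code.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Count trainable parameters of a model; used to compare model sizes."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example on a single 3D convolution layer: 3*64*27 weights + 64 biases = 5248
print(count_parameters(nn.Conv3d(3, 64, kernel_size=3)))
```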
Ablation studies assess various aspects of the model. Different backbone depths and 3D-CNN backbones are analyzed, with a 3-layer C3D performing well, making C3D the preferred backbone. Multi-modal input analysis shows improvements when optical flow is added as an input modality, and the LMF module effectively allocates weights to the modalities. For scale classification, using segmentation and saliency maps as additional inputs improves performance, though on FullShots saliency alone achieves higher accuracy. Finally, the movement and scale branches significantly improve model performance, highlighting their importance.
Conclusion
In summary, the authors introduce FullShots, a comprehensive shot dataset that expands beyond the MovieShots benchmark. They also present the LWSRNet model for cinematographic shot classification. Experimental results show LWSRNet's superior performance on FullShots and MovieShots with fewer parameters and computations. This work significantly advances cinematography analysis by enhancing shot classification accuracy and providing a valuable dataset for future research.
Journal reference:
Li, Y., Lu, T., & Tian, F. (2023). A lightweight weak semantic framework for cinematographic shot classification. Scientific Reports, 13(1), 16089. DOI: https://doi.org/10.1038/s41598-023-43281-w, https://www.nature.com/articles/s41598-023-43281-w