Videos that elicit surprise, such as humorous clips, innovative performances, or optical illusions, tend to attract considerable interest. Appreciating these videos is not based solely on the visual stimuli they present; it also relies on viewers' cognitive ability to recognize and enjoy the violations of common sense they depict.
Yet despite the notable progress of contemporary computer vision models, a lingering question persists: to what extent can video models comprehend the humor and creativity exhibited in unexpected videos? Previous research has concentrated primarily on improving model performance in video question answering (VideoQA) using the typical, largely unsurprising videos found in existing VideoQA datasets.
A recent study submitted to the arXiv* preprint server presents FunQA, a novel video question-answering (QA) dataset that aims to evaluate and improve video reasoning by using counter-intuitive and entertaining videos.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
What is video comprehension?
The allure of unexpected videos, whether humorous, innovative, or built around optical illusions, delights viewers and captures their attention. Such media evokes positive astonishment, an emotion that arises not merely from observing superficial visual stimuli but from humans' inherent capacity to comprehend and derive pleasure from unforeseen and counter-intuitive occurrences. Bridging this gap requires accurate video comprehension, so that computer vision models can be assessed on their ability to detect and interpret moments in video footage that contravene common sense.
The visual question-answering (VQA) task primarily aims to improve models' capacity for understanding images. In contrast, video question answering (VideoQA) emphasizes comprehending videos. As such, VideoQA poses more significant challenges compared to VQA due to the need for a thorough comprehension of visual content, the utilization of temporal and spatial information, and the exploration of relationships between recognized objects and activities.
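To make this contrast concrete, the minimal sketch below shows how a single-image VQA sample differs from a VideoQA sample, which must carry an ordered sequence of frames and questions that span time. The structures and field names are illustrative assumptions, not the schema of any particular benchmark.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sample structures for illustration only; the field names are
# assumptions, not taken from any specific benchmark.

@dataclass
class VQASample:
    image_path: str         # a single still image
    question: str           # e.g., "What colour is the car?"
    answer: str

@dataclass
class VideoQASample:
    frame_paths: List[str]  # an ordered sequence of frames (temporal dimension)
    question: str           # e.g., "Why does the crowd laugh at the end?"
    answer: str             # free-text answer that may span events over time
```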
What does this study involve?
Previous research has investigated models such as long short-term memory (LSTM) networks and graph-based neural networks to capture cross-modal information in videos. The introduction of transformers facilitated a new generation of video understanding models, initially designed to comprehend individual frames and later extended to incorporate both temporal and spatial information. Nevertheless, these methods have predominantly been applied to shorter videos. In addition, recent studies on vision-language models (VLMs) have demonstrated impressive abilities in comprehending videos.
While most contemporary computer vision benchmarks emphasize comprehension of commonsense content, there is growing interest in exploring counter-intuitiveness. Some studies also challenge existing models by examining their ability to understand intricate multimodal humor depicted in comic strips.
In this study, the authors present FunQA, an extensive and reliable VideoQA dataset consisting of 4.3K captivating videos and 312K free-text QA pairs meticulously annotated by human experts. The dataset comprises three subsets: HumorQA, CreativeQA, and MagicQA. Each subset draws on different sources and video content, yet all share a counter-intuitive quality, such as the unforeseen juxtapositions in comedic videos, the captivating disguises in creative videos, and the seemingly implausible feats in magic videos.
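A dataset organized this way lends itself to simple per-subset handling. The sketch below is a hypothetical loader for FunQA-style annotations; the file name and the field names ("subset", "video_id", "question", "answer") are assumptions made for illustration, and the official FunQA release should be consulted for the actual schema.

```python
import json
from collections import defaultdict

def load_funqa_annotations(path: str) -> dict:
    """Group free-text QA records by subset (e.g., HumorQA, CreativeQA, MagicQA).

    The expected file layout (a JSON list of records with "subset", "video_id",
    "question", and "answer" keys) is an assumption for illustration only.
    """
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    by_subset = defaultdict(list)
    for rec in records:
        by_subset[rec["subset"]].append((rec["video_id"], rec["question"], rec["answer"]))
    return dict(by_subset)

# Example usage (assumes a local annotation file exists):
# annotations = load_funqa_annotations("funqa_annotations.json")
# print({subset: len(pairs) for subset, pairs in annotations.items()})
```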
Major findings
Based on the researchers’ experimental findings, analyzing these unexpected videos requires a form of reasoning distinct from that used for conventional videos, as evidenced by the inadequate performance of existing VideoQA techniques on the FunQA dataset. The primary objective of FunQA is to establish a benchmark for the widely recognized, significant, and intricate category of counter-intuitive or surprising videos.
The significant contributions of this work include:
- The authors have developed several innovative tasks that enable the model to investigate previously unexplored challenges, such as timestamp localization and reasoning related to counter-intuitiveness.
- The researchers have conducted a thorough assessment of state-of-the-art models on this benchmark, providing the field with valuable insights and guiding future research endeavors.
- The authors have identified that conventional evaluation measures produce minimal scores when applied to free-text answers, as they only assess surface-level textual similarity over short spans (the toy sketch after this list illustrates the effect).
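To see why surface-overlap metrics struggle here, the toy sketch below (not the paper's evaluation code) scores a semantically reasonable free-text answer against a reference using a simple unigram F1; because the two share few exact words, the score comes out low even though the answer is sensible.

```python
# Toy illustration of why word-overlap metrics underrate free-text answers:
# a reasonable answer worded differently shares few exact tokens with the reference.

def unigram_f1(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    overlap = sum(min(ref.count(w), hyp.count(w)) for w in set(hyp))
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "The clip is funny because the cat suddenly slips off the table while trying to look dignified"
hypothesis = "Humour comes from the unexpected fall of the pet which undercuts its composed posture"

print(round(unigram_f1(reference, hypothesis), 3))  # prints a low score despite a sensible answer
```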
Conclusion
Surprising videos, i.e., those that are funny, innovative, or full of visual illusions, entertain and grab attention. Such media produces pleasant surprise, a fascinating experience arising from humans' intrinsic ability to understand and enjoy unexpected and counterintuitive occurrences. FunQA, an innovative video question-answering dataset, uses videos that defy common sense to test and enhance video reasoning. The study found that caption-based models, which prioritize captioning tasks, tend to generate descriptions for the entire video even when asked to localize specific timestamps. The researchers have performed a comprehensive and rigorous evaluation of the most advanced models on this benchmark, offering valuable insights to the academic community and guidance for future research efforts.
Journal reference:
- Preliminary scientific report.
Xie, B., Zhang, S., Zhou, Z., Li, B., Zhang, Y., Hessel, J., Yang, J., & Liu, Z. (n.d.). FunQA: Towards Surprising Video Comprehension. Retrieved July 3, 2023, from https://arxiv.org/abs/2306.14899