FunQA - Elevating Video Understanding to New Heights Using Question-Answering Datasets

Videos that elicit surprise, such as humorous clips, innovative performances, or optical illusions, tend to attract considerable interest. Appreciating these videos depends not only on the visual stimuli they present but also on the viewer's cognitive ability to recognize and value the violations of common sense they portray.

Despite the notable progress of contemporary computer vision models, a question persists: to what extent can video models comprehend the humor and creativity exhibited in unexpected videos? Previous research has concentrated primarily on improving the performance of computer vision models in video question answering (VideoQA), focusing on the typical, largely unsurprising videos found in current VideoQA datasets.

A recent study submitted to the arXiv* preprint server presents FunQA, a novel video question-answering (QA) dataset that aims to evaluate and improve video reasoning by using counter-intuitive and entertaining videos.

Study: FunQA - Elevating Video Understanding to New Heights Using Question-Answering Datasets. Image credit: Metamorworks / Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

What is video comprehension?

Unexpected videos, whether humorous, innovative, or built around optical illusions, delight viewers and hold their attention. Such media evokes positive astonishment, an emotion that arises not merely from observing surface-level visual stimuli but from the human capacity to comprehend and enjoy unforeseen, counter-intuitive occurrences. Deep video comprehension is required to bridge the gap between this human ability and current models, and to assess whether computer vision systems can detect and understand visual events that contravene common sense in video footage.

The visual question-answering (VQA) task primarily targets image understanding, whereas video question answering (VideoQA) centers on comprehending videos. VideoQA is therefore more challenging than VQA: it demands thorough comprehension of the visual content, the use of both temporal and spatial information, and reasoning about relationships among the recognized objects and activities.
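To make the distinction concrete, the sketch below shows, in rough terms, what a VideoQA example involves compared with single-image VQA: a clip must first be reduced to an ordered set of frames so that temporal information is preserved. This is purely illustrative and is not the authors' pipeline; the file path, frame count, question, and answer text are hypothetical placeholders.

```python
# Minimal illustrative sketch (not the FunQA authors' code): it shows why VideoQA
# needs temporal context that single-image VQA does not. The video path, frame
# count, and QA text below are hypothetical.
import cv2  # pip install opencv-python


def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample frames so a model can reason over the whole clip."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames


# A VQA example pairs ONE image with a question; a VideoQA example pairs a
# SEQUENCE of frames with a question, so the answer may depend on what changes
# over time rather than on any single frame.
videoqa_example = {
    "frames": sample_frames("humor_clip_0001.mp4", num_frames=8),  # hypothetical file
    "question": "Why is the ending of this clip funny?",
    "answer": "The outcome contradicts what the earlier frames set up.",
}
```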

What does this study involve?

Previous research has investigated models such as long short-term memory (LSTM) networks and graph-based neural networks to capture cross-modal information in videos. The introduction of transformers enabled a new generation of video understanding models. Early versions focused on individual frames within a video, and later iterations expanded to incorporate both temporal and spatial information. Nevertheless, these methods have predominantly been applied to short videos. More recently, vision language models (VLMs) have demonstrated impressive abilities in comprehending videos.

While most contemporary computer vision benchmarks emphasize comprehension of commonsense content, interest in counter-intuitive content is growing. Some studies also challenge existing models by testing their ability to understand the intricate multimodal humor depicted in comic strips.

In this study, the authors present FunQA, an extensive and rigorously curated VideoQA dataset comprising 4.3K captivating videos and 312K free-text QA pairs, meticulously annotated by human experts. The dataset consists of three subsets: HumorQA, CreativeQA, and MagicQA. Each subset draws on distinct sources and video content, yet all share a focus on surprise, such as the unforeseen juxtapositions in comedic videos, the captivating disguises in creative videos, and the seemingly impossible feats in magic videos.
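For readers who want a feel for how such a dataset might be handled programmatically, the sketch below loads and counts QA annotations for the three subsets. The file names and JSON schema used here are assumptions made for illustration only; the actual FunQA release may organize its annotations differently.

```python
# Illustrative sketch only: the directory layout, file names, and record fields
# below are assumed, not the dataset's published schema. It shows how the three
# FunQA subsets (HumorQA, CreativeQA, MagicQA) could be iterated uniformly once
# their annotation files are available locally.
import json
from pathlib import Path

SUBSETS = ["HumorQA", "CreativeQA", "MagicQA"]


def load_subset(annotation_dir: str, subset: str):
    """Load free-text QA annotations for one FunQA subset (hypothetical layout)."""
    path = Path(annotation_dir) / f"{subset.lower()}_annotations.json"  # assumed filename
    with path.open(encoding="utf-8") as f:
        return json.load(f)  # assumed: a list of {"video", "question", "answer"} records


if __name__ == "__main__":
    for subset in SUBSETS:
        try:
            records = load_subset("funqa_annotations", subset)
            print(f"{subset}: {len(records)} QA pairs")
        except FileNotFoundError:
            print(f"{subset}: annotation file not found (paths here are placeholders)")
```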

Major findings

The researchers' experiments show that analyzing these unexpected videos demands a distinct form of reasoning from that required for conventional videos, as evidenced by the poor performance of existing VideoQA techniques on FunQA. The primary objective of FunQA is to establish a benchmark for the widely recognized, significant, and intricate category of counter-intuitive or surprising videos.

The significant contributions of this work include:

  • The authors have developed several innovative tasks that enable the model to investigate previously unexplored challenges, such as timestamp localization and reasoning related to counter-intuitiveness.
  • The researchers have conducted a thorough and comprehensive assessment of state-of-the-art models on the new benchmark, providing the field with valuable insights and guiding future research endeavors.
  • The authors found that conventional evaluation metrics yield very low scores on free-text answers because they reward only short-span textual similarity; the sketch after this list illustrates the effect.
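The following sketch illustrates that last point with BLEU, a standard n-gram overlap metric. It is not the paper's evaluation code, and the reference and candidate answers are invented examples: a long free-text answer that paraphrases the reference faithfully still shares few exact n-grams with it, so the score comes out low even though the meaning matches.

```python
# Minimal sketch (not the paper's evaluation code) of why n-gram overlap metrics
# such as BLEU undervalue long free-text answers: a faithful paraphrase shares
# few exact n-grams with the reference. Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = (
    "The clip is funny because the dog confidently jumps for the ball "
    "but misses it completely and lands in the pool."
)
candidate = (
    "It is humorous since the dog leaps at the ball with great confidence, "
    "fails to catch it, and falls into the water."
)

smooth = SmoothingFunction().method1  # avoid zero scores when some n-gram orders are absent
score = sentence_bleu(
    [reference.lower().split()],  # list of tokenized reference answers
    candidate.lower().split(),    # tokenized model answer
    smoothing_function=smooth,
)
print(f"BLEU: {score:.3f}")  # typically very low despite the matching meaning
```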

Conclusion

Surprising videos, whether funny, innovative, or full of visual illusions, entertain viewers and hold their attention. Such media produces pleasant surprise, an experience rooted in humans' intrinsic ability to understand and enjoy unexpected, counter-intuitive occurrences. FunQA, an innovative video question-answering dataset, uses videos that defy common sense to test and enhance video reasoning. The evaluation showed that caption-oriented models tend to describe the entire video even when asked to localize specific timestamps. The researchers have performed a comprehensive, rigorous evaluation of state-of-the-art models on the new benchmark, offering valuable insights to the academic community and guidance for future research.


Journal reference:
  • Preliminary scientific report. Xie, B., Zhang, S., Zhou, Z., Li, B., Zhang, Y., Hessel, J., Yang, J., & Liu, Z. (2023). FunQA: Towards Surprising Video Comprehension. arXiv. Retrieved July 3, 2023, from https://arxiv.org/abs/2306.14899

Written by

Dr. Sampath Lonka

Dr. Sampath Lonka is a scientific writer based in Bangalore, India, with a strong academic background in Mathematics and extensive experience in content writing. He has a Ph.D. in Mathematics from the University of Hyderabad and is deeply passionate about teaching, writing, and research. Sampath enjoys teaching Mathematics, Statistics, and AI to both undergraduate and postgraduate students. What sets him apart is his unique approach to teaching Mathematics through programming, making the subject more engaging and practical for students.
