In an article recently submitted to the arXiv* preprint server, researchers proposed a novel method for automatic image classification, or image tagging, in crisis response scenarios. The method was based on a transformer architecture, specifically a vision transformer (ViT) variant called CrisisViT.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Leveraging the Incidents1M crisis image dataset, the model outperformed previous methods in disaster type, image relevance, humanitarian category, and damage severity classification. Smartphones and social media enable citizens to contribute valuable visual information during emergencies, and the proposed CrisisViT model offered crisis responders an efficient way to analyze and categorize such images rapidly, aiding timely decision-making.
Background
The increasing frequency of crisis events underscores the need for effective crisis response strategies. Social media platforms have become vital sources of information during crises, as users share relevant content that can aid emergency responders. While previous studies have emphasized the significance of social media in crisis information acquisition, the sheer volume of posts necessitates automated tools that can extract actionable information promptly. Machine learning approaches have traditionally focused on analyzing textual content from social media during crises; however, there is growing recognition of the value of crisis-related images in providing essential information, aiding resource allocation, and assessing damage severity.
Existing solutions often rely on deep convolutional neural networks (CNNs), pre-trained on non-crisis image datasets, for image classification. However, the effectiveness of such models in accurately categorizing crisis imagery, particularly by disaster type, informativeness, humanitarian category, and damage severity, raises concerns. This paper addressed these limitations by introducing CrisisViT. Rather than relying on conventional CNNs, the authors explored pre-training models on crisis imagery from the Incidents1M dataset, emphasizing in-domain learning for improved performance. The study compared CrisisViT models with established deep CNNs and ViT models on the Crisis Image Benchmark dataset, demonstrating significant accuracy improvements.
Methodology and Experimental Setup
The researchers explored the efficacy of pre-training a state-of-the-art transformer-based image classification model, ViT, on a large-scale crisis image dataset, Incidents1M. A new variant, CrisisViT, was proposed to enhance performance and robustness across various crisis image classification tasks. Two primary decisions governed the model's construction: the choice of pre-training dataset and the pre-training methodology. The pre-training datasets considered were ImageNet-1k (representing general image classification) and Incidents1M (specialized in-domain crisis imagery).
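To illustrate the dataset choice, the sketch below (PyTorch with the timm library) contrasts the two starting points. It is purely illustrative: the Incidents1M checkpoint filename is a hypothetical placeholder, not an artifact released with the paper.

```python
from pathlib import Path

import timm
import torch

# Starting point 1: general-domain weights -- ViT-Base pre-trained on ImageNet-1k.
# num_classes=0 strips the classifier head, exposing the backbone features.
vit_general = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)

# Starting point 2: in-domain weights -- the same architecture, with parameters
# obtained by pre-training on Incidents1M crisis imagery instead.
vit_crisis = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)
ckpt = Path("crisisvit_incidents1m.pt")  # hypothetical in-domain checkpoint
if ckpt.exists():
    vit_crisis.load_state_dict(torch.load(ckpt, map_location="cpu"), strict=False)
```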
Three pre-training strategies were employed: ImageNet-1k + Incidents1M, Incidents1M only, and self-supervised training. The Incidents1M dataset encompasses 43 incident categories and 49 place categories in total. Various pre-training tasks were explored, including binary classification, incident-only or place-only classification, dual (combined incident and place) classification, and self-supervised training. CrisisViT employed the ViT-Base model architecture with different hyperparameters for each pre-training strategy. The experimental setup involved evaluating the model on the Crisis Image Benchmark dataset, which covers disaster type, informativeness, humanitarian category, and damage severity classification tasks.
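To make the dual task concrete, the sketch below attaches separate incident and place classification heads to a shared ViT-Base backbone and sums the two cross-entropy losses. This is a minimal reconstruction from the description above, not the authors' code; the equal loss weighting and the ImageNet-initialized backbone are assumptions.

```python
import timm
import torch
import torch.nn as nn

class DualHeadCrisisViT(nn.Module):
    """Sketch of the dual pre-training task: a shared ViT-Base backbone with
    separate linear heads for the 43 incident and 49 place categories."""

    def __init__(self, num_incidents: int = 43, num_places: int = 49):
        super().__init__()
        # num_classes=0 removes timm's default head, exposing pooled features.
        self.backbone = timm.create_model(
            "vit_base_patch16_224", pretrained=True, num_classes=0
        )
        dim = self.backbone.num_features  # 768 for ViT-Base
        self.incident_head = nn.Linear(dim, num_incidents)
        self.place_head = nn.Linear(dim, num_places)

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)
        return self.incident_head(feats), self.place_head(feats)

# One illustrative training step: cross-entropy on both label sets, summed.
model = DualHeadCrisisViT()
criterion = nn.CrossEntropyLoss()
images = torch.randn(4, 3, 224, 224)          # stand-in for Incidents1M images
incident_labels = torch.randint(0, 43, (4,))
place_labels = torch.randint(0, 49, (4,))
incident_logits, place_logits = model(images)
loss = criterion(incident_logits, incident_labels) + criterion(place_logits, place_labels)
loss.backward()
```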
The comparison included popular baselines: ResNet101, EfficientNet (b1), VGG16, and ViT-Base. Classification accuracy served as the evaluation metric, with each experiment conducted at least three times and the results averaged. The authors aimed to determine how much a large-scale crisis image dataset improved transformer-based models' performance in crisis content categorization, providing insights into best practices during training. The training setup used the Adam optimizer, batch sizes of 1024 for self-supervised and 128 for supervised learning, and the rectified linear unit (ReLU) activation function. Additional experiments explored different batch sizes and numbers of pre-training epochs on the Incidents1M dataset.
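A minimal sketch of the supervised fine-tuning step follows. The Adam optimizer matches the reported setup; the learning rate, the small random stand-in batch (sized 8 here rather than the reported 128, to keep the example light), and the three-level damage-severity head are illustrative assumptions.

```python
import timm
import torch

# Supervised fine-tuning sketch: ViT-Base with a task-specific head, trained
# with Adam. A three-class head is assumed here for the damage-severity task.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is an assumption
loss_fn = torch.nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)   # stand-in batch (reported size: 128)
labels = torch.randint(0, 3, (8,))     # stand-in damage-severity labels
optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()

# Per the evaluation protocol above, each experiment would be repeated at
# least three times (e.g., with different seeds) and the accuracies averaged.
```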
Study Results
The experimental results addressed three research questions regarding the impact of pre-training on the CrisisViT model using the Incidents1M crisis image dataset.
- ViT vs. Convolutional Neural Baselines: The transformer-based ViT architecture outperformed the CNN baselines (ResNet101, EfficientNet (b1), VGG16) across crisis image classification tasks, demonstrating superior accuracy particularly in disaster type classification, humanitarian category classification, and damage severity estimation.
- Pre-training using Incident Types and Place Categories: Pre-training CrisisViT on Incidents1M yielded improved performance over ImageNet-1k pre-training. Place category labels led to the best results, outperforming incident labels, while combining incident and place labels did not significantly enhance performance. The researchers concluded that pre-training on an in-domain dataset can yield performance gains, but that the pre-training dataset should be chosen selectively.
- ImageNet-1k + Incidents1M: Augmenting ImageNet-1k pre-training with Incidents1M did not consistently improve performance. While models pre-trained on incident or incident+place labels showed a small performance uplift, it remained unclear whether starting from a pre-trained ViT-Base model was superior to training a new model from scratch.
Conclusion
In conclusion, the researchers introduced CrisisViT, a transformer-based image classifier pre-trained on the Incidents1M crisis dataset for improved classification of crisis images shared on social media. Experiments on the disaster type, informativeness, humanitarian category, and damage severity tasks showed significant accuracy gains, averaging 1.25%. The findings highlighted the potential of transformer-based models and the Incidents1M dataset for enhancing crisis response tools that leverage social media imagery.