In an article published in the journal Nature, researchers addressed the challenge of monitoring and issuing warnings for crowd density in highly aggregated tourist crowds (HATCs) in popular tourist destinations.
They proposed a VGGT-Count network model to estimate HATC density, aiming to improve counting accuracy and enable real-time monitoring. The model's effectiveness was demonstrated through experiments on the ShanghaiTech Part A, ShanghaiTech Part B, and UCF-QNRF datasets, offering timely information for implementing crowd control and safety measures in tourist areas.
Background
The era of mass tourism has witnessed a surge in people's travel demands, particularly during holidays and at popular scenic spots, leading to HATCs and increased security risks. Incidents like stampedes and overcrowding have underscored the urgent need for effective monitoring and early warning systems in tourist destinations. While previous research has examined crowd behavior and safety in various contexts, there remains a scarcity of comprehensive investigations into real-time monitoring and early warning systems specifically tailored for HATCs.
Existing methods for assessing crowd density suffer from limited accuracy, especially in large-scale and chaotic crowd movements. Computer vision-based approaches have shown promise, but their application in real-time monitoring of HATCs has been underexplored. This study aimed to bridge this gap by proposing a VGGT-Count network model to estimate HATC density in tourist destinations. The model integrated the VGG-19 network for feature extraction with a transformer encoder equipped with multi-head attention mechanisms to predict crowd density maps.
By focusing on HATCs, typically comprising more than 50 tourists per square meter, this research addressed the specific challenges associated with crowd safety in tourist areas. The proposed model enabled real-time assessment of crowd density, allowing for timely warnings and implementation of crowd control measures. This filled a critical gap in tourism safety research by providing a more accurate and effective method for managing HATCs and minimizing security risks in tourist destinations.
Additionally, by leveraging advancements in computer vision and artificial intelligence, this study extended insights into crowd counting within the domain of tourism safety, offering valuable decision support for emergency management agencies and enhancing overall tourism safety guarantees.
Methods
The proposed framework, VGGT-Count, aimed to estimate crowd density in HATCs by combining VGG-19 as the backbone architecture with transformer-based encoding. First, VGG-19 extracted features from crowd images and passed them to a transformer encoder equipped with multi-head attention to capture features across different scales. A regression decoder then predicted the final density map. Local attention regularization and instance attention loss were applied to improve model training and ensure effective self-attention.
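To make the idea of a density map concrete, here is a minimal sketch of how such a map relates to a head count: summing the map over all pixels gives the estimated number of people. The function names, the toy image size, and all values are illustrative assumptions, not taken from the paper.

```python
from typing import List, Tuple

Grid = List[List[float]]


def dot_map(heads: List[Tuple[int, int]], height: int, width: int) -> Grid:
    """Build a dot-annotation map: each annotated head adds 1.0 to its pixel."""
    grid = [[0.0] * width for _ in range(height)]
    for row, col in heads:
        grid[row][col] += 1.0
    return grid


def count_from_density(density: Grid) -> float:
    """The crowd count is the sum of the density map over all pixels."""
    return sum(sum(row) for row in density)


# Hypothetical (row, col) head annotations for a 4x6 image.
heads = [(0, 1), (2, 3), (2, 4), (3, 0)]
ground_truth = dot_map(heads, height=4, width=6)

# A made-up predicted map: fractional values spread around the heads.
predicted = [
    [0.1, 0.8, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.2, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.9, 0.7, 0.0],
    [0.9, 0.0, 0.0, 0.0, 0.1, 0.0],
]

print(count_from_density(ground_truth))
print(round(count_from_density(predicted), 6))
```

The predicted map does not need to place mass on the exact annotated pixels for its total to match the true count, which is why density-based methods tolerate small localization errors.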
The transformer encoder comprised four identical layers, each consisting of two sub-layers: a multi-head self-attention mechanism and a feed-forward network. The self-attention layer captured global relations, while the feed-forward network processed features independently at each position. Multi-head attention allowed for parallel computation of associative relationships within different receptive windows, facilitating cross-scale interaction and fusion.
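The following is a minimal sketch of standard multi-head scaled dot-product self-attention, written in plain Python for readability. It is not the paper's attention module: it assumes identity projections in place of learned query, key, value, and output matrices, it attends over all tokens rather than over receptive windows, and the token values are made up.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(tokens: List[Vector]) -> List[Vector]:
    """Scaled dot-product weights; queries and keys are the tokens themselves."""
    scale = math.sqrt(len(tokens[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in tokens])
        for q in tokens
    ]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Each output token is a weighted average of all input tokens."""
    dim = len(tokens[0])
    return [
        [sum(w * v[c] for w, v in zip(row, tokens)) for c in range(dim)]
        for row in attention_weights(tokens)
    ]


def multi_head_self_attention(tokens: List[Vector], num_heads: int) -> List[Vector]:
    """Split the feature dimension into heads, attend per head, then concatenate."""
    dim = len(tokens[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    outputs: List[Vector] = [[] for _ in tokens]
    for h in range(num_heads):
        start, stop = h * head_dim, (h + 1) * head_dim
        head_out = self_attention([t[start:stop] for t in tokens])
        for out, piece in zip(outputs, head_out):
            out.extend(piece)
    return outputs


# Three hypothetical 4-dimensional feature tokens, processed with two heads.
tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
attended = multi_head_self_attention(tokens, num_heads=2)
```

Because each head works on its own slice of the feature vector, the heads can attend to different relationships in parallel before their outputs are concatenated.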
The feed-forward network, with rectified linear unit (ReLU) activation, performed linear transformations independently at each position. Three loss functions were used: counting loss, optimal transport (OT) loss, and total variation (TV) loss. The counting loss minimized the difference between the predicted and actual number of individuals in the crowd. The OT loss reduced the distribution gap between ground truth and predicted density maps by transforming unnormalized density functions into probability density functions.
Additionally, the TV loss complemented the OT loss by stabilizing the approximation in sparse crowd areas. The overall loss was a weighted combination of the counting, OT, and TV losses, with tunable hyper-parameters. In summary, the VGGT-Count framework integrated VGG-19 with transformer-based encoding for crowd density estimation in HATCs. It leveraged multi-head attention and feed-forward networks within the transformer encoder while employing multiple loss functions to optimize model performance and enhance density map accuracy.
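As a rough illustration of how the three terms could be combined, the sketch below computes the counting loss and the TV loss on flattened density maps and adds them to an OT loss value. The assumptions are stated plainly: the OT term is passed in as a precomputed number (computing it requires an optimal transport solver, omitted here), the combination is a plain weighted sum, and the weights `lambda_ot` and `lambda_tv` are placeholder values, not the paper's settings.

```python
from typing import Sequence


def counting_loss(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Absolute difference between the predicted and true total counts."""
    return abs(sum(pred) - sum(truth))


def tv_loss(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Total variation distance between the two maps after normalizing each to sum to 1."""
    pred_total, truth_total = sum(pred), sum(truth)
    if pred_total <= 0 or truth_total <= 0:
        raise ValueError("both maps need a positive total to be normalized")
    return 0.5 * sum(
        abs(p / pred_total - t / truth_total) for p, t in zip(pred, truth)
    )


def total_loss(
    pred: Sequence[float],
    truth: Sequence[float],
    ot_loss: float,          # assumed to be computed elsewhere by an OT solver
    lambda_ot: float = 0.1,  # placeholder weight, not from the paper
    lambda_tv: float = 0.01, # placeholder weight, not from the paper
) -> float:
    """Weighted sum of the counting, OT, and TV terms."""
    return (
        counting_loss(pred, truth)
        + lambda_ot * ot_loss
        + lambda_tv * tv_loss(pred, truth)
    )


# Hypothetical flattened 2x2 maps: same total count, different spatial layout.
pred_map = [0.5, 0.5, 1.0, 0.0]
true_map = [1.0, 0.0, 1.0, 0.0]
loss = total_loss(pred_map, true_map, ot_loss=2.0)
```

In this toy case the counting loss is zero because both maps sum to 2, yet the TV and OT terms remain positive, which shows why distribution-level losses are needed alongside the count.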
Experiment
The proposed VGGT-Count framework was implemented and evaluated for crowd counting on three datasets: ShanghaiTech Part A, ShanghaiTech Part B, and UCF-QNRF. A pre-trained VGG-19 convolutional neural network (CNN) served as the backbone architecture for feature extraction. A transformer encoder-decoder structure was employed, with a custom self-attention module in place of the standard attention mechanism.
The regression decoder comprised an upsampling layer and three convolution layers with ReLU activation functions. Training images underwent random scaling, horizontal flipping, and arbitrary cropping to ensure diverse training data. Evaluation metrics included mean absolute error (MAE) and mean squared error (MSE), which assessed the accuracy and robustness of crowd-counting estimates.
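The two metrics can be sketched as follows. One caveat is worth flagging: many crowd-counting papers report the square root of the mean squared error under the name MSE, so the sketch exposes both forms; which one this study used is not stated here. The per-image counts are made-up numbers.

```python
import math
from typing import Sequence


def mean_absolute_error(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Average absolute difference between predicted and true per-image counts."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def mean_squared_error(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Average squared difference; penalizes large errors more than MAE does."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


def root_mean_squared_error(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Square root of the MSE, expressed in the same units as the counts."""
    return math.sqrt(mean_squared_error(pred, truth))


# Hypothetical predicted and true counts for three test images.
predicted_counts = [105.0, 48.0, 210.0]
true_counts = [100.0, 50.0, 200.0]

print(mean_absolute_error(predicted_counts, true_counts))
print(mean_squared_error(predicted_counts, true_counts))
print(root_mean_squared_error(predicted_counts, true_counts))
```

MAE reflects typical accuracy, while the squared-error metrics are more sensitive to occasional large misses and are therefore read as a measure of robustness.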
The study compared VGGT-Count with 11 recent state-of-the-art methods across the three datasets. VGGT-Count achieved competitive accuracy, outperforming many of these methods on all datasets, and it recorded lower MAE and MSE than the baseline method on every dataset. Additionally, real-time performance analysis showed that VGGT-Count balanced model size against prediction speed, exhibiting high frame rates and low inference times compared to other methods with similar model sizes.
Discussion
The authors elaborated on the comparison between the VGGT-Count and DM-Count models in crowd-counting prediction. VGGT-Count leveraged transformer attention mechanisms to capture regional variations and contextual information, resulting in more accurate and stable predictions under complex conditions. However, challenges arose with low-resolution images and occluded scenes, where the model's performance might degrade due to reduced detail visibility and obscured individuals.
Regarding application and feasibility, VGGT-Count held promise for real-world tourism management scenarios. It could aid tourism authorities in crowd monitoring and control by providing real-time density information for popular destinations. Integration into mobile apps or tourist information systems could allow tourists to make informed decisions about their itineraries, enhancing their overall experience. Thus, VGGT-Count demonstrated significant potential for improving crowd management and enhancing tourist experiences.
Conclusion
In conclusion, the VGGT-Count network model presented in this study demonstrated high accuracy in estimating crowd density, particularly in HATCs. Through the integration of VGG-19 and transformer-based encoding, coupled with multi-head attention mechanisms, the model achieved precise predictions even in complex scenarios.
Real-time monitoring and early warning systems tailored to distinct density thresholds offered practical solutions for crowd management in tourist destinations. The model's effectiveness across various datasets and its applicability in real-world tourism scenarios highlighted its potential for enhancing crowd safety and management in high-density areas.
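A threshold-based warning step of the kind described could be sketched as below: an estimated count is converted to a density over a monitored area and mapped to a warning level. The cutoff values, level names, and monitored area are hypothetical placeholders chosen only for illustration; they are not the thresholds used in the study.

```python
from bisect import bisect_right
from typing import Sequence


def density_per_square_meter(estimated_count: float, area_m2: float) -> float:
    """Convert a predicted head count into people per square meter."""
    if area_m2 <= 0:
        raise ValueError("area must be positive")
    return estimated_count / area_m2


def warning_level(
    density: float, cutoffs: Sequence[float], labels: Sequence[str]
) -> str:
    """Map a density to a label; cutoffs must be sorted in ascending order.

    A density equal to a cutoff falls into the higher level.
    """
    if len(labels) != len(cutoffs) + 1:
        raise ValueError("need exactly one more label than cutoffs")
    return labels[bisect_right(cutoffs, density)]


# Hypothetical configuration: two cutoffs define three warning levels.
CUTOFFS = [2.0, 4.0]                     # people per square meter (placeholder)
LABELS = ["normal", "caution", "alert"]  # placeholder level names

density = density_per_square_meter(estimated_count=900, area_m2=300)
print(density, warning_level(density, CUTOFFS, LABELS))
```

Keeping the cutoffs as data rather than hard-coding them lets each destination tune its own levels without changing the logic.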