In a paper published in the journal Applied Sciences, researchers evaluated how the selection of non-landslide samples affects landslide susceptibility models, comparing whole-region random sampling (I) with partition-based random sampling (II). Using 834 landslide points and 10 environmental factors, they trained random forest (RF) and back propagation neural network (BPNN) models under each sampling strategy. The II-BPNN model achieved the highest accuracy, outperforming the other three models. Partition-based sampling improved prediction accuracy, recall, and specificity, enhancing the models' predictive capabilities.
Past work on landslide hazard risk assessment identifies three key components: susceptibility, hazard, and risk assessment. Data-driven machine learning (ML) models are gaining popularity because they can handle complex, nonlinear relationships and provide accurate, reliable predictions. However, the selection of non-landslide samples often relies on subjective methods, which introduces bias.
Methodology Overview
The frequency ratio (FR) method estimates how likely landslides are within each interval of an environmental factor. Using ArcGIS 10.8 for spatial analysis, the extent of landslide occurrence in the grid cells of each interval is obtained, and the FR value is calculated. An FR value greater than 1 signifies a substantial impact of the environmental factor on landslides within that interval. Conversely, an FR value less than 1 indicates a minimal influence of the factor.
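The FR calculation can be sketched in a few lines of Python. This is a minimal illustration that assumes the common definition of FR as the share of landslide cells in an interval divided by the share of all grid cells in that interval; the function name, slope intervals, and cell counts are hypothetical and are not taken from the paper.

```python
def frequency_ratio(landslide_cells, total_cells):
    """Return the FR value for each interval of one environmental factor.

    landslide_cells[k]: number of landslide grid cells in interval k
    total_cells[k]:     number of all grid cells in interval k
    """
    total_landslides = sum(landslide_cells.values())
    total_area = sum(total_cells.values())
    fr = {}
    for interval, n_cells in total_cells.items():
        if n_cells == 0:
            continue  # an empty interval has no defined ratio
        landslide_share = landslide_cells.get(interval, 0) / total_landslides
        area_share = n_cells / total_area
        fr[interval] = landslide_share / area_share
    return fr


# Hypothetical slope intervals (degrees) and cell counts, for illustration only.
slope_landslides = {"0-10": 10, "10-30": 70, ">30": 20}
slope_cells = {"0-10": 5000, "10-30": 4000, ">30": 1000}

fr_values = frequency_ratio(slope_landslides, slope_cells)
for interval, value in fr_values.items():
    print(interval, round(value, 2))
```

With these made-up counts, the two steeper intervals have FR values above 1 because they hold a larger share of the landslides than of the area, while the flattest interval falls below 1.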
Based on information theory, the information value model predicts regional geological disasters by evaluating the likelihood of disasters occurring in an area. The information value associated with each environmental factor is determined by comparing the number of units with the factor where disasters happen to the total count of such units in the study area. The cumulative information value for each evaluation unit is calculated by summing the information values across various states of the factors, indicating the likelihood of disaster occurrence.
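As a rough sketch of this calculation, the Python below assumes a widely used formulation of the information value, I = ln((N_i / N) / (S_i / S)), where N_i is the number of landslide units in a factor state, N is the total number of landslide units, S_i is the number of units in that state, and S is the total number of units. The paper's exact formulation may differ, and all function names and counts here are hypothetical.

```python
import math


def information_value(landslide_units, total_units):
    """Information value of each state of one factor (assumed log-ratio form)."""
    n_total = sum(landslide_units.values())
    s_total = sum(total_units.values())
    values = {}
    for state, s_i in total_units.items():
        n_i = landslide_units.get(state, 0)
        if n_i == 0 or s_i == 0:
            continue  # the logarithm is undefined; such states are skipped here
        values[state] = math.log((n_i / n_total) / (s_i / s_total))
    return values


def cumulative_information_value(unit_states, factor_values):
    """Sum the information values of the states one evaluation unit falls in."""
    return sum(factor_values[factor][state] for factor, state in unit_states.items())


# Hypothetical counts for two factors, for illustration only.
factor_values = {
    "slope": information_value(
        {"0-10": 10, "10-30": 70, ">30": 20},
        {"0-10": 5000, "10-30": 4000, ">30": 1000},
    ),
    "elevation": information_value(
        {"<400": 20, "400-1000": 60, ">1000": 20},
        {"<400": 4000, "400-1000": 3000, ">1000": 3000},
    ),
}

# One evaluation unit described by the state it occupies for each factor.
unit = {"slope": "10-30", "elevation": "400-1000"}
total_value = cumulative_information_value(unit, factor_values)
print(round(total_value, 4))
```

Under this assumed formulation, a positive cumulative value means the unit's factor states are, taken together, over-represented among landslide units.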
RF is an ensemble learning method that combines multiple decision trees (DT) to improve model performance. It uses bootstrap sampling to create different training sets and selects a random subset of features at each split to ensure tree diversity, which reduces the risk of overfitting and improves generalization. The BPNN, used for classification and regression, consists of an input layer, two hidden layers with 10 and 6 neurons using rectified linear unit (ReLU) activation, and an output layer with a softmax function and two nodes.
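To make the BPNN layout concrete, the following pure-Python sketch runs a single forward pass through a network with the layer sizes described above. It assumes 10 inputs (one per environmental factor) and uses random, untrained weights; the backpropagation training step is not shown, and every function name is illustrative rather than taken from the paper.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    # weights holds one row of input weights per output neuron
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(features, layers):
    activation = features
    for weights, biases in layers[:-1]:       # hidden layers use ReLU
        activation = relu(dense(activation, weights, biases))
    weights, biases = layers[-1]              # output layer uses softmax
    return softmax(dense(activation, weights, biases))


rng = random.Random(0)
layer_sizes = [10, 10, 6, 2]  # inputs, hidden 1, hidden 2, output nodes
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

sample = [rng.random() for _ in range(10)]    # hypothetical normalized factor values
probs = forward(sample, layers)
print(probs)  # two class probabilities (non-landslide, landslide) that sum to 1
```

Because the weights are random, the printed probabilities carry no meaning; the sketch only shows how the two softmax outputs arise from the 10-6 hidden structure.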
Located in central-eastern China, Henan Province spans 166,000 square kilometers. The region is characterized by various landscapes, such as plains, mountains, and hills. This study specifically targets western Henan's mountainous regions, with elevations between 500 and 2000 meters. By the end of 2017, 834 landslides, categorized as loess landslides and rockslides, had been recorded.
Landslide susceptibility evaluation incorporated ten environmental variables: elevation, slope, aspect, profile curvature, land cover, geological composition, topographic wetness index, distance to rivers, distance to fault lines, and proximity to roads. The analyses used a GIS to extract raster layers of each factor at a 30 m resolution, and FR values were calculated to determine each factor's influence on landslide occurrence.
Landslide Susceptibility Analysis
The study focused on landslide susceptibility in western Henan Province, employing historical data and various modeling techniques. Using the information value model and FR method, the study analyzed environmental factors and assessed landslide susceptibility through RF and BPNN models. The process involved compiling landslide data, determining FR values, assessing susceptibility levels, and predicting landslide susceptibility indices. The models generated susceptibility maps using positive and negative samples, which were then evaluated for accuracy.
The study optimized negative sample selection by dividing areas into different susceptibility zones and comparing results from various sampling methods. The analysis revealed that the II-BPNN model best predicted high-risk regions, followed by the II-RF, I-RF, and I-BPNN models. The results highlighted significant differences in model performance based on negative sample selection and model type. The receiver operating characteristic (ROC) curve analysis showed that the II-BPNN model had the highest accuracy, with a notable improvement in distinguishing between landslide and non-landslide areas.
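The difference between the two sampling strategies can be illustrated with a short Python sketch. It assumes that strategy II restricts negative samples to cells in lower-susceptibility zones of a preliminary zoning map, which is one plausible reading of the partition-based approach and is not confirmed in detail by this summary. The grid, zone labels, and function names are all hypothetical.

```python
import random


def whole_region_sampling(cells, landslide_ids, n_samples, rng):
    """Strategy I: draw negatives from every cell without a recorded landslide."""
    candidates = [c["id"] for c in cells if c["id"] not in landslide_ids]
    return rng.sample(candidates, n_samples)


def partition_based_sampling(cells, landslide_ids, n_samples, rng,
                             allowed_zones=("very low", "low")):
    """Strategy II (assumed): draw negatives only from low-susceptibility zones."""
    candidates = [c["id"] for c in cells
                  if c["id"] not in landslide_ids and c["zone"] in allowed_zones]
    return rng.sample(candidates, n_samples)


# Hypothetical grid: 1000 cells, each tagged with a preliminary susceptibility zone.
zones = ["very low", "low", "moderate", "high", "very high"]
cells = [{"id": i, "zone": zones[i % 5]} for i in range(1000)]
landslide_ids = set(range(0, 1000, 25))       # 40 made-up landslide cells

rng = random.Random(42)
negatives_i = whole_region_sampling(cells, landslide_ids, 40, rng)
negatives_ii = partition_based_sampling(cells, landslide_ids, 40, rng)

zone_of = {c["id"]: c["zone"] for c in cells}
print(sorted({zone_of[i] for i in negatives_ii}))  # only the allowed zones appear
```

Strategy I can pick cells from any zone, including ones that resemble landslide terrain, while the assumed strategy II keeps negatives in areas that the preliminary zoning already rates as unlikely to fail.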
The RF models were effective in structured data processing, while BPNN models excelled in handling complex, nonlinear relationships. The study emphasized the importance of selecting the appropriate model based on the specific needs of landslide susceptibility tasks and the characteristics of the data.
Conclusion
To sum up, the study evaluated landslide susceptibility in western Henan Province using ten environmental factors, combining the frequency ratio method with two random sampling strategies for non-landslide samples to build RF and BPNN models. It found that high-risk areas showed specific characteristics, such as elevations of 400–1000 m and slopes of 10°–30°. Models II-RF and II-BPNN identified high-susceptibility zones more accurately than the I-RF and I-BPNN models, achieving ROC accuracies of 0.9464 and 0.9522, respectively. Optimizing negative sample areas improved prediction accuracy by avoiding non-landslide samples near landslide boundaries.
Journal reference:
- Wang, X., et al. (2024). Construction and Optimization of Landslide Susceptibility Assessment Model Based on Machine Learning. Applied Sciences, 14:14, 6040. DOI: 10.3390/app14146040, https://www.mdpi.com/2076-3417/14/14/6040