In a paper published in the journal PLOS ONE, researchers investigated the impact of different factor screening methods (FSMs) on predictive modeling accuracy. Using 2014 landslide data from Jingdong County, they created a database of 136 landslides and 11 factors.
The researchers applied several FSMs and used a random forest (RF) model for landslide susceptibility mapping (LSM). The study found that factor screening significantly improved model accuracy, and that combining Information Gain Ratio (IGR) screening with RF, called the IGR_RF model, performed best. The factor weighting analysis (FWA) also identified the normalized difference vegetation index (NDVI), elevation, and aspect as the most influential factors in landslide prediction, providing valuable insights for future LSM.
Background
Landslides are a common geological threat in mountainous regions. Jingdong County in Yunnan Province, China, is particularly susceptible due to its location near the collision zone of the Indian Ocean and Eurasian plates. Historical landslides have caused significant human and economic losses. Preventing and reducing landslide hazards is a priority for local authorities.
LSM is vital for assessing landslide risks. With the advent of data mining and machine learning, models like random forests have gained popularity for LSM due to their predictive capabilities. However, various factors can affect model accuracy, including factor screening, sample selection, and model optimization. Factor screening, in particular, plays a critical role in data preparation and requires careful consideration.
Factor screening selects the most relevant input factors to improve the model's performance. Researchers have used various methods for factor screening, each with its own approach and advantages. Common FSMs include the multicollinearity test (MT), Pearson correlation coefficient (PCC), GeoDetector (GD), IGR, recursive feature elimination, rough set, frequency ratio, and deterministic coefficient. Because the choice of method can significantly affect the final model's accuracy and efficiency, comparing these methods is essential to determine which most improves predictive performance in LSM.
Methods
The researchers used the slope unit as the fundamental evaluation unit and structured their methodology into several stages. First, data on historical landslides and relevant factors in Jingdong County were preprocessed. Non-landslide points were then selected at a 1:1 ratio to the available landslide points. The resulting landslide database was partitioned into training and test datasets at a 7:3 ratio. Next, four distinct FSMs were applied to the training data to determine the most suitable factors for modeling. Lastly, the selected factor sets were fed into an RF model for LSM, and the results were analyzed.
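The sampling and partitioning steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the integer unit IDs, the pool of candidate non-landslide units, and the random seed are all assumptions made for the example, and the paper may have drawn and split its samples differently (for instance, with stratification).

```python
import random

def build_samples(landslide_units, candidate_units, seed=0):
    """Pair the landslide units with an equal number of randomly drawn
    non-landslide units (1:1 ratio). Labels: 1 = landslide, 0 = non-landslide."""
    rng = random.Random(seed)
    non_landslide = rng.sample(candidate_units, k=len(landslide_units))
    samples = [(u, 1) for u in landslide_units] + [(u, 0) for u in non_landslide]
    rng.shuffle(samples)  # mix the classes before splitting
    return samples

def split_train_test(samples, train_fraction=0.7):
    """Cut the shuffled samples into training and test sets (7:3 by default)."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]

# Hypothetical IDs: 136 landslide units and a larger pool of candidate units.
landslide_units = list(range(136))
candidate_units = list(range(1000, 3000))

samples = build_samples(landslide_units, candidate_units)
train, test = split_train_test(samples)
print(len(samples), len(train), len(test))  # 272 190 82
```

With 136 landslides, a 1:1 ratio yields 272 samples, which a 7:3 cut divides into 190 training and 82 test samples.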
Factor Screening Methods
The study used four factor screening methods, each with distinct characteristics, to improve the prediction accuracy of the RF model: IGR, GD, PCC, and MT. The methods fall into two groups according to the type of factor data they require: IGR and GD use reclassified factor data, while PCC and MT operate on the original (normalized) factors.
- IGR measures the importance of a factor in predicting landslide susceptibility, with higher IGR values indicating greater significance. The calculation process involves the consideration of the dataset's information gain and empirical entropy.
- GD is a statistical method to identify spatial differentiation and its influencing factors. In this context, it calculated the q value to measure the explanatory power of each factor on landslide spatial patterns.
- PCC assesses the degree of correlation between two factors. A high absolute PCC value indicates a strong correlation between the factors.
- MT checks for excessive correlation among factors, which can bias landslide predictions. It uses the Variance Inflation Factor (VIF) and Tolerance (TOL), which are reciprocals of each other, to evaluate factor correlations.
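To make three of the screening measures above concrete, the sketch below computes IGR, the GeoDetector q value, and PCC with the Python standard library. It uses the common textbook definitions (IGR as information gain divided by the entropy of the factor's classes, q as one minus the ratio of within-class to total sum of squares); the paper's exact formulas, class breaks, and data are not reproduced here, and the function names and toy data are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def group_by(classes, values):
    """Collect values under the factor class they belong to."""
    groups = defaultdict(list)
    for cls, v in zip(classes, values):
        groups[cls].append(v)
    return groups

def information_gain_ratio(factor_classes, labels):
    """IGR = (H(labels) - H(labels | factor)) / H(factor)."""
    n = len(labels)
    groups = group_by(factor_classes, labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(factor_classes)
    if split_info == 0:  # factor has a single class, so it carries no information
        return 0.0
    return (entropy(labels) - conditional) / split_info

def geodetector_q(factor_classes, values):
    """q = 1 - (within-class sum of squares) / (total sum of squares)."""
    mean = sum(values) / len(values)
    total_ss = sum((v - mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0
    within_ss = 0.0
    for g in group_by(factor_classes, values).values():
        m = sum(g) / len(g)
        within_ss += sum((v - m) ** 2 for v in g)
    return 1.0 - within_ss / total_ss

def pearson(x, y):
    """Pearson correlation coefficient between two numeric factors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy data (hypothetical): 1 = landslide, 0 = non-landslide.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
informative = ["steep"] * 4 + ["gentle"] * 4  # separates the classes perfectly
uninformative = ["a", "b"] * 4                # unrelated to the labels

igr_good = information_gain_ratio(informative, labels)
igr_bad = information_gain_ratio(uninformative, labels)
q_good = geodetector_q(informative, labels)
q_bad = geodetector_q(uninformative, labels)
r = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(igr_good, igr_bad, q_good, q_bad, round(r, 6))
```

On this toy data, the factor that separates the classes perfectly scores 1 on both IGR and q, while the unrelated factor scores 0 on both. The two perfectly linear numeric factors have a PCC of about 1, the kind of redundancy a correlation-based screen is meant to flag.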
LSM with RF
The RF model, an ensemble algorithm, was employed for landslide susceptibility mapping. It constructs multiple decision trees and combines their predictions by voting. The critical steps in building the RF model are bootstrap sampling, generation of the decision trees, and prediction based on the tree ensemble. The researchers used out-of-bag (OOB) data, the samples left out of each tree's bootstrap sample, to evaluate model performance.
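The three steps named above (bootstrap sampling, tree generation, and ensemble voting) together with OOB evaluation can be illustrated with a deliberately small sketch. It is a simplified stand-in, not the model used in the study: each "tree" is a depth-1 decision stump on a randomly chosen feature, and the function names, toy data, tree count, and seed are assumptions for the example. A production model would normally come from an established library.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """Depth-1 tree: choose the (feature, threshold, orientation) with the
    fewest errors on the given training rows."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            for left, right in ((0, 1), (1, 0)):
                errors = sum((left if r[f] <= t else right) != y
                             for r, y in zip(rows, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, t, left, right)
    return best[1:]

def predict_stump(stump, row):
    f, t, left, right = stump
    return left if row[f] <= t else right

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train each stump on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feats = rng.sample(range(len(rows[0])), n_features)
        stump = fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        forest.append((stump, set(idx)))  # remember which rows were in-bag
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(s, row) for s, _ in forest)
    return votes.most_common(1)[0][0]

def oob_accuracy(forest, rows, labels):
    """Score each row using only the stumps that never saw it in training."""
    correct = total = 0
    for i, (row, y) in enumerate(zip(rows, labels)):
        votes = Counter(predict_stump(s, row)
                        for s, in_bag in forest if i not in in_bag)
        if votes:  # skip rows that were in-bag for every stump
            total += 1
            correct += int(votes.most_common(1)[0][0] == y)
    return correct / total if total else float("nan")

# Toy data (hypothetical): two factors that both track the class label.
rows = [(i, 100 - 3 * i) for i in range(20)]
labels = [1 if i >= 10 else 0 for i in range(20)]

forest = fit_forest(rows, labels)
oob = oob_accuracy(forest, rows, labels)
print(predict(forest, rows[0]), predict(forest, rows[19]), oob)
```

Because each bootstrap sample leaves out roughly a third of the rows, the OOB score estimates generalization error without a separate validation set.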
Accuracy Verification
To assess the accuracy and reliability of the model, the researchers validated it using confusion matrices and receiver operating characteristic (ROC) curves. The Area Under the Curve (AUC), calculated from the ROC curve, indicates the model's classification effectiveness. Precision, accuracy, recall, and F1 score were computed from a confusion matrix that compares model predictions against actual categories. Together, these measures gave a comprehensive assessment of the model's performance.
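The validation measures above follow standard definitions, sketched below with the Python standard library. The labels, scores, and the 0.5 decision threshold are made-up illustrative values, not results from the study, and the function names are assumptions. AUC is computed here through its rank interpretation: the probability that a randomly chosen landslide sample receives a higher score than a randomly chosen non-landslide sample, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 derived from the confusion matrix."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and
    negative scores (equivalent to the Mann-Whitney U statistic)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative values only.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]  # predicted susceptibility
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
print(metrics, area)
```

In this example one landslide is missed and one non-landslide is flagged, so accuracy, precision, recall, and F1 all equal 2/3, while the AUC is 8/9 because eight of the nine positive-negative pairs are ranked correctly.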
Results
The study evaluated different FSMs in landslide susceptibility mapping. Four methods (IGR, GD, PCC, and MT) were applied to assess the importance of various factors, and they consistently identified NDVI, elevation, and aspect as significant. Models with factor screening performed better than the model without factor elimination, with IGR_RF showing the highest predictive power. Factor screening therefore proved crucial to the accuracy of landslide susceptibility mapping. Spatial distribution analysis showed that the models predicted landslides effectively in high-susceptibility zones and underscored the value of removing redundant factors.
Conclusion
To sum up, this study underscores the significance of factor screening in enhancing the predictive accuracy of machine learning models for landslide susceptibility. Among the methods explored, IGR stands out for its ability to consider factor-landslide relationships and assign informative weight values. Notably, the study identifies NDVI, elevation, and aspect as pivotal factors, with specific NDVI values and elevation ranges serving as critical indicators of landslide occurrence. Ultimately, the IGR_RF model outperforms the others, offering insights to guide future landslide prevention and management efforts and a useful reference for researchers working on factor screening.