A recent review published in the journal Environmental Research examined how well machine learning (ML) algorithms predict ambient air pollution levels compared to traditional statistical methods.
The researchers focused on three key pollutants: nitrogen dioxide (NO₂), ultrafine particles (UFPs), and black carbon (BC). These pollutants have high spatial and temporal variability and significant health impacts. The review aimed to determine if ML methods offer better performance than traditional statistical techniques in capturing these variations.
Background
Air pollution is a major global health issue that contributes to illness and death worldwide. Accurate exposure assessments are crucial for understanding the risks of air pollution. Pollutants such as NO₂, UFPs, and BC fluctuate widely in space and time, making them challenging to model with traditional methods.
Land use regression (LUR) models are often used to estimate outdoor air pollution, but these models may struggle with the complex, non-linear relationships between pollution levels and environmental factors. ML techniques, which can better capture these non-linear relationships, have become increasingly popular in air pollution modeling.
About the Review
In this study, the authors aimed to assess the performance of ML methods in predicting ambient concentrations of NO₂, UFPs, and BC compared to statistical regression models. To identify relevant studies, they searched two major scientific databases, Scopus and Web of Science, for research published up to June 13, 2024.
The studies had to meet specific criteria: they needed to report spatial or spatiotemporal models using both ML and statistical regression methods for the same pollutants and datasets, focus on outdoor UFPs, NO₂, or BC, include a quantitative assessment of model performance, and be peer-reviewed articles with original data.
The researchers identified 38 eligible studies with 46 model comparisons. These studies were conducted in various countries and covered urban, regional, and global spatial extents. The authors extracted detailed information on study designs, modeling methods, and performance metrics, including the coefficient of determination (R²) and root mean square error (RMSE).
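For readers unfamiliar with these two metrics, the following is a minimal sketch of how R² and RMSE are commonly computed from observed and predicted concentrations. The function names and the sample values are illustrative assumptions and are not taken from the review; individual studies may also define R² differently (for example, as a squared correlation).

```python
import math

def rmse(observed, predicted):
    """Root mean square error, in the same units as the pollutant."""
    n = len(observed)
    squared_errors = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared_errors / n)

def r_squared(observed, predicted):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Made-up NO2 concentrations (µg/m³), for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(f"RMSE = {rmse(observed, predicted):.2f}")
print(f"R²   = {r_squared(observed, predicted):.3f}")
```

A lower RMSE and a higher R² both indicate that predictions track the measurements more closely.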
Statistical methods ranged from linear regression techniques, like multiple linear regression (MLR) and stepwise linear regression (SLR), to nonlinear and regularized methods such as generalized additive models (GAM) and least absolute shrinkage and selection operator (LASSO). The ML methods included random forest (RF), artificial neural networks (ANN), extreme gradient boosting (XGBoost), and convolutional neural networks (CNN), among others.
Key Results
The review found that ML methods outperformed statistical regression models in 34 of the 46 model comparisons. On average, the best ML models showed an increase of 0.12 in R² and a 20% decrease in RMSE compared to the best statistical models. Tree-based methods, such as RF and XGBoost, were the most frequently used and best-performing ML approaches, surpassing other methods in 12 of 17 multi-model comparisons. In contrast, ANN models often performed the worst among the evaluated ML methods.
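Summary figures of this kind (number of comparisons won, average R² gain, average percentage decrease in RMSE) can be derived from per-study results roughly as sketched below. The `Comparison` class, the helper functions, and all numbers are hypothetical and chosen only for illustration; counting a "win" by higher R² is an assumption here, not necessarily the criterion used in the review.

```python
from dataclasses import dataclass

@dataclass
class Comparison:
    """Best ML and best statistical model from one hypothetical comparison."""
    r2_ml: float
    r2_stat: float
    rmse_ml: float
    rmse_stat: float

def r2_gain(c: Comparison) -> float:
    """Absolute difference in R² (ML minus statistical)."""
    return c.r2_ml - c.r2_stat

def rmse_decrease_pct(c: Comparison) -> float:
    """Percentage reduction in RMSE relative to the statistical model."""
    return 100 * (c.rmse_stat - c.rmse_ml) / c.rmse_stat

# Invented values, not data from the review.
comparisons = [
    Comparison(r2_ml=0.80, r2_stat=0.65, rmse_ml=4.0, rmse_stat=5.0),
    Comparison(r2_ml=0.70, r2_stat=0.60, rmse_ml=6.0, rmse_stat=8.0),
    Comparison(r2_ml=0.55, r2_stat=0.58, rmse_ml=9.0, rmse_stat=9.0),
]

ml_wins = sum(1 for c in comparisons if r2_gain(c) > 0)
mean_r2_gain = sum(r2_gain(c) for c in comparisons) / len(comparisons)
mean_rmse_decrease = sum(rmse_decrease_pct(c) for c in comparisons) / len(comparisons)

print(f"ML better in {ml_wins} of {len(comparisons)} comparisons")
print(f"Mean R² gain: {mean_r2_gain:.3f}")
print(f"Mean RMSE decrease: {mean_rmse_decrease:.1f}%")
```

Reporting both the absolute R² difference and the relative RMSE change helps when studies use different pollutants and units, since RMSE values are not directly comparable across them.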
ML methods provided greater performance gains for spatiotemporal models (predicting hourly, daily, or monthly pollutant levels) than for spatial models (predicting annual or seasonal averages). This may be because linear, non-regularized statistical methods struggled with the complexity of short-term pollutant variations.
Interestingly, nonlinear and regularized statistical regression methods, such as GAM and LASSO, sometimes performed similarly to ML models, especially for spatial models. This suggests that flexible regression techniques can match ML performance in some scenarios.
Applications
This review has significant implications for air pollution exposure assessment and epidemiological research. Accurate modeling of ambient air pollutant concentrations is crucial for estimating individual exposures and understanding the health impacts of air pollution. The superior performance of ML, particularly tree-based methods, in predicting spatiotemporal variations of NO₂, UFPs, and BC suggests that these techniques could improve exposure assessments and epidemiological studies.
The insights from this study can guide the choice of modeling approaches for different air pollutants and study designs. For example, nonlinear and regularized statistical methods may be more suitable for modeling spatial patterns, while ML techniques could be beneficial for spatiotemporal modeling.
Conclusion
The review concluded that ML methods, especially tree-based algorithms, generally outperformed traditional statistical regression techniques in predicting the spatial and temporal variations of key air pollutants. It highlighted the potential of ML to enhance air pollution exposure assessment and contribute to more accurate epidemiological studies.
The review also emphasized the need for further research to compare a broader range of statistical and ML methods and the importance of standardized reporting of methodologies and results. Future research should explore the performance of different ML algorithms, including deep learning methods, in various contexts, and prioritize the development of standardized reporting guidelines to ensure transparency, reproducibility, and comparability across studies.
Journal reference:
- Vachon, J., Kerckhoffs, J., Buteau, S., & Smargiassi, A. Do Machine Learning Methods Improve Prediction of Ambient Air Pollutants with High Spatial Contrast? A Systematic Review. Environmental Research, 2024, 119751. DOI: 10.1016/j.envres.2024.119751, https://www.sciencedirect.com/science/article/pii/S0013935124016566