In a paper published in the journal PLOS ONE, researchers used explainable artificial intelligence (XAI) techniques to uncover the factors behind geographic variations in obesity prevalence across 3,142 counties in the United States of America (USA). Their findings revealed that machine learning models could explain 79% of the variance in obesity rates, pinpointing physical inactivity, diabetes, and smoking prevalence as the most significant contributors to these disparities.
Background
Understanding the primary determinants of health, particularly in the context of obesity research, is a critical endeavor. Many health behaviors and environmental factors influence the obesity crisis and exhibit significant geographic disparities within the USA. While machine learning offers a potent tool for modeling the variation in obesity prevalence, such models are frequently opaque and difficult to interpret. XAI has emerged as a response to this challenge, making the insights gained from predictive models explicit.
Proposed Method
Data Acquisition: This study adheres to the cross-sectional study reporting guidelines outlined by the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) [14, S2 File]. It draws upon 2022 County Health Rankings data, aggregating health-related statistics across 3,142 US counties. Institutional review board approval is not required because the data is publicly available and non-identifiable. County-level obesity prevalence, defined as the proportion of adults with a body mass index of ≥ 30, is the primary outcome for all analyses. The Behavioral Risk Factor Surveillance System calculates obesity prevalence using self-reported height and weight data. The dataset encompasses 64 variables spanning seven broad categories, including health outcomes, health behaviors, clinical care, social and economic factors, physical environment, demographics, and severe housing conditions.
Data Preparation: R version 4.2.1 facilitated the analyses, with the corresponding R script provided in the S1 File. The researchers omit variables with over 10% missing data, as well as one member of each highly correlated variable pair (|Spearman correlation| > 0.90).
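As a rough illustration of these two filtering rules (the study's actual code is the R script in the S1 File), the same logic can be sketched in Python with pandas; the toy columns below are hypothetical, not County Health Rankings variables:

```python
import numpy as np
import pandas as pd

def filter_variables(df, max_missing=0.10, max_corr=0.90):
    """Drop variables with too much missing data, then drop one member
    of each highly correlated pair (|Spearman rho| > max_corr)."""
    # Rule 1: remove columns whose fraction of missing values exceeds 10%.
    df = df[df.columns[df.isna().mean() <= max_missing]]
    # Rule 2: pairwise Spearman correlations on the remaining columns;
    # keep only the upper triangle so each pair is checked once.
    corr = df.corr(method="spearman").abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > max_corr).any()]
    return df.drop(columns=to_drop)

# Toy data: 'b' is missing too often; 'c' is a monotone copy of 'a'.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "b": [1, None, None, 4, 5, 6, 7, 8, 9, 10],  # 20% missing
    "c": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],   # Spearman rho = 1 with 'a'
    "d": [5, 3, 8, 1, 9, 2, 7, 4, 10, 6],
})
filtered = filter_variables(toy)
print(list(filtered.columns))  # → ['a', 'd']
```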
The analysis excludes values marked as unreliable in the County Health Rankings dataset, resulting in a dataset comprising 65 variables across 3,142 counties. Data analysis employs a stratified 2-fold cross-validation scheme for model training and evaluation, with the groupdata2 R package (version 2.0.2) facilitating partitioning.
Missing values are estimated separately for each data partition using the multivariate imputation by chained equations (mice) R package, generating 10 imputations with 100 iterations each. The final analysis utilizes median imputed values, with convergence assessed through trace lines of means and standard deviations. Notably, the prevalence of adults with obesity, the primary outcome, is not used for imputing any variable.
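The authors' multiple-imputation step uses the mice R package; a loose Python analogue with scikit-learn's IterativeImputer (also a chained-equations imputer) might look like the sketch below. The synthetic data, the smaller run counts, and the seeds are illustrative; only the idea of pooling several imputations with a median mirrors the study:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] += 0.8 * X[:, 0]           # give column 1 structure the imputer can use
mask = rng.random(X.shape) < 0.05  # ~5% values missing at random
X_miss = X.copy()
X_miss[mask] = np.nan

# Run several chained-equation imputations (analogous to mice's m imputations;
# the study used 10 with 100 iterations) and keep the element-wise median,
# mirroring the study's use of median imputed values.
runs = [
    IterativeImputer(max_iter=20, random_state=s).fit_transform(X_miss)
    for s in range(5)
]
X_imputed = np.median(np.stack(runs), axis=0)
print(np.isnan(X_imputed).sum())  # → 0
```

Note that, as in the study, the outcome column would simply be excluded from the matrix passed to the imputer so it never informs any imputation.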
Analytical Approach: The study employs the iterative random forest R package (version 3.0.0) to construct a random forest prediction model for county-level obesity prevalence, incorporating a 2-fold cross-validation approach. This choice enhances the random forest model by producing more stable estimates of feature importance.
The modeling algorithm generates a forest of 1,000 decision trees separately for each data partition, with each decision tree relying on a random subset of 8 features selected from the 64 available. Feature importance is estimated based on the variance explained in the outcome across all decision trees. Subsequently, researchers generate a second prediction model with iterations weighted according to the importance of features from the first model.
This process iterates 100 times, keeping the best-performing model based on out-of-bag error. Researchers assess model performance by using the variance explained and calculating the mean absolute difference between predicted and actual prevalence in the evaluation data.
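A simplified scikit-learn analogue of the base modeling step is sketched below. It does not reproduce the iterative importance-weighted refitting of the iterative random forest package; the synthetic data and all hyperparameters other than the 1,000 trees, the small per-split feature subset, and the 2-fold scheme are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the county-level data: 300 "counties", 20 predictors.
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# 2-fold cross-validation, echoing the study's partitioning scheme.
kf = KFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    # 1,000 trees, each split drawing on a small random feature subset
    # (the study used 8 of 64; here 4 of 20).
    rf = RandomForestRegressor(n_estimators=1000, max_features=4,
                               oob_score=True, random_state=0, n_jobs=-1)
    rf.fit(X[train_idx], y[train_idx])
    pred = rf.predict(X[test_idx])
    # Out-of-bag R2 guides model selection; held-out R2 and mean absolute
    # error evaluate performance, as in the study.
    print(f"OOB R2={rf.oob_score_:.2f}  "
          f"held-out R2={r2_score(y[test_idx], pred):.2f}  "
          f"MAE={mean_absolute_error(y[test_idx], pred):.1f}")
```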
Insights from Local Effects: While the random forest model reveals feature importance, it doesn't elucidate the direction of the relationships. Accumulated local effects plots address this gap by showing how predicted obesity prevalence changes as feature values vary within specific ranges.
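A minimal first-order ALE computation for a single feature can be sketched as follows; the quantile binning, the toy model, and the function name are assumptions for illustration, not the study's implementation:

```python
import numpy as np

def ale_1d(predict, X, feature, n_bins=10):
    """First-order accumulated local effects for one feature: within each
    quantile bin, average the change in prediction when the feature moves
    from the bin's lower to its upper edge, then accumulate and center."""
    x = X[:, feature]
    # Quantile edges so each bin holds roughly equal numbers of observations.
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
    effects = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        last = hi == edges[-1]
        in_bin = (x >= lo) & ((x <= hi) if last else (x < hi))
        if not in_bin.any():
            effects.append(0.0)
            continue
        X_lo, X_hi = X[in_bin].copy(), X[in_bin].copy()
        X_lo[:, feature], X_hi[:, feature] = lo, hi
        effects.append(np.mean(predict(X_hi) - predict(X_lo)))
    ale = np.cumsum(effects)
    return edges, ale - ale.mean()

# Toy model: prediction rises linearly with feature 0; feature 1 is inert.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(500, 2))
model = lambda X: 3.0 * X[:, 0]
edges, ale = ale_1d(model, X, feature=0)
print(ale[-1] - ale[0] > 0)  # effect increases across the range → True
```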
Overview of Surrogate Decision Tree: A global surrogate, represented by an interpretable decision tree, is trained to emulate the predictions of the random forest model. Researchers develop this tree using an R package (version 4.1.16) and prune it to ensure interpretability. A conservative complexity parameter produces an easily interpretable final surrogate decision tree.
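In Python terms, a global surrogate amounts to fitting a shallow tree to the black-box model's predictions rather than the raw outcome. This sketch uses scikit-learn's cost-complexity pruning parameter (`ccp_alpha`) as a stand-in for the conservative complexity parameter the study applies in R; the data and settings are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor, export_text

X, y = make_regression(n_samples=400, n_features=6, n_informative=3,
                       noise=5.0, random_state=0)

# The complex model whose behavior we want to explain.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Global surrogate: a shallow, pruned tree fit to the forest's *predictions*.
surrogate = DecisionTreeRegressor(max_depth=3, ccp_alpha=1.0, random_state=0)
surrogate.fit(X, rf.predict(X))

# Fidelity: how much of the forest's behavior the surrogate reproduces (R2).
fidelity = surrogate.score(X, rf.predict(X))
print(f"surrogate fidelity R2 = {fidelity:.2f}")
print(export_text(surrogate, max_depth=2))
```

A surrogate is only as trustworthy as its fidelity, so reporting this score alongside the tree is good practice.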
Local Model-Agnostic Interpretations: Unlike the global surrogate, local interpretable model-agnostic explanations are surrogates for individual predictions. Researchers generate these local models using the lime R package (version 0.5.3), utilizing the plot_features() function to identify the model features influencing predictions for specific counties. The primary results feature local models for two exemplary counties, representing the lower and higher ends of the obesity prevalence distribution.
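The core idea behind such local surrogates can be sketched without the lime package itself: perturb the data around one instance, weight the perturbations by proximity, and read local feature influences off a weighted linear fit. Everything below, including the data, kernel, and helper name, is an illustrative assumption rather than lime's actual implementation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

def explain_locally(model, x0, scale, n_samples=2000, kernel_width=1.0, seed=0):
    """LIME-style explanation: sample perturbations around one instance,
    weight them by proximity, and fit a weighted linear surrogate whose
    coefficients indicate each feature's local influence."""
    rng = np.random.default_rng(seed)
    Z = x0 + rng.normal(scale=scale, size=(n_samples, x0.size))
    preds = model.predict(Z)
    # Exponential kernel: nearby perturbations count more in the fit.
    dist = np.linalg.norm((Z - x0) / scale, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    local = Ridge(alpha=1.0).fit(Z, preds, sample_weight=weights)
    return local.coef_  # local feature influences for this one "county"

coefs = explain_locally(model, X[0], scale=X.std(axis=0))
top = int(np.abs(coefs).argmax())
print(f"most influential feature locally: {top}")
```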
Study Results
In this study of 3,142 US counties, the average adult obesity prevalence was 35.7%, showing significant geographic variation. Researchers developed two prediction models, which collectively explained 79% of this variation. Physical inactivity emerged as the most influential factor, followed by diabetes and smoking. Surrogate models achieved local interpretability, providing insights into specific counties' obesity prevalence drivers. This research pioneers an interpretable machine learning model for county-level obesity prevalence using XAI. While researchers gained valuable insights, limitations such as self-reported data and a cross-sectional design hinder the ability to make causal claims. Nonetheless, this approach opens doors to better understanding and addressing the obesity crisis with data-driven precision.
Conclusion
In summary, XAI methods enhance transparency and interpretability in machine learning models, particularly in fields like obesity and biomedicine. XAI reveals crucial insights, such as the significance of physical inactivity in obesity, while promoting trust and ethical oversight in machine learning applications.
Additionally, XAI's transparency enables personalized treatment plans through local interpretable model-agnostic explanations, allowing for tailored interventions based on the unique characteristics of counties or individuals. Ultimately, XAI empowers researchers and clinicians to tackle the complexities of obesity more effectively, leading to improved prevention and treatment strategies.