Unveiling Obesity Determinants: Interpretable Machine Learning at the County Level

In a paper published in the journal PLOS ONE, researchers used explainable artificial intelligence (XAI) techniques to uncover the factors behind geographic variation in obesity prevalence across 3,142 counties in the United States of America (USA). Their findings revealed that machine learning models could explain 79% of the variance in obesity rates, pinpointing the prevalence of physical inactivity, diabetes, and smoking as the most significant contributors to these disparities.

Study: Unveiling Obesity Determinants: Interpretable Machine Learning at the County Level. Image credit: Tero Vesalainen/Shutterstock

Background

Understanding the primary determinants of health, particularly in the context of obesity research, is a critical endeavor. Many health behaviors and environmental factors influence the obesity crisis and exhibit significant geographic disparities within the USA. While machine learning offers a potent tool for modeling variation in obesity prevalence, such models are often opaque and difficult to interpret. XAI has emerged in response to this challenge, making the insights captured by predictive models explicit and understandable.

Proposed Method

Data Acquisition: This study adheres to the cross-sectional study reporting guidelines of the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement [14, S2 File]. It draws upon 2022 County Health Rankings data, which aggregates health-related statistics across 3,142 US counties. Institutional review board approval is not required because the data are publicly available and non-identifiable. County-level obesity prevalence, defined as the percentage of adults with a body mass index of ≥ 30 kg/m², is the primary outcome for all analyses. The Behavioral Risk Factor Surveillance System calculates obesity prevalence from self-reported height and weight data. The dataset encompasses 64 variables spanning seven broad categories: health outcomes, health behaviors, clinical care, social and economic factors, physical environment, demographics, and severe housing conditions.

Data Preparation: Analyses are conducted in R version 4.2.1, with the corresponding R script provided in the S1 File. Variables with over 10% missing data are omitted, as is one variable from each highly correlated pair (absolute Spearman correlation > 0.90).
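
As a rough illustration of this screening step, assuming the County Health Rankings measures sit in a numeric data frame named chr (a hypothetical name), the filtering might look as follows; caret's findCorrelation() is used as a convenient, heuristic stand-in for the pairwise rule described above.

    library(caret)

    # Drop variables with more than 10% missing values
    chr <- chr[, colMeans(is.na(chr)) <= 0.10]

    # Remove one variable from each highly correlated pair (|Spearman rho| > 0.90);
    # findCorrelation() applies a heuristic, so it approximates the rule stated above
    rho  <- cor(chr, method = "spearman", use = "pairwise.complete.obs")
    drop <- findCorrelation(rho, cutoff = 0.90)
    if (length(drop) > 0) chr <- chr[, -drop]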

The analysis excludes values marked as unreliable in the County Health Rankings dataset, leaving 65 variables across 3,142 counties. Model training and evaluation employ a stratified 2-fold cross-validation scheme, with the groupdata2 R package (version 2.0.2) handling the partitioning.
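
Continuing the sketch, the stratified split could be reproduced along these lines with groupdata2, balancing the two folds on the outcome; the column name adult_obesity is a placeholder rather than the variable name used in the paper.

    library(groupdata2)

    set.seed(1)  # arbitrary seed; the article does not report one
    chr_folds <- fold(chr, k = 2, num_col = "adult_obesity")  # balance folds on the outcome
    train_raw <- as.data.frame(chr_folds[chr_folds$.folds == "1", names(chr)])
    eval_raw  <- as.data.frame(chr_folds[chr_folds$.folds == "2", names(chr)])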

Missing values are estimated separately for each data partition using the multivariate imputation by chained equations (mice) R package, with 10 imputed datasets and 100 iterations each. The final analysis uses the median of the imputed values, with convergence assessed through trace lines of means and standard deviations. Notably, the prevalence of adults with obesity, the primary outcome, is not used to impute any variable.
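
A minimal sketch of this imputation step with the mice package, run on one fold, might read as follows; the settings mirror the description above, and the outcome is excluded from every imputation model.

    library(mice)

    # Keep the outcome out of every imputation model
    pred_mat <- make.predictorMatrix(train_raw)
    pred_mat[, "adult_obesity"] <- 0

    imp <- mice(train_raw, m = 10, maxit = 100,
                predictorMatrix = pred_mat, printFlag = FALSE)
    plot(imp)  # trace lines of means and SDs to check convergence

    # Collapse the 10 completed datasets to cell-wise medians
    completed <- complete(imp, action = "all")
    train_imp <- as.data.frame(apply(simplify2array(lapply(completed, data.matrix)),
                                     c(1, 2), median))

    # The evaluation fold (eval_raw) is imputed in the same way, separately, to give eval_imp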

Analytical Approach: The study employs the iterative random forest R package (version 3.0.0) to construct a random forest prediction model for county-level obesity prevalence within the 2-fold cross-validation scheme described above. The iterative procedure enhances a standard random forest by producing more stable estimates of feature importance.

The modeling algorithm grows a forest of 1,000 decision trees separately for each data partition, with each decision tree relying on a random subset of 8 features drawn from the 64 available. Feature importance is estimated from the variance explained in the outcome across all decision trees. Researchers then fit a second prediction model in which feature sampling is weighted according to the feature importances from the first model.

This process iterates 100 times, keeping the best-performing model based on out-of-bag error. Researchers assess model performance using the variance explained and the mean absolute difference between predicted and actual prevalence in the evaluation data.
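
The iterative re-weighting idea can be illustrated as below. This sketch substitutes the ranger package for the iterative random forest package used in the study, because ranger exposes per-feature split weights directly; it mirrors the described procedure (1,000 trees, 8 candidate features per split, 100 weighted iterations, selection by out-of-bag error) rather than reproducing the authors' exact implementation.

    library(ranger)

    features <- setdiff(names(train_imp), "adult_obesity")
    w    <- rep(1 / length(features), length(features))  # start from uniform feature weights
    best <- NULL

    for (i in 1:100) {
      rf <- ranger(dependent.variable.name = "adult_obesity", data = train_imp,
                   num.trees = 1000, mtry = 8, importance = "impurity",
                   split.select.weights = w)
      if (is.null(best) || rf$prediction.error < best$prediction.error) best <- rf
      imp_scores <- rf$variable.importance[features]
      w <- pmax(imp_scores, 0) / sum(pmax(imp_scores, 0))  # re-weight features by importance
    }

    # Performance on the held-out fold: variance explained and mean absolute error
    pred <- predict(best, data = eval_imp)$predictions
    obs  <- eval_imp$adult_obesity
    r2   <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
    mae  <- mean(abs(obs - pred))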

Insights from Local Effects: While the random forest model reveals feature importance, it does not indicate the direction of the relationships. Accumulated local effects (ALE) plots address this gap by showing how predicted obesity prevalence changes as feature values vary within specific ranges.
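
As one way to produce such plots, the iml package can compute accumulated local effects for the forest fitted above; the article does not state which implementation the authors used, and physical_inactivity is a placeholder column name.

    library(iml)

    # Wrap the ranger model so iml knows how to obtain predictions from it
    pred_fun  <- function(model, newdata) predict(model, data = newdata)$predictions
    predictor <- Predictor$new(best, data = eval_imp[features],
                               y = eval_imp$adult_obesity,
                               predict.function = pred_fun)

    ale <- FeatureEffect$new(predictor, feature = "physical_inactivity", method = "ale")
    plot(ale)  # predicted obesity prevalence as physical inactivity varies over its range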

Overview of Surrogate Decision Tree: A global surrogate, represented by an interpretable decision tree, is trained to emulate the predictions of the random forest model. Researchers develop this tree using the rpart R package (version 4.1.16) and prune it to keep it interpretable. A conservative complexity parameter yields a compact final surrogate decision tree.
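
A global surrogate along these lines could be fitted by regressing the forest's predictions on the original features and pruning on the complexity parameter; the cp values shown are illustrative rather than those chosen in the paper.

    library(rpart)
    library(rpart.plot)

    # Train the surrogate on the forest's predictions, not on the raw outcome
    surr_df <- cbind(train_imp[features],
                     rf_pred = predict(best, data = train_imp)$predictions)
    surr <- rpart(rf_pred ~ ., data = surr_df,
                  control = rpart.control(cp = 0.001))

    surr_pruned <- prune(surr, cp = 0.02)  # a conservative cp keeps the tree readable
    rpart.plot(surr_pruned)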

Local Model-Agnostic Interpretations: Unlike the global surrogate, local interpretable model-agnostic explanations are surrogates for individual predictions. Researchers generate these local models using the lime R package (version 0.5.3), utilizing the plot_features() function to identify the model features influencing predictions for specific counties. The primary results feature local models for two exemplary counties, representing the lower and higher ends of the obesity prevalence distribution.
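
The per-county explanations could be generated roughly as follows with the lime package; because the forest in the earlier sketch is a ranger object, the code defines the two S3 hooks lime expects for custom regression models, and the row indices stand in for the two exemplary counties.

    library(lime)

    # Tell lime how to treat the ranger model (regression) and how to get predictions
    model_type.ranger    <- function(x, ...) "regression"
    predict_model.ranger <- function(x, newdata, ...) {
      data.frame(Response = predict(x, data = newdata)$predictions)
    }

    explainer   <- lime(train_imp[features], best)
    explanation <- explain(eval_imp[c(1, 2), features],  # two exemplary counties (placeholder rows)
                           explainer, n_features = 5)
    plot_features(explanation)  # features pushing each county's prediction up or down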

Study Results

In this study of 3,142 US counties, the average adult obesity prevalence was 35.7%, showing significant geographic variation. Researchers developed two prediction models, which collectively explained 79% of this variation. Physical inactivity emerged as the most influential factor, followed by diabetes and smoking prevalence. Surrogate models provided local interpretability, offering insights into the drivers of obesity prevalence in specific counties.

This research pioneers an interpretable machine learning model for county-level obesity prevalence using XAI. While researchers gained valuable insights, limitations such as self-reported data and the cross-sectional design preclude causal claims. Nonetheless, the approach opens doors to better understanding and addressing the obesity crisis with data-driven precision.

Conclusion

In summary, XAI methods enhance transparency and interpretability in machine learning models, particularly in fields like obesity and biomedicine. XAI reveals crucial insights, such as the significance of physical inactivity in obesity, while promoting trust and ethical oversight in machine learning applications.

Additionally, XAI's transparency enables personalized treatment plans through local interpretable model-agnostic explanations, allowing for tailored interventions based on the unique characteristics of counties or individuals. Ultimately, XAI empowers researchers and clinicians to tackle the complexities of obesity more effectively, leading to improved prevention and treatment strategies.


Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.

