In a recent paper submitted to the arXiv* server, researchers presented a machine-learning approach for predicting early dropouts in an active and healthy aging app. The proposed algorithms were submitted to the International Federation of Medical and Biological Engineering (IFMBE) Scientific Challenge 2022. The results demonstrate that machine learning algorithms can offer high-quality adherence predictions.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
The IFMBE Scientific Challenge 2022 aimed to predict early dropouts by analyzing users' acquisition patterns. The problem was framed as a binary classification task: using machine learning algorithms to forecast a user's adherence over the next three scheduled acquisitions, based on past acquisitions and user characteristics.
Dataset preparation
Dataset and Features: The Moving Active and Healthy Aging (MAHA) dataset contains approximately 400 users with demographic features, acquisitions, and answered questionnaires organized in 10 tables. The data consist of users' acquisitions per activity, socio-demographic characteristics, acceptability of the application, application logs, and participants' quality of life. The questionnaires include the Self Perception Questionnaire (SPQ), the Unified Theory of Acceptance and Use of Technology (UTAUT), the EQ-5D-3L, and the University of California, Los Angeles (UCLA) Loneliness Scale. The challenge comprised two phases, and the combined datasets from both phases were used.
Pre-processing: To prepare the data for machine learning, pre-processing was necessary. For the dynamic features (the number of acquisitions per activity), users with inconsistent statuses or too few interactions were discarded, reducing the dataset to 463 users after data cleansing. Each user's active period was calculated, and sessions were divided into weekly intervals. Finally, the researchers produced 15-session sets by sliding a window over every possible consecutive span within each participant's active period. The resulting dataset comprised 84111 rows, each representing a set of session acquisitions. The adherence target for each set was obtained by summing its last three acquisitions.
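The windowing step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the assumption is that each user's active period is a chronological list of weekly acquisition counts, with the 15-session window and three-session target taken from the description.

```python
# Minimal sketch of the sliding-window session sets described above.
# Assumption: each user's active period is a chronological list of
# weekly session acquisition counts.

WINDOW = 15          # 12 input sessions + 3 future sessions per set
TARGET_SESSIONS = 3  # the last three acquisitions form the adherence target

def session_sets(acquisitions):
    """Yield (inputs, target) pairs from one user's weekly acquisition counts."""
    for start in range(len(acquisitions) - WINDOW + 1):
        window = acquisitions[start:start + WINDOW]
        inputs = window[:-TARGET_SESSIONS]
        # Adherence target: the sum of the last three session acquisitions.
        target = sum(window[-TARGET_SESSIONS:])
        yield inputs, target

# Example: a user active for 17 weeks yields 3 overlapping 15-session sets.
user_weeks = [2, 0, 1, 3, 0, 0, 2, 1, 4, 0, 1, 2, 3, 0, 1, 2, 0]
sets = list(session_sets(user_weeks))
```

Applying the same windowing to every user and concatenating the results would produce the row-per-window dataset the researchers describe.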
Static features: The questionnaires contain null values, and their reliability was assessed with Cronbach's alpha. The demographic features show that most users are elderly and have limited technological experience.
Final datasets and imbalance: After combining the dynamic and static features, seven datasets were generated; dataset six, for instance, contains both acquisitions and static information. The data are imbalanced, with most samples showing low adherence, so oversampling techniques were used to reduce the skew. Null values were handled with mode imputation, and normalization was applied to improve model performance.
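The imputation and normalization steps can be sketched in plain Python; the column values below are illustrative assumptions, not data from the paper.

```python
# Sketch of mode imputation and min-max normalization as described above.
from statistics import mode

def impute_mode(column):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in column if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in column]

def min_max(column):
    """Scale values to [0, 1], which helps distance- and gradient-based models."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical "age" column with missing entries; its mode is 71.
age = impute_mode([71, None, 68, 71, 75, None])
age_scaled = min_max(age)
```

In practice these steps are usually applied per feature column before model training.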
Issue of duplicate session tuples
The MAHA dataset poses a challenge with duplicate session data. Because each session in a 12-session input tuple is encoded simply as low or high acquisition, only 2^12 = 4096 distinct tuples are possible, and the 84111 rows contain just 3948 distinct ones. Even when the number of acquisitions per session (up to four) is retained, the dataset has only 26924 unique 12-tuple sessions, so many samples appear multiple times. Combined with the class imbalance, this biases classification algorithms toward the majority class (low adherence). Addressing the duplicate data issue is therefore crucial for improving model generalization, preventing concept-learning problems, and enabling accurate classification of the minority class (high adherence).
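The combinatorial limit behind the duplication is easy to verify: with a binary low/high encoding, a 12-session tuple can take at most 2^12 values, so any dataset with more rows than that must contain repeats. The toy rows below are synthetic, purely to illustrate the pigeonhole effect.

```python
# With a binary (low/high) encoding, a 12-session input tuple can take
# at most 2**12 = 4096 distinct values, so 84111 rows necessarily repeat.
binary_tuples = 2 ** 12
assert binary_tuples == 4096

# Toy illustration with 3-session tuples: 20 synthetic rows can contain
# at most 2**3 = 8 distinct tuples, so duplicates are unavoidable.
rows = [(a % 2, (a // 2) % 2, (a // 3) % 2) for a in range(20)]
unique_rows = set(rows)
```

The same `set`-based count over the real 12-tuples is what yields the 3948 and 26924 figures quoted above.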
Study Results
Local Evaluation: Various classification algorithms were employed, including Random Forest (RF), k-nearest neighbor (kNN), XGBoost, and Multi-Layer Perceptron (MLP). The classifiers were evaluated locally on each dataset using 10-fold cross-validation. The MLP and XGBoost models showed superior performance, correctly predicting high adherence; the conventional machine learning algorithms performed only slightly worse while being faster to train.
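A local evaluation protocol of this kind can be sketched with scikit-learn. The synthetic data below stand in for the challenge dataset (the class weights only mimic its low/high adherence imbalance), and the hyperparameters are illustrative assumptions, not the authors' settings.

```python
# Sketch of 10-fold cross-validation over several classifiers,
# on synthetic stand-in data with a low/high adherence imbalance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# 12 features echo the 12-session input tuples; weights skew the classes.
X, y = make_classification(n_samples=500, n_features=12,
                           weights=[0.8, 0.2], random_state=0)

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
}
# Mean accuracy over 10 stratified folds for each model.
scores = {name: cross_val_score(m, X, y, cv=10).mean()
          for name, m in models.items()}
```

XGBoost would slot into the same loop via its scikit-learn-compatible `XGBClassifier` wrapper.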
Feature importance: For the RF model, the last two acquisition sessions most strongly influenced the classifier's behavior. In later datasets, the week number and demographic features became more influential, indicating that certain dates matter and that age and technological experience affect adherence.
Oversampling techniques: To address the class imbalance, four oversampling techniques were used: random oversampling, the Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), and a conditional tabular Generative Adversarial Network (CTGAN). The MLP model with oversampling provided better results, significantly improving the classification scores over the baseline, and SMOTE outperformed the other techniques.
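The core SMOTE idea can be hand-rolled in a few lines: synthesize new minority samples by interpolating between a minority sample and one of its nearest minority neighbors. This is a mechanism sketch only; the challenge work would have relied on a library implementation such as imbalanced-learn, and the sample points below are made up.

```python
# Minimal hand-rolled sketch of the SMOTE mechanism.
import numpy as np

def smote_like(minority, n_new, k=3, rng=None):
    """Generate n_new synthetic points among the minority samples."""
    if rng is None:
        rng = np.random.default_rng(0)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # Indices of the k nearest minority neighbors of x (excluding x).
        dists = np.linalg.norm(minority - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        x_nn = minority[rng.choice(neighbors)]
        # New point on the segment between x and the chosen neighbor.
        synthetic.append(x + rng.random() * (x_nn - x))
    return np.array(synthetic)

# Hypothetical 2-D minority (high-adherence) samples.
high_adherence = [[1.0, 9.0], [2.0, 8.0], [1.5, 9.5], [2.2, 8.4]]
new_samples = smote_like(high_adherence, n_new=6)
```

Because each synthetic point lies on a segment between two existing minority samples, the new points stay inside the minority region rather than duplicating rows exactly, which is what distinguishes SMOTE from random oversampling.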
Official challenge results: In Phases I and II of the challenge, different classifiers were submitted using various datasets. The MLP model performed consistently well in both phases, while XGBoost showed slightly better local evaluation but struggled to generalize. The ensemble method did not outperform the single MLP classifier. Oversampling techniques produced varying results, with SMOTE providing the highest-quality model and winning the challenge. There was a significant difference in classification performance between local evaluation and official results on some datasets, likely due to a dataset shift.
Conclusion
In summary, addressing the adherence problem has significant implications for improving the quality of life of elderly individuals and promoting healthy aging. The IFMBE Scientific Challenge 2022 involved predicting user adherence through binary classification. The researchers applied pre-processing techniques to generate the final datasets and tested various binary classification methods. The MLP and XGBoost models slightly outperformed conventional algorithms in local evaluation, but the conventional models trained faster. Features other than consecutive session acquisitions did not significantly impact adherence prediction. Oversampling, especially with the SMOTE algorithm, yielded the highest classification performance. Challenges included dataset interpretation, concept learning, noisy data, and class imbalance. Future work aims to address these issues and to analyze feature performance on the evaluation set.
Journal reference:
- Preliminary scientific report.
Perifanis, V., Michailidi, I., Stamatelatos, G., Drosatos, G., and Efraimidis, P. S. (2023). Predicting Early Dropouts of an Active and Healthy Ageing App. arXiv. DOI: https://doi.org/10.48550/arXiv.2308.00539, https://arxiv.org/abs/2308.00539