In a recent paper published in the journal Scientific Reports, researchers evaluated the efficacy of machine learning (ML) algorithms in radiomics across varied clinical questions and identified strategies that remain reliable regardless of the dataset.
Background
Radiomics involves quantitatively extracting large numbers of features from medical images to uncover predictive and diagnostic biomarkers. The approach has shown promise when paired with ML techniques that reveal information hidden in the images. However, the absence of established methodological norms remains an obstacle to deploying radiomic biomarkers in clinical practice.
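As an illustration of the feature-extraction step, the sketch below uses the open-source pyradiomics package; this tooling choice and the file names are assumptions made for illustration, since the study does not prescribe a specific extraction workflow.

```python
# Minimal radiomic feature-extraction sketch using pyradiomics
# (illustrative assumption; the study does not specify its extraction tooling).
from radiomics import featureextractor

# Hypothetical image and ROI-mask paths (e.g., NIfTI files).
image_path = "patient_001_ct.nii.gz"
mask_path = "patient_001_roi.nii.gz"

extractor = featureextractor.RadiomicsFeatureExtractor()

# Returns an ordered dict mapping feature names (shape, first-order,
# texture, ...) to scalar values for this patient's ROI.
features = extractor.execute(image_path, mask_path)
for name, value in features.items():
    if not name.startswith("diagnostics_"):
        print(name, value)
```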
A radiomics study comprises cohort constitution, image acquisition, region of interest (ROI) segmentation, feature extraction, modeling, and validation. Modeling itself consists of feature selection and prediction, each of which can be performed with many different algorithms, yielding a large number of possible combinations. No consensus exists on which algorithms are preferable for radiomics, and although some studies test many algorithms, doing so increases the risk of false-positive findings.
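For illustration, the modeling step can be expressed as a pipeline that chains one feature selection method with one classifier; the particular algorithms below (ANOVA F-test selection and logistic regression in scikit-learn) are an assumed example, not the paper's recommended combination.

```python
# Generic feature-selection + classifier pipeline (illustrative choice of
# algorithms; the study compares many such combinations).
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline([
    ("scale", StandardScaler()),               # radiomic features have very different ranges
    ("select", SelectKBest(f_classif, k=20)),  # keep the 20 most discriminative features
    ("clf", LogisticRegression(max_iter=1000)),
])
# model.fit(X_train, y_train); model.predict_proba(X_test)
```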
Despite initiatives such as the Radiomics Quality Score, adherence remains low. Biases can emerge from dataset differences, operator variability, and algorithm overfitting. The objective of this study was to identify combinations of algorithms that achieve stable radiomic performance, thereby guiding modeling decisions and highlighting the key factors that influence performance.
Methodology overview
Investigating radiomic performance
To assess the impact of method and algorithm choices on model performance, ten datasets from published radiomics studies were employed. The research received ethical approval and adhered to the Declaration of Helsinki, a set of ethical principles governing human experimentation. The datasets covered COVID-19, sarcopenia, head and neck, uterine mass, and orbital lesion cases, among others. They comprised radiomic features extracted from varied imaging modalities for binary diagnostic tasks, with 97 to 693 patients and 105 to 606 features per sample.
The evaluation employed seven commonly used feature selection algorithms and 14 binary classifiers. Each combination of feature selection algorithm and classifier was tuned to maximize the area under the receiver operating characteristic curve (AUC) using a grid search within a nested cross-validation strategy. The procedure was repeated across the ten datasets with different train-test splits and varying numbers of selected features, yielding a total of 13,020 computed AUCs.
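A minimal sketch of such a nested cross-validation, with an inner grid search over the number of selected features scored by AUC, is shown below; the specific algorithms, fold counts, and synthetic data are illustrative assumptions rather than the study's actual setup.

```python
# Nested cross-validation sketch: the inner loop tunes the number of selected
# features by grid search on AUC; the outer loop estimates generalization AUC.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a radiomic feature matrix X and binary labels y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = rng.integers(0, 2, size=100)

pipe = Pipeline([
    ("select", SelectKBest(f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"select__k": [5, 10, 20, 40]}   # illustrative feature counts

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=inner_cv)
nested_auc = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
print(f"nested-CV AUC: {nested_auc.mean():.2f} +/- {nested_auc.std():.2f}")
```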
Statistical analysis of the results
To assess the variability of the AUC, the researchers used a multifactor analysis of variance (ANOVA). The factors analyzed for their association with AUC were the dataset, feature selection algorithm, classifier, number of selected features, imaging modality, train-test split, and their interaction effects. The impact of each factor or interaction was quantified as the proportion of variance it explained. Results were reported as frequencies or ranges.
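A minimal sketch of such a variance decomposition, using statsmodels on a hypothetical table of AUC results, is shown below; the column names, factor levels, and the proportion-of-variance calculation are illustrative assumptions rather than the authors' actual code.

```python
# Multifactor ANOVA sketch: partition the variance of AUC across factors.
# The results table here is synthetic; in practice each row would be one
# computed AUC with its corresponding factor levels.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "dataset": rng.choice(["A", "B", "C"], n),
    "selector": rng.choice(["JMI", "ANOVA", "random"], n),
    "classifier": rng.choice(["RF", "LDA", "ridge"], n),
    "split": rng.choice(["s1", "s2", "s3"], n),
    "auc": rng.uniform(0.5, 0.9, n),
})

model = smf.ols(
    "auc ~ C(dataset) + C(selector) + C(classifier) + C(split) "
    "+ C(dataset):C(classifier)",
    data=df,
).fit()
table = anova_lm(model, typ=2)
# Proportion of variance explained by each factor or interaction.
table["variance_share"] = table["sum_sq"] / table["sum_sq"].sum()
print(table[["sum_sq", "variance_share"]])
```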
Median AUC, first quartile (Q1), and third quartile (Q3) were calculated for every feature selection algorithm, classifier, dataset, and split, and boxplots were used to illustrate the outcomes. Friedman tests, followed by Nemenyi-Friedman post-hoc tests, were used to compare the median AUCs of the algorithms. Heatmaps were additionally employed to portray feature selection and classifier results.
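The comparison step can be sketched with scipy and the scikit-posthocs package, as below; the tooling choice and the median-AUC values are assumptions made for illustration.

```python
# Friedman test across algorithms, followed by a Nemenyi post-hoc test
# (illustrative tooling; the median-AUC values below are made up).
import numpy as np
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare

# Rows = datasets (blocks), columns = algorithms; each cell is a median AUC.
median_auc = np.array([
    [0.74, 0.70, 0.65],
    [0.68, 0.66, 0.60],
    [0.72, 0.69, 0.63],
    [0.66, 0.64, 0.58],
])

stat, p = friedmanchisquare(*median_auc.T)   # one sample per algorithm
print(f"Friedman chi2={stat:.2f}, p={p:.3f}")

# Pairwise Nemenyi comparison of the algorithms (columns).
print(sp.posthoc_nemenyi_friedman(median_auc))
```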
Key findings
The AUC values spanned 0.20 to 0.91 across all combinations, with 3.4% of values below 0.5. The multifactor ANOVA attributed 55% of the variance in modeling performance to the identified factors and their interactions. The dataset contributed the most (17%), followed by the classifier (10%) and the train-test split (9%). Feature selection accounted for 2%, while the number of selected features and the imaging modality each accounted for less than 1%. The remaining 17% was explained by interactions between the factors.
Information theory-based methods such as Joint Mutual Information Maximization (JMIM) and Joint Mutual Information (JMI) excelled, consistently delivering good outcomes, and all feature selection algorithms outperformed random feature selection. Among classifiers, linear methods and random forests performed consistently well, whereas others, such as k-nearest neighbors (KNN) and XGBoost, showed only occasional high performance.
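To give a flavour of these information theory-based selectors, the sketch below implements a simplified greedy, JMIM-style selection on discretized features; it is an approximation for illustration, not the implementation evaluated in the study.

```python
# Simplified JMIM-style greedy feature selection on discretized features
# (illustrative approximation of the idea behind JMIM/JMI).
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.preprocessing import KBinsDiscretizer

def joint_mi(xa, xb, y, n_bins):
    """Mutual information between the feature pair (xa, xb) and the target y."""
    paired = xa * n_bins + xb          # encode the pair as one discrete variable
    return mutual_info_score(paired, y)

def jmim_select(X, y, n_selected=10, n_bins=5):
    """Greedy JMIM-style selection: maximize the minimum joint MI with selected features."""
    disc = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="quantile")
    Xd = disc.fit_transform(X).astype(int)
    remaining = list(range(Xd.shape[1]))
    # Start with the single feature most informative about y.
    first = max(remaining, key=lambda k: mutual_info_score(Xd[:, k], y))
    selected = [first]
    remaining.remove(first)
    while len(selected) < n_selected and remaining:
        scores = {k: min(joint_mi(Xd[:, k], Xd[:, j], y, n_bins) for j in selected)
                  for k in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Demo on synthetic data: the labels depend on features 0 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
print(jmim_select(X, y, n_selected=5))
```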
Across all feature selection algorithms and classifiers, the heatmap showed median AUC values ranging from 0.57 to 0.74. The best-performing combinations paired the top feature selection algorithms with the top classifiers.
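A selector-by-classifier heatmap of median AUCs of this kind can be drawn with pandas and seaborn, as in the sketch below; the algorithm names and values are synthetic placeholders.

```python
# Median-AUC heatmap sketch: rows = feature selection algorithms,
# columns = classifiers (synthetic values for illustration).
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

selectors = ["JMIM", "JMI", "ANOVA", "random"]
classifiers = ["RF", "LDA", "ridge", "KNN"]
rng = np.random.default_rng(0)
median_auc = pd.DataFrame(rng.uniform(0.57, 0.74, (4, 4)),
                          index=selectors, columns=classifiers)

sns.heatmap(median_auc, annot=True, fmt=".2f", cmap="viridis",
            cbar_kws={"label": "median AUC"})
plt.xlabel("classifier")
plt.ylabel("feature selection")
plt.tight_layout()
plt.show()
```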
Key insights from the study
This study scrutinized diverse combinations of feature selection algorithms and classifiers across ten datasets. The dataset had the largest impact on performance variation, reflecting the amount of predictive information the data contain. Information theory-based feature selection algorithms consistently outperformed the others, even though the ANOVA attributed little of the overall variance to feature selection. Within a given dataset, the choice of classifier significantly affected performance. Some classifiers, such as Random Forest, Linear Discriminant Analysis, and Ridge Penalized Linear Regression, performed consistently well, yet no single algorithm outperformed the others on every dataset.
The findings of the current study aligned with prior research. Dataset characteristics and size affect overfitting and generalizability, and nested cross-validation offers a more robust evaluation. When testing multiple models, balancing optimization against overfitting is crucial to prevent false discoveries. Similar algorithms tend to yield comparable outcomes, so testing a small, balanced selection of classifiers from different families may be an efficient strategy. The use of ten datasets enhanced generalizability, although similarities between the datasets constrained certain analyses.
Conclusion
In summary, many diverse factors influence variation in radiomic model performance, including the dataset, the classifier, and the train-test split. Testing only a few feature selection and classifier combinations is advised to prevent false discoveries and overfitting. Information theory-based feature selection algorithms, together with penalized linear models and random forest classifiers, performed consistently well.