In an article published in the journal Nature, researchers focused on using artificial intelligence (AI), specifically the CatBoost algorithm, to predict the transition temperatures (Tc) of superconducting materials. Utilizing the SuperCon dataset, the study included data pre-processing and introduced the Jabir and Soraya packages for generating atomic descriptors and selecting crucial features.
The resulting model achieved high accuracy with R-squared (R2) and root mean square error (RMSE) values of 0.952 and 6.45 K, respectively. Additionally, a web application for predicting Tc values of superconducting materials was introduced as a novel contribution.
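For readers unfamiliar with the two metrics, the following minimal sketch shows how R2 and RMSE are conventionally computed. The function names and the Tc values are made up for illustration and are not taken from the study.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error, in the same unit as the target (kelvin here)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Made-up Tc values in kelvin, for illustration only (not from the study).
measured_tc = [4.0, 9.0, 39.0, 92.0]
predicted_tc = [5.0, 8.0, 37.0, 94.0]

print(f"R2 = {r_squared(measured_tc, predicted_tc):.3f}")
print(f"RMSE = {rmse(measured_tc, predicted_tc):.2f} K")
```

An R2 close to 1 means the model explains most of the variance in Tc, while the RMSE expresses the typical prediction error directly in kelvin.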
Background
Superconductivity, characterized by zero electrical resistance and the expulsion of magnetic fields, is a quantum-mechanical phenomenon that manifests on a macroscopic scale. Predicting the Tc of superconductors remains challenging: existing first-principles methods, including density functional theory (DFT), have difficulty handling strong electron correlations. Machine learning (ML) offers a data-driven alternative that has proved effective for predicting material properties. Recent studies have applied algorithms such as extreme gradient boosting (XGBoost), random forest, and convolutional neural networks (CNNs) to predict Tc values for superconducting materials, but gaps persist in establishing a comprehensive feature space and identifying the most important features.
This research addressed these gaps by treating the dataset and its feature space as central to model quality. It introduced the Jabir package to generate 322 atomic features, establishing a feature space better suited to superconducting Tc, and the Soraya package to select the most relevant of them. Previous studies used methods such as Magpie descriptors or crystal graph CNNs but did not focus closely on feature selection or on building an optimal feature space. The proposed CatBoost model outperformed prior work, with a higher R2 and a lower RMSE. The emphasis on dataset refinement and feature selection distinguished this research, contributing to "Data-Based Materials Science" and to more accurate prediction of superconducting material properties.
Data and computational methods
The researchers predicted the Tc of superconducting materials using the CatBoost algorithm and a carefully processed dataset, named DataG, derived from the SuperCon dataset. The source data, containing 33,407 compounds, underwent extensive cleaning to address missing and duplicated entries, problematic compounds, and outliers. The result was DataG, a refined dataset of 13,022 compounds.
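The kind of cleaning described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the record layout, the thresholds, and the rule of merging duplicates by their median are all hypothetical, since the article does not give the study's actual cleaning rules.

```python
from statistics import median


def clean_records(records, max_tc=250.0, max_spread=5.0):
    """Return one Tc value per formula after basic cleaning.

    All thresholds are illustrative assumptions, not the study's values.
    """
    # 1. Drop entries with a missing formula or a missing/non-positive Tc.
    valid = [
        r for r in records
        if r.get("formula") and r.get("tc") is not None and r["tc"] > 0
    ]

    # 2. Group repeated measurements of the same formula.
    by_formula = {}
    for r in valid:
        by_formula.setdefault(r["formula"].strip(), []).append(float(r["tc"]))

    cleaned = {}
    for formula, tcs in by_formula.items():
        # 3. Treat compounds whose reports disagree strongly as problematic.
        if max(tcs) - min(tcs) > max_spread:
            continue
        tc = median(tcs)
        # 4. Drop implausibly high values as outliers.
        if tc > max_tc:
            continue
        cleaned[formula] = tc
    return cleaned


# Toy entries for illustration only; "XyZ" is a made-up formula.
raw = [
    {"formula": "MgB2", "tc": 39.0},
    {"formula": "MgB2", "tc": 38.6},    # duplicate measurement
    {"formula": "Nb3Sn", "tc": None},   # missing target
    {"formula": "YBa2Cu3O7", "tc": 92.0},
    {"formula": "FeSe", "tc": 8.0},
    {"formula": "FeSe", "tc": 37.0},    # conflicting reports
    {"formula": "XyZ", "tc": 900.0},    # implausible outlier
]
print(clean_records(raw))
```

In this toy run only the two compounds with consistent, plausible values survive, which mirrors how a large raw collection can shrink substantially after cleaning.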
The CatBoost algorithm, an ensemble technique based on gradient-boosted decision trees, was chosen for its efficiency in handling large datasets. To represent the compounds, a new Python package called Jabir generated 322 atomic features for each one. With a feature space this large, feature selection became crucial, so the authors introduced the Soraya package, a hybrid method combining correlation analysis, the Shapley additive explanations (SHAP) method, and forward selection. This approach identified the most significant features while eliminating redundant ones.
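To make the idea of gradient-boosted trees concrete, here is a minimal pure-Python sketch of boosting with one-split trees (stumps) under squared loss. It is not CatBoost and not the authors' code: CatBoost adds refinements such as ordered boosting and symmetric trees that are omitted here, and every name and data value below is hypothetical.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) for the best single split.

    Assumes xs holds at least two distinct values.
    """
    best = None
    # Try every distinct value except the largest so both sides are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.3):
    """Gradient boosting for squared loss: each stump fits current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss the negative gradient is proportional to y - pred.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        threshold, left_value, right_value = stump
        preds = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, preds)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    total = base
    for threshold, left_value, right_value in stumps:
        total += learning_rate * (left_value if x <= threshold else right_value)
    return total


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data: one hypothetical descriptor per compound and a made-up Tc (K).
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
tc = [5.0, 6.0, 20.0, 22.0, 40.0, 41.0]

model = fit_boosted_stumps(descriptor, tc)
baseline_mse = mse(tc, [sum(tc) / len(tc)] * len(tc))
boosted_mse = mse(tc, [predict(model, x) for x in descriptor])
print(f"baseline MSE = {baseline_mse:.2f}, boosted MSE = {boosted_mse:.4f}")
```

Each round adds a small correction fitted to the errors of the ensemble so far, which is the core mechanism that libraries such as CatBoost and XGBoost build on.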
The research used these refined features to predict Tc values, with the reported accuracy resting on the quality of the dataset and on feature selection. The full methodology, spanning dataset preprocessing, feature selection, and ML modeling, advances the understanding and prediction of superconducting material properties.
Results
The study used its hybrid technique, the Soraya package, to select 30 significant features from the 322 generated, highlighting the importance of thermal conductivity in determining superconducting Tc. The CatBoost algorithm was then used to rank these features by importance, confirming the strong correlation (0.68) between thermal conductivity and Tc. For the refined dataset, DataG, comprising 13,022 superconducting materials, the CatBoost model predicted Tc values with an R2 of 0.952 and an RMSE of 6.45 K, surpassing previously reported results.
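The correlation-analysis part of such a feature selection can be sketched as below. This covers only one ingredient of the hybrid method (the SHAP ranking and forward selection steps are omitted), it assumes a Pearson coefficient, which the article does not specify for the reported 0.68, and all function names, feature names, thresholds, and values are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def drop_redundant(features, target, max_mutual=0.95):
    """Rank features by |r| with the target, then skip near-duplicates.

    A feature is kept only if its |r| with every already-kept feature
    stays below max_mutual (an illustrative threshold).
    """
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) < max_mutual for k in kept):
            kept.append(name)
    return kept


# Made-up values for illustration only; the feature names are hypothetical.
tc = [4.0, 9.0, 20.0, 39.0, 92.0]
features = {
    "thermal_conductivity": [10.0, 30.0, 25.0, 60.0, 80.0],
    # An exact rescaling of the feature above, so it carries no new information.
    "thermal_conductivity_scaled": [20.0, 60.0, 50.0, 120.0, 160.0],
    "atomic_mass": [50.0, 20.0, 45.0, 30.0, 40.0],
}

kept = drop_redundant(features, tc)
print(kept)
```

The rescaled copy is discarded because it is perfectly correlated with a feature that was already kept, which is the sense in which redundant features are eliminated.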
The methodology was extended to other datasets, DataS, DataK, and DataH, where it likewise improved the evaluation metrics. The model also demonstrated its predictive power by estimating Tc values for new, previously unreported iron-based superconducting compounds. Its predictions for compounds absent from the original dataset agreed closely with experimental results, validating its accuracy. The study thus both deepened the understanding of superconducting material characteristics and provided a robust, reliable ML model for Tc prediction.
Conclusion
In conclusion, the researchers leveraged AI, specifically the CatBoost algorithm, to predict the Tc of superconducting materials, presenting a novel approach in materials science. The development of the DataG dataset, consisting of 13,022 compounds, involved advanced data pre-processing techniques, while the newly designed Jabir package generated superior atomic features compared to existing methods.
The Soraya package, used for feature selection, significantly improved the prediction model by eliminating redundant features. Together, these steps improved the evaluation metrics across several datasets. The study's contributions, including the web application for Tc prediction, demonstrate the productive synergy between AI and materials science.