In an article published in the journal Machine Learning: Science and Technology, researchers explored the use of tree-based machine learning (ML) algorithms to predict the formation energy of impurities in two-dimensional (2D) materials. They employed several regression models and combined chemical features with structural features derived from Jacobi–Legendre polynomials to improve prediction accuracy.
The authors found that including structural features improved prediction accuracy, with better results for adsorbates than for interstitial defects. Once trained, the models offer a far cheaper alternative to repeating first-principles calculations while still providing insight into impurity properties.
Background
Nano-structured materials, particularly impurity structures in 2D materials, are pivotal in various applications like optical quantum technologies, energy storage, and sensing. Understanding the formation energy of these impurities is crucial, as it indicates the stability and feasibility of the nano-structure. Traditional methods like density functional theory (DFT) for calculating formation energy are time-consuming, highlighting the need for more efficient approaches.
Recent advances in ML offer a promising alternative. ML algorithms, including decision tree regression and gradient boosting methods, have shown potential in predicting material properties by leveraging physics-inspired descriptors. Despite these advancements, existing methods often lack integration of both chemical and structural features, which limits prediction accuracy and interpretability.
This paper addressed these gaps by employing ML techniques to predict the formation energy of impurities in 2D materials using a novel combination of chemical and structural features derived from Jacobi–Legendre polynomials. This approach enhanced prediction accuracy and reduced computational costs, offering a valuable tool for materials science research.
Methodological Approach and Data Analysis
The methodology for predicting the formation energy of impurities in 2D materials involved several key steps.
Data preprocessing: The study used data from the impurities in 2D materials (IMP2D) database, which contains 14,662 samples of adsorbate and interstitial impurities in 44 host materials. Each sample's formation energy was computed using DFT with the Perdew, Burke, and Ernzerhof (PBE) exchange-correlation functional. PBE was chosen for its computational efficiency and consistency, despite its limitations. Samples with outlier formation energies or convergence issues were filtered out, leaving 5,906 samples for analysis.
Feature creation: Features were divided into chemical and structural types. Chemical features included properties like atomic radius and electronegativity, while structural features were derived from Jacobi–Legendre polynomials, capturing the spatial arrangement of atoms around impurities.
ML models: The authors employed tree-based ML algorithms: random forest (RF) and several gradient boosting methods, namely gradient boosting regression, histogram-based gradient boosting regression, and light gradient boosting machine (LightGBM). These models used decision trees to predict formation energy from the prepared features. LightGBM, with techniques such as gradient-based one-side sampling (GOSS), was particularly noted for its efficiency in handling large datasets.
Computational details: Data was split into training, test, and blind-test sets. The blind-test set included samples from molybdenum disulfide (MoS2) and tungsten diselenide (W2Se4) hosts, not used during model training. Models were optimized using cross-validation, and their performance was assessed with metrics such as coefficient of determination (R²), mean absolute error (MAE), and root mean square error (RMSE).
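The splitting and scoring steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the record fields `host` and `e_form`, the hosts `hostA` and `hostB`, the 20% test fraction, and the toy numbers are all assumptions made for the example. Only the idea of holding out entire hosts as a blind-test set and the three metrics come from the article.

```python
import math
import random

# Hosts the article names as blind-test hosts (labels copied from the text).
BLIND_HOSTS = {"MoS2", "W2Se4"}

def split_by_host(samples, blind_hosts, test_fraction=0.2, seed=0):
    """Hold out whole hosts as a blind-test set, then split the rest at random."""
    blind = [s for s in samples if s["host"] in blind_hosts]
    rest = [s for s in samples if s["host"] not in blind_hosts]
    random.Random(seed).shuffle(rest)
    n_test = int(round(test_fraction * len(rest)))
    return rest[n_test:], rest[:n_test], blind  # train, test, blind-test

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination; undefined if all y_true values are equal."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy records: "hostA" and "hostB" are hypothetical hosts, energies are made up.
samples = [
    {"host": host, "e_form": 0.5 * i}
    for i, host in enumerate(["MoS2", "W2Se4"] + ["hostA", "hostB"] * 5)
]
train, test, blind = split_by_host(samples, BLIND_HOSTS)

# Scoring a constant "predict the mean" model on made-up targets.
y_true = [1.0, 2.0, 3.0]
y_pred = [2.0, 2.0, 2.0]
scores = {"MAE": mae(y_true, y_pred), "RMSE": rmse(y_true, y_pred), "R2": r2(y_true, y_pred)}
print(len(train), len(test), len(blind), scores)
```

Splitting by host rather than by sample means the blind-test score reflects how well a model transfers to host materials it has never seen, which a purely random split cannot show.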
Results and Discussion
The features were grouped into three sets: chemical features only, chemical and structural features together, and a subset of chemical features excluding certain parameters. Chemical features, such as chemical potential and electronegativity, were expected to be informative because of their link to formation energy and structural stability. Structural features were incorporated to enhance model performance, whereas removing some features (e.g., hostenergy/atom) did not significantly affect accuracy.
The ML models, including RF, gradient boosting regression, histogram gradient boosting regression, and LightGBM, were evaluated for prediction accuracy using RMSE, MAE, and R². Combining chemical and structural features improved the models' performance in predicting the formation energy of both adsorbates and interstitial defects. The results showed that LightGBM trained faster while achieving prediction accuracy comparable to that of the other models. Comparisons of RMSE scores and prediction times across models demonstrated the efficiency and robustness of LightGBM for this task.
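To make the idea behind the gradient boosting models more concrete, the sketch below fits a toy boosted ensemble of one-split decision trees (stumps) to a single made-up feature under squared-error loss. It is an illustrative simplification under stated assumptions, not the paper's models: it omits multi-feature trees, histogram binning, and LightGBM's GOSS, and the data, function names, and hyperparameters are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the threshold on a 1D feature that minimizes squared error.

    Assumes x contains at least two distinct values.
    """
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = 0.5 * (lo + hi)  # candidate split halfway between neighbors
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    return best[1], best[2], best[3]

def fit_boosting(x, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)  # start from the mean prediction
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        thr, lmean, rmean = fit_stump(x, residuals)
        stumps.append((thr, lmean, rmean))
        pred = [
            pi + learning_rate * (lmean if xi <= thr else rmean)
            for xi, pi in zip(x, pred)
        ]
    return base, learning_rate, stumps

def predict(model, xi):
    """Sum the shrunken stump outputs on top of the base prediction."""
    base, learning_rate, stumps = model
    return base + sum(
        learning_rate * (lmean if xi <= thr else rmean) for thr, lmean, rmean in stumps
    )

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up data: one descriptor value per sample and a step-like target.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 0.2, 0.1, 1.9, 2.1, 2.0]

model = fit_boosting(x, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [predict(model, xi) for xi in x])
print(baseline_mse, boosted_mse)
```

Each round adds a small correction fitted to what the ensemble still gets wrong, which is why the training error falls below that of the mean-only baseline. Production libraries add further optimizations on top of this loop, which is where differences in training time between implementations come from.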
Conclusion
In conclusion, the researchers used tree-based ML algorithms to predict the formation energy of impurities in 2D materials, combining chemical features with structural features derived from Jacobi–Legendre polynomials. They found that incorporating structural features improved accuracy, more so for adsorbates than for interstitial defects.
The authors highlighted that while LightGBM provided the fastest training times with competitive prediction accuracy, overall predictions were effective without needing host-specific features. This approach reduced computational costs and offered valuable insights into impurity properties, showcasing the potential of combining physically meaningful features with ML for accurate predictions in materials science. Future work could explore additional properties and advanced feature integrations.
Journal reference:
- Aniwat Kesorn, et al. (2024). Formation energy prediction of neutral single-atom impurities in 2D materials using tree-based machine learning. Machine Learning: Science and Technology, 5(3), 035039. DOI: 10.1088/2632-2153/ad66ae, https://iopscience.iop.org/article/10.1088/2632-2153/ad66ae