In a paper published in the journal Computational Materials Science, researchers introduced Alloy Bidirectional Encoder Representations from Transformers (AlloyBERT) to predict alloy properties such as elastic modulus and yield strength from textual inputs.
Utilizing the robustly optimized BERT approach (RoBERTa) and BERT encoder models with self-attention mechanisms, AlloyBERT achieved a lower mean squared error (MSE) on the multi-principal elemental alloys (MPEA) dataset and the refractory alloy yield strength dataset than traditional shallow models. The study highlights the potential of language models in materials science for accurate, text-based alloy property predictions.
Background
Past work in alloy discovery has highlighted the challenges of predicting alloy properties due to the vast number of possible combinations and the limitations of traditional methods like density functional theory (DFT) and machine learning (ML) models. Transformer-based models such as BERT and RoBERTa have shown potential in various fields, including materials science, for interpreting complex textual data and predicting material properties. However, the challenge remains in accurately breaking down and representing alloy data so that these models can effectively process and predict properties.
Model and Methodology
The model architecture is based on RoBERTa, a variant of BERT that employs a different pretraining procedure and has shown superior performance on several benchmarks. RoBERTa uses a transformer architecture that relies on self-attention and consists solely of an encoder. Each encoder layer combines a multi-head self-attention mechanism with a position-wise fully connected feed-forward network, which improves the model's ability to capture context and handle long-range dependencies.
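The following Python sketch, based on the Hugging Face transformers library, illustrates what such an encoder-only regressor can look like. The configuration values, class name, and first-token pooling choice are illustrative assumptions, not the paper's exact implementation.

```python
import torch
from torch import nn
from transformers import RobertaConfig, RobertaModel

# Illustrative configuration for a small RoBERTa-style encoder; the study's
# actual hidden size, layer count, and head count are not reproduced here.
config = RobertaConfig(
    vocab_size=30_000,        # assumed vocabulary size of a custom BPE tokenizer
    hidden_size=256,
    num_hidden_layers=6,      # stacked encoder blocks
    num_attention_heads=8,    # multi-head self-attention per block
    intermediate_size=1024,   # position-wise feed-forward width
)

class AlloyPropertyRegressor(nn.Module):
    """RoBERTa-style encoder followed by a single-output regression head."""

    def __init__(self, config: RobertaConfig):
        super().__init__()
        self.encoder = RobertaModel(config)
        self.head = nn.Linear(config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0, :]  # first-token summary of the sequence
        return self.head(pooled).squeeze(-1)     # one predicted property value per input

model = AlloyPropertyRegressor(config)
```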
The study utilizes two primary datasets: the MPEA dataset from Citrine Informatics, which contains 1546 entries on mechanical properties and Young's modulus, and the refractory alloy yield strength (RAYS) dataset, with 813 entries detailing alloy composition and testing temperatures drawn from prior literature. Both datasets were converted into textual descriptions incorporating detailed information about elemental composition and other properties, which also facilitated comparison with shallow machine learning models.
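As a rough illustration of this conversion step, the snippet below turns a single hypothetical record into a sentence-style description. The field names and wording are placeholders, not the templates used in the study, whose descriptions are considerably more detailed.

```python
# Minimal sketch: convert one tabular alloy record into a textual description.
def describe_alloy(record: dict) -> str:
    parts = [f"{frac * 100:.0f}% {el}" for el, frac in record["composition"].items()]
    text = f"This alloy is composed of {', '.join(parts)}."
    if "test_temperature_K" in record:
        text += f" It was tested at {record['test_temperature_K']} K."
    return text

example = {
    "composition": {"Nb": 0.25, "Mo": 0.25, "Ta": 0.25, "W": 0.25},
    "test_temperature_K": 1273,
}
print(describe_alloy(example))
# This alloy is composed of 25% Nb, 25% Mo, 25% Ta, 25% W. It was tested at 1273 K.
```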
During preprocessing, the MPEA dataset was refined to remove irrelevant columns and convert string-type features into one-hot encodings. The analysts parsed the chemical formulas of the alloys to create representations of their elemental composition. The RAYS dataset did not require additional cleaning. These preprocessing steps ensured effective training and evaluation of shallow models and prepared the data for comparison with AlloyBERT. Comprehensive textual descriptions were then generated, providing detailed information from the atomic to the microstructural level, which is crucial for the performance of downstream tasks.
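A minimal sketch of the formula-parsing step is shown below, assuming formulas written in the common style (e.g., Al0.5CoCrFeNi). The function name and the normalization to atomic fractions are illustrative choices, not the authors' exact code.

```python
import re

def parse_formula(formula: str) -> dict:
    """Split a chemical formula into elements and normalize to atomic fractions."""
    tokens = re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula)
    amounts = {el: float(amt) if amt else 1.0 for el, amt in tokens}
    total = sum(amounts.values())
    return {el: amt / total for el, amt in amounts.items()}

print(parse_formula("Al0.5CoCrFeNi"))
# {'Al': 0.111..., 'Co': 0.222..., 'Cr': 0.222..., 'Fe': 0.222..., 'Ni': 0.222...}
```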
The textual data was tokenized using a byte pair encoding (BPE) tokenizer, and RoBERTa was pre-trained with masked language modeling (MLM). Researchers masked a fraction of input tokens during MLM and trained the model to predict these masked tokens, utilizing dynamic masking to improve learning dynamics. Following the MLM phase, a regression head was added to RoBERTa to predict alloy properties. Researchers employed a linear learning rate scheduler to decrease the learning rate gradually.
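This workflow can be sketched with the Hugging Face transformers and datasets libraries as shown below. The base checkpoint, toy corpus, and training arguments are placeholders; the study's custom-trained BPE tokenizer and actual training schedule are not reproduced here.

```python
from datasets import Dataset
from transformers import (
    DataCollatorForLanguageModeling, RobertaForMaskedLM,
    RobertaTokenizerFast, Trainer, TrainingArguments,
)

# Stand-in tokenizer and model; the study trained its own BPE tokenizer on alloy texts.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Toy corpus of alloy descriptions (placeholders, not the paper's data).
texts = [
    "This alloy contains 20% Nb, 20% Mo, 20% Ta, 20% W and 20% Ti by atomic fraction.",
    "The alloy was tested at 1000 K and exhibits a body-centered cubic structure.",
]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking: a fresh subset of tokens is masked each time a batch is sampled.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="alloybert-mlm",
    num_train_epochs=1,             # illustrative value only
    per_device_train_batch_size=2,
    learning_rate=1e-4,
    lr_scheduler_type="linear",     # learning rate decreases linearly over training
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()
```

After this masked language modeling phase, the encoder weights would be reused with a regression head (as in the earlier sketch) and finetuned on the property targets.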
Model Performance Evaluation
The analysts evaluated the model's performance against a range of shallow learning algorithms, using MSE as the metric. Among the shallow models, gradient boosting achieved the lowest MSE of 0.02376 on the MPEA dataset, while random forest achieved the lowest MSE of 0.01459 on the RAYS dataset.
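For readers who want a concrete reference point for this kind of baseline, the scikit-learn sketch below fits gradient boosting and random forest regressors on synthetic stand-in data and reports their MSE; it does not reproduce the study's features or results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in features and targets; the study used tabular MPEA/RAYS features.
rng = np.random.default_rng(0)
X = rng.random((200, 8))
y = X @ rng.random(8) + 0.1 * rng.standard_normal(200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, reg in [("gradient boosting", GradientBoostingRegressor()),
                  ("random forest", RandomForestRegressor())]:
    reg.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, reg.predict(X_te))
    print(f"{name}: MSE = {mse:.5f}")
```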
To benchmark performance, the team also compared RoBERTa with a BERT encoder. The results indicated that more elaborate textual descriptions generally improved model accuracy. Specifically, the most detailed descriptions yielded the lowest MSE with both models, with RoBERTa achieving a minimum of 0.00015 for MPEA and BERT reaching 0.00042. However, this result was obtained with finetuning only for the most detailed description.
Performance on the RAYS dataset improved significantly with pretraining and finetuning, especially for the most detailed description, which achieved the lowest MSE of 0.00527 with BERT. Deviations from expected patterns, particularly with RoBERTa, suggest that the current pretraining could benefit from a broader corpus of alloy-related texts to enhance generalization and consistency. The high R² scores of 0.99 for MPEA and 0.84 for RAYS indicate that the model effectively captures the underlying patterns in the data.
Conclusion
To sum up, this work demonstrated the effectiveness of transformer models in predicting alloy properties from human-interpretable textual inputs. Although initial results showed unexpected MSE behavior as the amount of text information increased on the MPEA dataset, the most detailed descriptions, combined with custom-trained tokenizers and a pretraining-plus-finetuning strategy, ultimately achieved the lowest MSE of 0.00015. For the RAYS dataset, the most elaborate string descriptions likewise yielded the best results, with RoBERTa achieving a minimum MSE of 0.00611 and BERT reaching 0.00527.
The study also highlighted that the pretrain-and-finetune approach reduced MSE significantly compared with finetuning alone, underscoring the importance of comprehensive textual inputs and custom tokenizers. The high R² scores of 0.99 for MPEA and 0.84 for RAYS confirmed the strong predictive capabilities of AlloyBERT, suggesting that transformer models, when used with detailed textual inputs, can advance alloy property prediction.