In an article published in Scientific Reports, researchers from Iran proposed a novel workflow that uses an enhanced weighted average ensemble approach with error-correcting output code (ECOC) and cost-sensitive learning (CSL) techniques to produce high-resolution lithology logs from conventional well logs. They demonstrated that the developed workflow can accurately predict underground lithofacies and outperform commonly used machine learning (ML) algorithms. The research addresses the challenges of multiclass imbalanced data classification and scalability, which arise from the complex geological heterogeneities and the large volume of data.
Background
Lithology logs are graphical representations of the subsurface rock types encountered during drilling operations. They provide valuable information for geologists and engineers to evaluate and correlate different formations. Well logs are measurements of the physical properties of subsurface rocks, such as gamma ray, neutron porosity, density, sonic, and photoelectric factor. These properties reflect changes in the lithology, texture, and structure of the rocks, and can therefore be used to infer lithofacies, that is, rock units that share similar characteristics and depositional environments.
However, well logs are also affected by other factors, such as salinity, fluid content, diagenesis, fractures, and clay composition, which make the relationship between well logs and lithofacies complex and nonlinear. Moreover, the distribution of lithofacies can be highly imbalanced due to the natural heterogeneity of the subsurface, which poses a challenge for traditional ML algorithms that assume balanced data and optimize overall accuracy.
About the Research
The paper introduces a workflow that relies on an enhanced weighted average ensemble approach, which combines several baseline ML algorithms into a single model with better performance and stability. Support vector machine (SVM), random forest (RF), decision tree (DT), logistic regression (LR), and extreme gradient boosting (XGBoost) are some of the baseline algorithms widely used in lithofacies classification.
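As a rough illustration of the weighted averaging idea, the sketch below blends the class probabilities of two models and picks the most probable class. The model names, probabilities, and weights are made-up values, and the article does not describe how the paper's "enhanced" variant sets its weights, so this shows only the generic technique.

```python
from typing import Dict, List


def weighted_average_proba(
    probas: Dict[str, List[List[float]]],
    weights: Dict[str, float],
) -> List[List[float]]:
    """Blend per-model class probabilities using normalized model weights."""
    names = list(probas)
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("model weights must sum to a positive value")
    n_samples = len(probas[names[0]])
    n_classes = len(probas[names[0]][0])
    blended = [[0.0] * n_classes for _ in range(n_samples)]
    for name in names:
        w = weights[name] / total  # normalize so the weights sum to 1
        for i, row in enumerate(probas[name]):
            for k, p in enumerate(row):
                blended[i][k] += w * p
    return blended


def predict_labels(blended: List[List[float]], classes: List[str]) -> List[str]:
    """Return the class with the highest blended probability for each sample."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in blended]


# Hypothetical probabilities for two depth samples (illustrative values only).
classes = ["shale", "limestone", "pyritic limestone"]
probas = {
    "svm": [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]],
    "rf": [[0.4, 0.5, 0.1], [0.1, 0.6, 0.3]],
}
weights = {"svm": 0.75, "rf": 0.25}  # assumed, e.g. derived from validation scores

blended = weighted_average_proba(probas, weights)
labels = predict_labels(blended, classes)
```

In this toy case the RF model alone would label both samples as limestone, but the larger weight on the SVM model shifts the blended decision.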
This study addresses the challenge of multiclass imbalanced data by using two strategies: (1) decomposing the multiclass problem into binary subproblems using ECOC and (2) applying CSL to assign different weights and penalties to different classes according to their importance and rarity.
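The mechanics of both strategies can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the class weights follow the common "balanced" heuristic (total samples divided by the number of classes times the class count), and the codebook is a hypothetical one chosen so that any two codewords differ in at least three bits. The article does not give the paper's actual code matrix or cost scheme.

```python
from collections import Counter
from typing import Dict, List, Sequence


def balanced_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Cost-sensitive weights w_c = N / (K * n_c): rarer classes weigh more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}


def ecoc_decode(bits: Sequence[int], codebook: Dict[str, List[int]]) -> str:
    """Pick the class whose codeword is nearest to bits in Hamming distance."""
    def hamming(a: Sequence[int], b: Sequence[int]) -> int:
        return sum(x != y for x, y in zip(a, b))

    return min(codebook, key=lambda c: hamming(codebook[c], bits))


# Hypothetical codebook: one codeword per class, one bit per binary subproblem.
codebook = {
    "shale": [0, 0, 0, 0, 0],
    "limestone": [0, 1, 1, 0, 1],
    "pyritic limestone": [1, 0, 1, 1, 0],
}

# Imbalanced toy labels (illustrative counts only).
labels = ["shale"] * 8 + ["limestone"] * 3 + ["pyritic limestone"] * 1
class_weights = balanced_class_weights(labels)

# Outputs of the five binary classifiers for one sample; the last bit is
# wrong relative to the "pyritic limestone" codeword, yet decoding recovers it.
predicted_bits = [1, 0, 1, 1, 1]
decoded = ecoc_decode(predicted_bits, codebook)
```

Because the codewords are at least three bits apart, a single wrong binary classifier does not change the decoded class, which is the error-correcting property the method is named for.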
The authors used a dataset from a Middle Eastern oilfield, which consists of conventional well logs and lithology logs from five wells. They selected one well as a blind well and used the data from the other four wells to train and test the ML algorithms. The lithology logs are divided into seven classes: shale, limestone, argillaceous limestone, chalky limestone, cherty limestone, pyritic limestone, and shaly limestone. The dataset exhibits a significant imbalance among the classes, with shale being the most dominant and pyritic limestone being the rarest.
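Holding out a whole well, rather than random samples, can be sketched as follows. The well names, the column names, and the gamma ray values are hypothetical placeholders, not the paper's data.

```python
from collections import Counter
from typing import Dict, List, Tuple

Row = Dict[str, object]


def split_blind_well(rows: List[Row], blind_well: str) -> Tuple[List[Row], List[Row]]:
    """Separate the blind well's samples from those used for training and testing."""
    seen = [r for r in rows if r["well"] != blind_well]
    blind = [r for r in rows if r["well"] == blind_well]
    if not blind:
        raise ValueError(f"no samples found for blind well {blind_well!r}")
    return seen, blind


# Toy samples from five hypothetical wells (illustrative values only).
rows = [
    {"well": "W1", "gamma_ray": 85.0, "lithology": "shale"},
    {"well": "W1", "gamma_ray": 22.0, "lithology": "limestone"},
    {"well": "W2", "gamma_ray": 90.0, "lithology": "shale"},
    {"well": "W3", "gamma_ray": 30.0, "lithology": "pyritic limestone"},
    {"well": "W4", "gamma_ray": 88.0, "lithology": "shale"},
    {"well": "W5", "gamma_ray": 25.0, "lithology": "limestone"},
    {"well": "W5", "gamma_ray": 87.0, "lithology": "shale"},
]

seen, blind = split_blind_well(rows, blind_well="W5")

# Counting labels in the remaining wells exposes the class imbalance.
class_counts = Counter(r["lithology"] for r in seen)
```

Keeping the blind well entirely out of training means the final check measures how the model generalizes to a location it has never seen.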
Methodologies Used
Researchers used the following three-step workflow:
Data preparation: They checked for missing values and outliers, encoded categorical features, removed unnecessary columns, standardized the data, applied linear discriminant analysis (LDA) as a noise reduction technique, and split the dataset into train, test, and blind verification sets.
Model training: The hyperparameters of the baseline algorithms were tuned using grid search and cross-validation, and the tuned algorithms were then trained with different combinations of ECOC and CSL. A voting ensemble classifier and an enhanced weighted average ensemble classifier were also trained.
Model evaluation: Two metrics suitable for imbalanced multiclass data were used to assess the performance of the models: the mean F-measure and the mean Kappa statistic. The results of the models on both the test set and the blind set were compared to identify the workflow that achieves the best performance.
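The three steps above can be sketched with scikit-learn. This is a minimal illustration, not the authors' code: the data are synthetic, the parameter grid is arbitrary, a single cost-sensitive random forest stands in for the full set of models, and the mean F-measure is assumed to correspond to the macro-averaged F1 score.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for well-log features (NOT the paper's data).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=5, n_redundant=0,
    n_classes=4, n_clusters_per_class=1,
    weights=[0.6, 0.25, 0.1, 0.05], random_state=0,
)

# Step 1: data preparation. A stratified split keeps rare classes in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),             # standardize each feature
    ("lda", LinearDiscriminantAnalysis()),   # supervised projection of the features
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=0)),
])

# Step 2: model training with grid search and cross-validation.
search = GridSearchCV(
    pipeline,
    param_grid={"clf__n_estimators": [50, 100], "clf__max_depth": [None, 5]},
    scoring="f1_macro",
    cv=3,
)
search.fit(X_train, y_train)

# Step 3: model evaluation with metrics that account for class imbalance.
y_pred = search.predict(X_test)
mean_f = f1_score(y_test, y_pred, average="macro")
kappa = cohen_kappa_score(y_test, y_pred)
print(f"macro F1: {mean_f:.3f}, Cohen's kappa: {kappa:.3f}")
```

Placing the scaler and LDA inside the pipeline means they are refit on each cross-validation fold, so no information from the validation folds leaks into preprocessing.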
Research Findings
The results show that the enhanced weighted average ensemble of SVM and RF, coupled with ECOC and CSL, performed best among all the models, with a mean F-measure of 91.04% and a mean Kappa statistic of 84.50% on the blind set. Moreover, the ECOC and CSL strategies were effective in handling multiclass imbalanced data, as they improved the performance of both the baseline algorithms and the ensemble models.
Furthermore, the lithology log generated by the optimal workflow was reasonably similar to the original one and accurately identified the minority classes, such as shale and pyritic limestone, which are often neglected by other models. The study also provides a graphical comparison of the generated and original lithology logs, which demonstrates the robustness and reliability of the optimal workflow.
The study has several applications for the geo-energy industry, as it provides a reliable and automated solution for generating high-resolution lithology logs from conventional well logs. It can help characterize and evaluate subsurface reservoirs and can also be applied to other fields and domains that deal with multiclass imbalanced data, such as image processing, medical diagnosis, and fraud detection.
Conclusion
In summary, the paper presents a versatile and robust workflow for classifying lithofacies and generating high-resolution lithology logs from conventional well logs. This workflow can handle the challenges of multiclass imbalanced data and scalability. Moreover, its performance was validated on a real-world dataset from a Middle Eastern oilfield with complex and heterogeneous geological formations.
For future work, the researchers suggest using other ML techniques, such as deep learning and transfer learning, for lithofacies classification tasks, and considering other sources of uncertainty and noise in the data.