Extracting knowledge from synthetic biology journals for machine learning (ML) applications is demanding and time-consuming. In a recent submission to the bioRxiv* server, researchers used GPT-4, a large language model for natural language processing (NLP), to automate information extraction from 176 publications. The extracted data, comprising 2037 instances, were uploaded to an online database. The study showed that a random forest (RF) model could predict fermentation titers for Yarrowia lipolytica with high accuracy (R² = 0.86) on unseen data. Additionally, transfer learning enabled predictions for nonconventional yeasts such as Rhodosporidium toruloides.
*Important notice: bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Synthetic biology often relies on trial and error due to the complexity of biological systems. AI can potentially enhance the design-build-test-learn (DBTL) cycles by leveraging ML to learn from published results. However, ML requires extensive experimental data, which makes knowledge mining from journal articles a cost-effective strategy. NLP can organize and extract information from published articles, and the release of GPT-4 makes it possible to extract bioprocess features and outcomes from articles to grow such databases. Even with limited data, ML and transfer learning algorithms can predict the performance of microbial cell factories.
The current study demonstrates the workflow involved in the use of GPT-4 for data extraction from articles focusing on Yarrowia lipolytica. Additionally, transfer learning techniques are employed to gain insights into Rhodosporidium toruloides. The extracted data is uploaded to an online database, facilitating AI applications in biomanufacturing design. The integration of GPT-4 with ML offers valuable insights for improving future AI applications in synthetic biology.
Methods
In this study, GPT-3.5 and GPT-4 were used for data extraction, which was performed semi-automatically using a standardized workflow. The analysis included feature variances, feature importance determination, and clustering. Data preprocessing and ML techniques were applied, including imputation, normalization, encoding, and correlation analysis. Transfer learning involved a pre-trained encoder and a random forest model. Various loss functions and stopping criteria were also used. Results were reported as means and standard deviations, and a one-tailed Student's t-test was performed.
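The preprocessing steps named above (imputation, normalization, encoding) can be illustrated with a minimal sketch. The summary does not say which variants the authors used, so mean imputation, min-max scaling, and one-hot encoding are assumptions here, and the feature names and values are hypothetical.

```python
from statistics import mean

def impute_mean(column):
    """Replace missing (None) entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

def min_max_normalize(column):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def one_hot(column):
    """Encode a categorical column as 0/1 indicator vectors."""
    categories = sorted(set(column))
    encoded = [[1 if v == c else 0 for c in categories] for v in column]
    return categories, encoded

# Hypothetical records (not from the study): temperature in deg C and carbon source.
temperature = [28.0, None, 30.0, 32.0]
carbon_source = ["glucose", "glycerol", "glucose", "xylose"]

temperature_filled = impute_mean(temperature)
temperature_scaled = min_max_normalize(temperature_filled)
categories, carbon_encoded = one_hot(carbon_source)
```

Literature-derived tables tend to have missing cells and mixed numeric and categorical columns, which is why steps of this kind usually come before model training.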
Results and discussion
ML features and dataset extraction from synthetic biology papers: Sustainable biomanufacturing relies on the development of synthetic biology tools, which requires iterative DBTL cycles. Mechanistic models struggle to simulate realistic microbial production, whereas ML can leverage previous knowledge. However, data-driven approaches need a substantial amount of experimental data. A database constructed from published papers can supply that data for ML applications, but extraction is difficult because reporting is sparse and inconsistent. To overcome this challenge, GPT-4 is employed to automate the extraction process.
GPT-4 for data mining: A standardized workflow is employed for GPT-4 data collection. Sections of scientific articles are separated into text files, and prompts are added depending on the content. GPT-4 condenses the information into tables, facilitating focused verification by human reviewers. Although human supervision is still necessary, the accuracy of GPT-4 is improving over time.
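The workflow described above (split an article into sections, attach a content-dependent prompt, and collect a table for human verification) might be sketched roughly as follows. The prompt wording, the section names, the `build_request` and `parse_table` helpers, and the pipe-delimited reply format are all hypothetical; the summary does not give the authors' actual prompts, and no model API is called here.

```python
# Hypothetical prompt templates keyed by section type (illustrative wording only).
PROMPTS = {
    "methods": "List the strain, carbon source, and fermentation conditions as a table.",
    "results": "List each reported product with its titer and units as a table.",
}
DEFAULT_PROMPT = "Summarize any bioprocess features and outcomes as a table."

def build_request(section_name, section_text):
    """Combine the section-specific instruction with the section text."""
    instruction = PROMPTS.get(section_name.lower(), DEFAULT_PROMPT)
    return f"{instruction}\n\n---\n{section_text.strip()}"

def parse_table(reply):
    """Parse a pipe-delimited table reply into row dictionaries for human review."""
    rows = []
    for line in reply.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip separator rows such as |---|---| and blank lines.
        if all(set(c) <= set("-: ") for c in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]

# Example with a made-up reply; a real reply would come from the language model.
request = build_request("Results", "Strain A produced lipids at 12 g/L.")
example_reply = "| product | titer |\n|---|---|\n| lipid | 12 g/L |"
extracted = parse_table(example_reply)
```

Keeping the reply in a simple tabular form is what makes the focused human verification step practical: a reviewer checks rows against the source text rather than rereading the whole article.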
A case study for data extraction: Using GPT-3.5 and GPT-4, data extraction was performed on Rhodosporidium fermentation articles. The extracted data achieved an accuracy of 74% with GPT-3.5, which increased to 89% with minor user discretion. Notably, the extracted data using GPT-4 showed no errors, showcasing the model's capabilities. A large amount of data was obtained from Rhodosporidium articles.
GPT-4 for Yarrowia lipolytica biomanufacturing database: Using the GPT-4 workflow, new data instances from Yarrowia papers were efficiently extracted. The extracted data was organized into features for titer prediction. Feature analysis revealed similar patterns between manually extracted and GPT-extracted data. GPT-4 demonstrated an ability to capture more distinctiveness in the data.
Predicting Y. lipolytica fermentation titer using GPT: A comprehensive Y. lipolytica bioproduction database was created. Various ML algorithms were compared, and RF achieved the best accuracy, with an R² of 0.86 on unseen test instances. The RF model also performed well across different product classes.
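For readers unfamiliar with the metric, the coefficient of determination (R²) compares a model's squared prediction errors on held-out instances with the errors of always predicting the mean. The sketch below uses the standard definition; the titer values are made up for illustration and are not from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up held-out titers (g/L) and model predictions, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]
score = r_squared(observed, predicted)
```

A value of 1 means perfect prediction, while 0 means the model does no better than the mean of the observed values, so 0.86 on unseen instances indicates that most of the variance in titer is captured.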
Prediction of nonmodel yeast factories using transfer learning: Transfer learning was employed to predict the performance of non-model yeast factories, such as R. toruloides, by leveraging knowledge from the Yarrowia dataset. An encoder-decoder neural network and an instance-based transfer learning approach were utilized. Feature reduction through encoding had limitations, while a random forest transfer learning method showed more accurate predictions for astaxanthin production. The current study highlighted the importance of genetic variations beyond the number of expressed genes in determining potential production.
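One common form of instance-based transfer with a random forest is to pool the large source dataset with the few available target instances and give the target instances a higher weight during training. The sketch below illustrates that idea with scikit-learn on synthetic data. It is an assumption about how such a method could look, not the authors' implementation; the data, the weight of 5.0, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
coefficients = np.array([3.0, 1.0, 0.0, 2.0, 0.5])

# Synthetic stand-in for a large source dataset (for example, a well-studied yeast).
X_source = rng.random((200, 5))
y_source = X_source @ coefficients + rng.normal(0, 0.1, 200)

# Synthetic stand-in for a small target dataset with a shifted response.
X_target = rng.random((10, 5))
y_target = X_target @ coefficients + 1.0 + rng.normal(0, 0.1, 10)

# Pool the instances and upweight the scarce target rows (weight is an assumption).
X_train = np.vstack([X_source, X_target])
y_train = np.concatenate([y_source, y_target])
weights = np.concatenate([np.ones(len(y_source)), np.full(len(y_target), 5.0)])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train, sample_weight=weights)

# Predict titers for new, unseen target-like conditions.
X_new = rng.random((3, 5))
predictions = model.predict(X_new)
```

The appeal of this kind of approach is that it needs no learned feature compression, which may be one reason an RF-based method held up better than the encoder-based route when target data were scarce.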
ML algorithms and future applications: GPT-4 can make predictions for common biomanufacturing products but has limitations beyond its database. Transfer learning is effective for exploring the unknown. The encoder-decoder structure gave unsatisfactory results, while the RF with the instance-transfer method showed reasonable generalization. Deep learning methods can improve tabular data representation. Multi-omics features and AI-driven data acquisition can enhance model explainability. AI can also support techno-economic analyses and computational strain design. The AI-extracted data was uploaded to ImpactDB for community contributions.
Conclusion
In conclusion, NLP and AI have the potential to revolutionize scientific research by automating knowledge mining. AI tools like GPT-4 can expedite data processing and automate knowledge extraction, which benefits ML applications and saves time on literature review and analysis. Generative AI holds promise for automating analysis, optimizing fermentations, predicting outcomes, and accelerating innovation in biology research.
Journal reference:
- Preliminary scientific report.
Xiao, Z., Li, W., Moon, H., Roell, G. W., Chen, Y., & Tang, Y. J. (2023). Generative artificial intelligence GPT-4 accelerates knowledge mining and machine learning for synthetic biology. bioRxiv. DOI: https://doi.org/10.1101/2023.06.14.544984, https://www.biorxiv.org/content/10.1101/2023.06.14.544984v1