Unleashing AI's Potential: GPT-4 Empowers Knowledge Extraction in Synthetic Biology

Extracting knowledge from journal articles in synthetic biology for machine learning (ML) applications is demanding and time-consuming. In a recent submission to the bioRxiv* server, researchers used GPT-4, a large language model, to automate information extraction from 176 publications. The extracted data, comprising 2037 instances, were uploaded to an online database. The study showed that a random forest (RF) model could predict fermentation titers for Yarrowia lipolytica with high accuracy (R² = 0.86) on unseen data. In addition, transfer learning enabled predictions for nonconventional yeasts such as Rhodosporidium toruloides.

Study: Unleashing AI's Potential: GPT-4 Empowers Knowledge Extraction in Synthetic Biology. Image credit: TeeStocker/Shutterstock

*Important notice: bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

Background

Synthetic biology often relies on trial and error because biological systems are complex. AI could improve design-build-test-learn (DBTL) cycles by using ML to learn from published results. However, ML requires extensive experimental data, which makes knowledge mining from journal articles a cost-effective strategy. NLP can organize and extract information from published articles, and the release of GPT-4 makes it possible to extract bioprocess features and outcomes from articles to grow a database. Even with limited data, ML and transfer learning algorithms can predict the performance of microbial cell factories.

The current study demonstrates a workflow that uses GPT-4 to extract data from articles on Yarrowia lipolytica. Transfer learning techniques are then used to gain insights into Rhodosporidium toruloides. The extracted data is uploaded to an online database, facilitating AI applications in biomanufacturing design. Integrating GPT-4 with ML offers insights for improving future AI applications in synthetic biology.

Methods

In this study, GPT-3.5 and GPT-4 were used for data extraction, which was performed semi-automatically using a standardized workflow. The analysis covered feature variances, feature importance, and clustering. Data preprocessing included imputation, normalization, encoding, and correlation analysis. Transfer learning involved a pre-trained encoder and a random forest model, and various loss functions and stopping criteria were used. Results were summarized as means and standard deviations, and a one-tailed Student's t-test was performed.
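The preprocessing steps named above (imputation, normalization, encoding) can be illustrated with a minimal sketch. The article does not say which variants the study used, so mean imputation, min-max scaling, and one-hot encoding are assumptions here, and the function names and sample records are hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Scale values linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """Encode categorical labels as 0/1 indicator vectors (sorted category order)."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows

# Hypothetical records: fermentation temperature (deg C) and carbon source.
temperatures = [28.0, None, 30.0, 32.0]
carbon_sources = ["glucose", "glycerol", "glucose", "xylose"]

filled = impute_mean(temperatures)        # missing value replaced by the mean
scaled = min_max_normalize(filled)        # numeric feature scaled to [0, 1]
categories, encoded = one_hot(carbon_sources)
```

Literature-derived datasets tend to have many missing entries because papers report conditions inconsistently, which is why an imputation step usually comes before scaling.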

Results and discussion

ML features and dataset extraction from synthetic biology papers: Sustainable biomanufacturing relies on synthetic biology tools, whose development requires iterative design-build-test-learn (DBTL) cycles. Mechanistic models struggle to simulate realistic microbial production, while ML can leverage previous knowledge. However, data-driven approaches need a substantial amount of experimental data, which a database constructed from published papers can supply. Building such a database is difficult because reporting in the literature is sparse and inconsistent, so GPT-4 is employed to automate the extraction process.

GPT-4 for data mining: A standardized workflow is employed for GPT-4 data collection. Sections of scientific articles are separated into text files, and prompts are added depending on the content. GPT-4 condenses the information into tables, facilitating focused verification by human reviewers. Although human supervision is still necessary, the accuracy of GPT-4 is improving over time.
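The section-by-section prompting described above might look roughly like the following sketch. The prompt wording, section names, and function names are all hypothetical and are not taken from the study; the sketch only assembles the text that would be sent to the model and does not call any API.

```python
# Hypothetical prompt templates keyed by article section; the wording is
# illustrative and not taken from the study.
PROMPTS = {
    "methods": "List the strain, carbon source, temperature, and reactor type as a table.",
    "results": "List each reported product with its titer and units as a table.",
}
DEFAULT_PROMPT = "Summarize any fermentation conditions and outcomes as a table."

def build_request(section_name, section_text):
    """Pair a section's text with the prompt chosen for that section type."""
    prompt = PROMPTS.get(section_name.lower(), DEFAULT_PROMPT)
    return f"{prompt}\n\n---\n{section_text.strip()}"

def build_requests(sections):
    """Build one request per non-empty section, preserving article order."""
    return [
        (name, build_request(name, text))
        for name, text in sections.items()
        if text.strip()
    ]

# Hypothetical article split into sections (one text file each in practice).
sections = {
    "Methods": "Cells were grown at 30 C on glucose in shake flasks.",
    "Results": "The engineered strain produced 1.2 g/L of product.",
    "Acknowledgements": "   ",
}
prompt_requests = build_requests(sections)
```

Asking for tabular output in every prompt keeps the model's answers in a shape that a human reviewer can check against the source text quickly.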

A case study for data extraction: Data extraction was performed on Rhodosporidium fermentation articles using GPT-3.5 and GPT-4. GPT-3.5 achieved an accuracy of 74%, which increased to 89% with minor user discretion. Notably, the data extracted with GPT-4 showed no errors, showcasing the model's capabilities. A large amount of data was obtained from the Rhodosporidium articles.
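The article does not describe how extraction accuracy was scored. One plausible way, shown here purely as an assumption, is to compare each extracted field with a human-verified reference and report the fraction that match. The field names and values below are hypothetical.

```python
def field_accuracy(extracted, reference):
    """Fraction of reference fields whose extracted value matches exactly."""
    total = 0
    correct = 0
    for ref_row, ext_row in zip(reference, extracted):
        for field, ref_value in ref_row.items():
            total += 1
            if ext_row.get(field) == ref_value:
                correct += 1
    return correct / total if total else 0.0

# Hypothetical human-verified rows and the corresponding model output.
reference = [
    {"strain": "IFO0880", "carbon": "glucose", "temp_c": 30, "titer_g_l": 1.2},
    {"strain": "IFO0880", "carbon": "xylose", "temp_c": 30, "titer_g_l": 0.8},
]
extracted = [
    {"strain": "IFO0880", "carbon": "glucose", "temp_c": 30, "titer_g_l": 1.2},
    {"strain": "IFO0880", "carbon": "glucose", "temp_c": 30, "titer_g_l": 0.8},
]

accuracy = field_accuracy(extracted, reference)  # 7 of 8 fields match
```

Exact matching is strict; a real comparison would probably need tolerance for unit conversions and synonyms, which is one place where user discretion could enter.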

GPT-4 for Yarrowia lipolytica biomanufacturing database: Using the GPT-4 workflow, new data instances were extracted efficiently from Yarrowia papers and organized into features for titer prediction. Feature analysis revealed similar patterns in the manually extracted and GPT-extracted data, and the GPT-4 data appeared to capture more distinct variation.

Predicting Y. lipolytica fermentation titer using GPT: A comprehensive Y. lipolytica bioproduction database was created. Various ML algorithms were compared, and RF achieved the best accuracy, with an R² of 0.86 on unseen test instances. The RF model performed well across different product classes.
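For readers unfamiliar with the metric, R² (the coefficient of determination) compares a model's squared prediction error with the variance of the observed values, so a value of 1 means perfect prediction and 0 means no better than predicting the mean. The sketch below uses the standard definition; the titer values are made up for illustration and are not from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1 - ss_res / ss_tot

# Made-up observed and predicted titers (g/L) for held-out test instances.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]

score = r_squared(observed, predicted)  # SS_res = 1.0, SS_tot = 5.0
```

Computing the score only on instances the model never saw during training is what makes a figure like 0.86 meaningful as a measure of generalization.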

Prediction of non-model yeast factories using transfer learning: Transfer learning was employed to predict the performance of non-model yeast factories, such as R. toruloides, by leveraging knowledge from the Yarrowia dataset. An encoder-decoder neural network and an instance-based transfer learning approach were utilized. Feature reduction through encoding had limitations, while a random forest transfer learning method gave more accurate predictions for astaxanthin production. The study also highlighted the importance of genetic variations beyond the number of expressed genes in determining potential production.
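The idea behind instance-based transfer can be sketched as follows: instances from the data-rich source organism are pooled with the few instances available for the target organism, and the target instances are given more weight. The article does not give the study's weighting scheme, so the weight of 3.0 is an assumption, and a weighted nearest-neighbor regressor stands in for the random forest to keep the sketch dependency-free. All names and numbers are hypothetical.

```python
import math

def pool_instances(source, target, target_weight=3.0):
    """Combine source and target instances, up-weighting the target domain."""
    pooled = [(x, y, 1.0) for x, y in source]
    pooled += [(x, y, target_weight) for x, y in target]
    return pooled

def weighted_knn_predict(query, instances, k=2):
    """Predict as the weighted average target of the k nearest instances.

    Each instance is (features, target, weight); the weight encodes how
    much the instance is trusted for the target domain.
    """
    nearest = sorted(instances, key=lambda inst: math.dist(query, inst[0]))[:k]
    total_weight = sum(w for _, _, w in nearest)
    return sum(y * w for _, y, w in nearest) / total_weight

# Made-up one-feature data: (features, titer). Source = data-rich organism,
# target = the non-model organism with a single measured instance.
source = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 3.0)]
target = [((1.0,), 4.0)]

source_only = weighted_knn_predict((1.0,), pool_instances(source, []))
transferred = weighted_knn_predict((1.0,), pool_instances(source, target))
```

In this toy case the prediction shifts toward the target organism's own measurement once its instance joins the pool, while the source data still contributes where target data is absent.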

ML algorithms and future applications: GPT-4 can make predictions for common biomanufacturing products but has limitations beyond its database. Transfer learning is effective for exploring the unknown. The encoder-decoder structure gave unsatisfactory results, while RF with the instance-transfer method showed reasonable generalization. Deep learning methods can improve tabular data representation, and multi-omics features and AI-driven data acquisition can enhance model explainability. AI can also support techno-economic analyses and computational strain design. The AI-extracted data was uploaded to ImpactDB for community contributions.

Conclusion

In conclusion, NLP and AI have the potential to change scientific research by automating knowledge mining. Tools like GPT-4 can automate knowledge extraction from the literature, which supplies data for ML applications and saves time on literature review and analysis. Generative AI holds promise for automating analysis, optimizing fermentations, predicting outcomes, and accelerating innovation in biology research.

Written by

Dr. Sampath Lonka

Dr. Sampath Lonka is a scientific writer based in Bangalore, India, with a strong academic background in Mathematics and extensive experience in content writing. He has a Ph.D. in Mathematics from the University of Hyderabad and is deeply passionate about teaching, writing, and research. Sampath enjoys teaching Mathematics, Statistics, and AI to both undergraduate and postgraduate students. What sets him apart is his unique approach to teaching Mathematics through programming, making the subject more engaging and practical for students.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Lonka, Sampath. (2023, July 13). Unleashing AI's Potential: GPT-4 Empowers Knowledge Extraction in Synthetic Biology. AZoAi. Retrieved on September 18, 2024 from https://www.azoai.com/news/20230713/Unleashing-AIs-Potential-GPT-4-Empowers-Knowledge-Extraction-in-Synthetic-Biology.aspx.

  • MLA

    Lonka, Sampath. "Unleashing AI's Potential: GPT-4 Empowers Knowledge Extraction in Synthetic Biology". AZoAi. 18 September 2024. <https://www.azoai.com/news/20230713/Unleashing-AIs-Potential-GPT-4-Empowers-Knowledge-Extraction-in-Synthetic-Biology.aspx>.

  • Chicago

    Lonka, Sampath. "Unleashing AI's Potential: GPT-4 Empowers Knowledge Extraction in Synthetic Biology". AZoAi. https://www.azoai.com/news/20230713/Unleashing-AIs-Potential-GPT-4-Empowers-Knowledge-Extraction-in-Synthetic-Biology.aspx. (accessed September 18, 2024).

  • Harvard

    Lonka, Sampath. 2023. Unleashing AI's Potential: GPT-4 Empowers Knowledge Extraction in Synthetic Biology. AZoAi, viewed 18 September 2024, https://www.azoai.com/news/20230713/Unleashing-AIs-Potential-GPT-4-Empowers-Knowledge-Extraction-in-Synthetic-Biology.aspx.
