In a paper submitted to the arXiv* server, researchers proposed a new framework for automatically evaluating large language models (LLMs) on protocol planning in biology.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Conventional manual techniques in biology research are labor-intensive, time-consuming, and highly prone to human error. Robotic laboratory automation can improve scalability, reproducibility, and accuracy, enabling a faster transition from research to real-world application and accelerating scientific breakthroughs.
Automated laboratory protocol generation is a crucial step towards biology research automation, as these protocols can subsequently be converted into robot code. LLMs can formulate precise scientific protocols due to their substantial latent scientific knowledge. However, no effective method currently exists to evaluate the accuracy of generated protocols other than manual review, and the absence of established evaluation metrics is hindering progress in automating science.
Laboratory protocol evaluation is difficult because protocols are extremely sensitive to minute details, and even slight variations in instructions can result in substantially different outcomes. Additionally, the same protocol can be correctly described at different levels of granularity: the same sequencing library preparation technique, for instance, can be described in multiple paragraphs or in a single line. This variability in granularity makes it harder to evaluate the accuracy of LLM-generated protocols.
The proposed approach
In this study, researchers proposed BioPlanner, a new framework for automatically evaluating the ability of LLMs to plan and write biological experimental protocols. The key idea was to evaluate protocol generation against pseudocode rather than free-text instructions.
They also introduced BIOPROT, a dataset of biology protocols with corresponding pseudocode representations, which supports model evaluation on several tasks, such as full protocol generation. To assess the dataset's effectiveness, the researchers used GPT-4 to automatically design and execute a laboratory experiment. First, a natural language protocol was converted into pseudocode using an LLM; then, the ability of an LLM to reconstruct that pseudocode from a list of admissible pseudocode functions and a high-level description was evaluated.
In this study, researchers evaluated the ability of the GPT-3.5 and GPT-4 LLMs to plan experimental protocols. The utility of pseudocode representations of text was validated externally by generating precise novel protocols from retrieved pseudocode and running the generated protocols in a biological laboratory.
The automated approach was inspired by robotic planning, in which a controller agent is given a closed set of admissible actions. GPT-4 was used to automatically convert a written protocol into pseudocode using a protocol-specific set of pseudofunctions generated by the model.
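As a concrete, hypothetical illustration of what such a conversion could look like, a teacher model might define a small protocol-specific pseudofunction set and then re-express the written protocol as calls to those functions. All names, arguments, and values below are invented for illustration and are not taken from the paper.

```python
# Hypothetical pseudofunctions a teacher model might derive from a
# written DNA-extraction protocol (all names and values are illustrative):

def add_reagent(sample: str, reagent: str, volume_ul: float) -> None:
    """Add the given volume of a reagent to the sample."""

def incubate(sample: str, temperature_c: float, duration_min: float) -> None:
    """Hold the sample at a set temperature for a set time."""

def centrifuge(sample: str, speed_rpm: int, duration_min: float) -> None:
    """Spin the sample at the given speed for the given time."""

# The written protocol, re-expressed as step-by-step pseudocode:
add_reagent(sample="cell_pellet", reagent="lysis_buffer", volume_ul=200)
incubate(sample="cell_pellet", temperature_c=56, duration_min=30)
centrifuge(sample="cell_pellet", speed_rpm=13000, duration_min=5)
```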
A teacher model generated the admissible action set and the correct answer in the form of step-by-step pseudocode. This information was then used to assess a student model, which had to generate the experimental protocol from the list of admissible pseudocode functions and a short, high-level description of the protocol.
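A minimal sketch of this teacher-student loop is shown below, assuming a generic `llm(prompt)` completion helper (stubbed here). The actual system used GPT-4, and the paper's exact prompts differ.

```python
# Minimal teacher-student evaluation sketch; prompt wording is an assumption.

def llm(prompt: str) -> str:
    """Placeholder for an LLM API call (e.g., GPT-4); replace with a real client."""
    raise NotImplementedError

def teacher_convert(protocol_text: str) -> tuple[str, str]:
    """Teacher: derive an admissible pseudofunction set and gold-standard pseudocode."""
    functions = llm(f"Define pseudofunctions covering every step of:\n{protocol_text}")
    gold_steps = llm(
        f"Rewrite the protocol as calls to these functions:\n{functions}\n\n{protocol_text}"
    )
    return functions, gold_steps

def student_generate(title: str, description: str, functions: str) -> str:
    """Student: reconstruct the protocol from the function set and a short description."""
    return llm(
        f"Protocol: {title}\n{description}\n"
        f"Admissible pseudofunctions:\n{functions}\n"
        "Write the full protocol as an ordered sequence of function calls."
    )
```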
Thus, the proposed approach converts the scientific protocol writing task into a series of multiple-choice questions, each of which selects a pseudofunction from a provided set, and these can be assessed far more reliably than free-form natural language generation.
Experimental evaluation and findings
The LLMs' capabilities to reason about and generate scientific protocols were evaluated on several tasks using the BIOPROT dataset. Specifically, researchers evaluated the model's ability to correctly identify the pseudofunction corresponding to the next step of an experimental protocol, given partially completed pseudocode, an admissible set of pseudofunctions, and the protocol title.
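For instance, a next-step query could be assembled along these lines; the prompt wording here is an assumption, not the paper's exact template.

```python
# Illustrative construction of a next-step prediction query.

def next_step_prompt(title: str, functions: list[str], steps_so_far: list[str]) -> str:
    """Build a query asking which pseudofunction comes next in the protocol."""
    return (
        f"Protocol: {title}\n"
        "Admissible pseudofunctions:\n"
        + "\n".join(f"- {f}" for f in functions)
        + "\nSteps completed so far:\n"
        + "\n".join(steps_so_far)
        + "\nWhich pseudofunction (with arguments) is the next step?"
    )
```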
Researchers investigated the correctness of both the predicted function and its arguments. Additionally, they evaluated the model's ability to generate pseudocode from a protocol title and description plus an admissible set of pseudofunctions, again judged by the correctness of the predicted functions and their arguments.
Moreover, researchers evaluated the LLMs' ability to identify, from a set of pseudofunctions and a protocol title and description, the functions required to execute the protocol. In next-step prediction, GPT-4 consistently outperformed GPT-3.5 at choosing the correct next step, while GPT-3.5 performed better at predicting function arguments.
In protocol generation, GPT-4 significantly outperformed GPT-3.5. However, both models demonstrated similar recall and precision over the functions used, indicating a comparable ability to select the correct functions, with GPT-4 better at placing them in the right order.
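One plausible way to compute such scores (the paper's exact metric definitions may differ) is precision and recall over the multiset of predicted function names, plus a positional accuracy that rewards correct ordering:

```python
from collections import Counter

def function_usage_scores(predicted: list[str], gold: list[str]) -> dict[str, float]:
    """Score a predicted sequence of function names against the gold sequence."""
    pred_counts, gold_counts = Counter(predicted), Counter(gold)
    overlap = sum((pred_counts & gold_counts).values())  # shared calls, ignoring order
    return {
        "precision": overlap / max(len(predicted), 1),
        "recall": overlap / max(len(gold), 1),
        # Fraction of positions where the predicted call matches the gold call:
        "order_accuracy": sum(p == g for p, g in zip(predicted, gold)) / max(len(gold), 1),
    }

# Example: the model used the right functions but swapped two steps.
print(function_usage_scores(
    predicted=["add_reagent", "centrifuge", "incubate"],
    gold=["add_reagent", "incubate", "centrifuge"],
))  # precision=1.0, recall=1.0, order_accuracy≈0.33
```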
Although GPT-4 outperformed GPT-3.5 in function retrieval, the overall results of both LLMs were generally unsatisfactory. In the real-world validation experiment, which assessed the feasibility of using BIOPROT to generate accurate novel protocols, the LLM was prompted to identify relevant pseudofunctions from other protocols and generate accurate pseudocode.
For this experiment, researchers created a GPT-4-based, Toolformer-like chain-of-thought agent with access to a tool for searching protocols in the BIOPROT database. The agent was first prompted to retrieve protocols relevant to the new target protocol. Researchers then extracted the pseudofunctions from the retrieved protocols and prompted the agent to generate a new protocol using them.
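A rough sketch of such a retrieve-and-compose loop is shown below, assuming hypothetical `llm` and `search_bioprot` helpers (both stubbed); the paper's actual tool interface and prompts are not reproduced here.

```python
# Sketch of a Toolformer-style retrieval agent; helper names and
# prompt wording are assumptions, not the paper's implementation.

def llm(prompt: str) -> str:
    """Placeholder for a GPT-4 chat-completion call."""
    raise NotImplementedError

def search_bioprot(query: str) -> list[dict]:
    """Placeholder for the protocol-search tool; each result carries its pseudofunctions."""
    raise NotImplementedError

def generate_novel_protocol(goal: str) -> str:
    # 1. Ask the model for search queries relevant to the target protocol.
    queries = llm(f"List short search queries for protocols relevant to: {goal}")
    # 2. Retrieve protocols and pool their pseudofunctions.
    retrieved = [p for q in queries.splitlines() if q.strip() for p in search_bioprot(q)]
    functions = [fn for protocol in retrieved for fn in protocol["pseudofunctions"]]
    # 3. Prompt the model to compose a new protocol from those functions only.
    return llm(
        f"Goal: {goal}\nAdmissible pseudofunctions:\n" + "\n".join(functions)
        + "\nWrite pseudocode for a new protocol using only these functions."
    )
```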
Two experiments were designed using this setup: culturing Symbiodinium, extracting its deoxyribonucleic acid (DNA), and running the DNA on an agarose gel; and culturing a single Escherichia coli colony overnight and preparing a glycerol stock from the suspension.
The real-world validation showed that the two new protocols generated by the LLM using pseudofunctions from the BIOPROT database were accurate and detailed enough for a competent laboratory scientist to follow, and both were successfully executed in the laboratory.
Journal reference:
- Preliminary scientific report.
O’Donoghue, O., Shtedritski, A., Ginger, J., Abboud, R., Ghareeb, A. E., Booth, J., & Rodriques, S. G. (2023). BioPlanner: Automatic Evaluation of LLMs on Protocol Planning in Biology. arXiv. https://doi.org/10.48550/arXiv.2310.10632, https://arxiv.org/abs/2310.10632