In a paper published in the journal Scientific Reports, the authors proposed a novel task of translating between drug molecules and their corresponding indications, with the aim of revolutionizing drug discovery by enabling the generation of drugs that target specific diseases and, ultimately, better treatments for patients. The study evaluated nine variations of the T5 large language model (LLM) on this task using two public datasets sourced from ChEMBL and DrugBank.
The experiments showed initial success in applying LLMs to this translation task, identified limitations, and proposed directions for future improvement. By generating molecules from indications, or indications from molecules, the approach aims to streamline disease targeting and cut drug discovery costs, representing a significant advance in applying generative artificial intelligence (AI) to medicine.
Advancing Drug Discovery
Previous efforts in drug discovery have focused on automating processes to reduce costs and improve efficacy. LLMs such as generative pre-trained transformer 3 (GPT-3) and Large Language Model Meta AI (LLaMA) have shown promise across natural language processing (NLP) tasks, including translating between drug molecules and their indications.
This translation relies on textual representations such as the simplified molecular-input line-entry system (SMILES), which encodes a molecule's atoms, bonds, and rings as a single line of text. AI methods such as graph neural networks and generative models, together with advances in molecular representation, improve the efficiency of drug discovery and showcase AI's potential for designing drugs that act on complex biological processes.
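For readers unfamiliar with SMILES, the short sketch below illustrates the idea using the open-source RDKit toolkit; RDKit is assumed here purely for demonstration and is not part of the paper's method, which feeds SMILES strings to language models as plain text.

```python
# Minimal illustration of the SMILES representation (RDKit assumed for
# demonstration only; the paper treats SMILES as plain text input to LLMs).
from rdkit import Chem

# Aspirin as a SMILES string: atoms, bonds, and ring closures in one line of text.
aspirin_smiles = "CC(=O)Oc1ccccc1C(=O)O"

mol = Chem.MolFromSmiles(aspirin_smiles)  # returns None for invalid SMILES
if mol is not None:
    print("Valid molecule with", mol.GetNumAtoms(), "heavy atoms")
    # Canonicalization maps different spellings of the same molecule to one form.
    print("Canonical SMILES:", Chem.MolToSmiles(mol))
```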
Dataset Selection and Experimentation
The study leverages datasets from DrugBank and ChEMBL, chosen for their distinct representations of drug indications: DrugBank offers detailed descriptions of drug usage, while ChEMBL provides lists of medical conditions treated by each drug. With the data collected, experiments are conducted with the molecular T5 (MolT5) model, fine-tuning it to translate between drug indications and SMILES strings.
The dataset comprises pairs of drug indications and corresponding SMILES strings extracted from DrugBank and ChEMBL. Access to DrugBank's data required special permission, while ChEMBL data was openly available but had to be parsed from a local database installation. The models used in the experiments are based on the T5 architecture and pre-trained on both natural language text and molecular data. Initial experiments evaluate MolT5's baseline performance on the drug-to-indication and indication-to-drug tasks, followed by fine-tuning and by evaluation on subsets of the data.
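A rough sketch of what a single drug-to-indication inference pass might look like is shown below, assuming the publicly released MolT5 checkpoints on the Hugging Face Hub and the transformers library; the checkpoint name, input format, and generation settings are illustrative assumptions rather than the paper's actual pipeline.

```python
# Sketch of a drug-to-indication style inference pass with a T5-family model.
# The checkpoint name, input format, and generation settings are assumptions
# made for illustration; they are not taken from the paper's code.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "laituan245/molt5-small"  # publicly released MolT5 checkpoint (assumed)
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Drug-to-indication direction: the input is a SMILES string and the model
# generates free text describing the molecule (here, aspirin).
smiles = "CC(=O)Oc1ccccc1C(=O)O"
inputs = tokenizer(smiles, return_tensors="pt")

outputs = model.generate(**inputs, num_beams=5, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```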
The study also explores integrating a custom tokenizer with the MolT5 architecture to improve model understanding of SMILES strings. The custom tokenizer, adapted from previous work on transformers for SMILES strings, decomposes input into grammatically valid components.
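The paper's tokenizer code is not reproduced here; the snippet below is a minimal sketch of the common regex-based approach to SMILES tokenization, which keeps multi-character atoms (e.g., Cl, Br), bracketed atoms, bonds, and ring-closure digits as single, grammatically valid tokens.

```python
import re

# Regex-based SMILES tokenizer (a widely used pattern in the chemistry-NLP
# literature); this is an illustrative sketch, not the authors' tokenizer.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into grammatically valid units
    (e.g., 'Cl' stays one token instead of 'C' + 'l')."""
    return SMILES_PATTERN.findall(smiles)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1',
#  'C', '(', '=', 'O', ')', 'O']
```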
MolT5's pretraining with the custom tokenizer is conducted on the ZINC dataset, followed by evaluation on the DrugBank and ChEMBL datasets. The experiments encompass fine-tuning, evaluating subsets without fine-tuning, and evaluating the entire datasets, aiming to assess model performance under different conditions.
MolT5 Model Evaluation
The study employed evaluation metrics such as bilingual evaluation understudy (BLEU), recall-oriented understudy for gisting evaluation (ROUGE), and metric for evaluation of translation with explicit ordering (METEOR) to assess the performance of the MolT5 models in translating between drug indications and SMILES strings. These metrics provided quantitative measures of the quality and accuracy of the model outputs, aiding the comparison of different model configurations and training strategies.
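As a concrete illustration of these metrics, the sketch below scores a toy generated indication against a reference using the Hugging Face evaluate library; the example strings are invented for demonstration and are not outputs from the study.

```python
# Scoring a toy generated indication against a reference with BLEU, ROUGE,
# and METEOR via the Hugging Face `evaluate` library (illustrative only).
import evaluate

predictions = ["used for the treatment of pain and fever"]        # model output (toy)
references = ["used for the treatment of mild to moderate pain"]  # ground truth (toy)

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")  # downloads required NLTK data on first use

# BLEU supports multiple references per prediction, hence the nested list.
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
```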
The authors provided examples of inputs and model outputs for both tasks using the large MolT5 model and ChEMBL data. The results indicated that the model can produce valid molecules and meaningful indications, albeit with some misspellings attributed to the model's limited size.
The study observed that larger models outperformed smaller ones across all metrics, with the best results obtained on the 20% subsets of the DrugBank and ChEMBL data. Fine-tuning experiments yielded inferior results, possibly due to noise introduced during training on indications.
In the indication-to-drug experiments, larger models again exhibited superior performance, while fine-tuning on new data worsened results, which the study attributed to noise introduced during fine-tuning. The evaluation of MolT5 pre-trained with the custom tokenizer generally showed better drug-to-indication performance on DrugBank data, with mixed results observed for the indication-to-drug task.
Fine-tuning did not consistently affect performance, with some metrics showing improvements. Overall, the study underscored the importance of model size and training strategies in achieving optimal performance in drug discovery-related tasks.
Conclusion
To sum up, the study evaluated the performance of MolT5 models in translating between drug indications and SMILES strings using metrics such as BLEU, ROUGE, and METEOR. Results showcased the models' capability to produce valid molecules and meaningful indications, with larger models outperforming smaller ones. However, fine-tuning experiments yielded mixed results, emphasizing the importance of careful model selection and training strategies in drug discovery tasks.