In a paper published in the journal Scientific Reports, the authors proposed a novel task of translating between drug molecules and their corresponding indications, with the aim of revolutionizing drug discovery by enabling the generation of drugs that target specific diseases and, ultimately, better treatments for patients. The study evaluated nine variations of the T5 large language model (LLM) on this task using two public datasets sourced from ChEMBL and DrugBank.
The experiments showed initial success in applying LLMs to this translation task, identified limitations, and proposed directions for future improvement. By generating molecules from indications, or indications from molecules, the approach aims to streamline disease targeting and cut drug discovery costs, representing a significant advance in applying generative artificial intelligence (AI) to medicine.
Advancing Drug Discovery
Previous efforts in drug discovery have focused on automating processes to reduce costs and improve efficacy. LLMs such as generative pre-trained transformer 3 (GPT-3) and Large Language Model Meta AI (LLaMA) have shown promise across natural language processing (NLP) tasks, including translating between drug molecules and their indications.
This translation relies on textual representations such as the simplified molecular-input line-entry system (SMILES), which encodes a molecule's atoms, bonds, and rings as a single line of text. AI methods such as graph neural networks and generative models, together with advances in molecular representation, improve the efficiency of drug discovery and showcase AI's potential for designing drugs that act on complex biological processes.
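For readers unfamiliar with SMILES, the short sketch below illustrates the idea using the open-source RDKit toolkit; RDKit is assumed here purely for demonstration and is not part of the paper's method, which feeds SMILES strings to language models as plain text.

```python
# Minimal illustration of the SMILES representation (RDKit assumed for
# demonstration only; the paper treats SMILES as plain text input to LLMs).
from rdkit import Chem

# Aspirin as a SMILES string: atoms, bonds, and ring closures in one line of text.
aspirin_smiles = "CC(=O)Oc1ccccc1C(=O)O"

mol = Chem.MolFromSmiles(aspirin_smiles)  # returns None for invalid SMILES
if mol is not None:
    print("Valid molecule with", mol.GetNumAtoms(), "heavy atoms")
    # Canonicalization maps different spellings of the same molecule to one form.
    print("Canonical SMILES:", Chem.MolToSmiles(mol))
```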
Dataset Selection and Experimentation
The study leverages datasets from DrugBank and ChEMBL, chosen for their distinct representations of drug indications: DrugBank offers detailed descriptions of drug usage, while ChEMBL provides lists of medical conditions treated by each drug. With the data collected, experiments are conducted with the molecular T5 (MolT5) model, fine-tuning it to translate between drug indications and SMILES strings.
The dataset comprises pairs of drug indications and corresponding SMILES strings extracted from DrugBank and ChEMBL. Access to DrugBank's data required special permission, while ChEMBL data was openly available but had to be parsed from a local database installation. The models used in the experiments are based on the T5 architecture and pre-trained on both natural language text and molecular data. Initial experiments evaluate MolT5's baseline performance on the drug-to-indication and indication-to-drug tasks, followed by fine-tuning and by evaluation on subsets of the data.
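A rough sketch of what a single drug-to-indication inference pass might look like is shown below, assuming the publicly released MolT5 checkpoints on the Hugging Face Hub and the transformers library; the checkpoint name, input format, and generation settings are illustrative assumptions rather than the paper's actual pipeline.

```python
# Sketch of a drug-to-indication style inference pass with a T5-family model.
# The checkpoint name, input format, and generation settings are assumptions
# made for illustration; they are not taken from the paper's code.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "laituan245/molt5-small"  # publicly released MolT5 checkpoint (assumed)
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Drug-to-indication direction: the input is a SMILES string and the model
# generates free text describing the molecule (here, aspirin).
smiles = "CC(=O)Oc1ccccc1C(=O)O"
inputs = tokenizer(smiles, return_tensors="pt")

outputs = model.generate(**inputs, num_beams=5, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```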
The study also explores integrating a custom tokenizer with the MolT5 architecture to improve model understanding of SMILES strings. The custom tokenizer, adapted from previous work on transformers for SMILES strings, decomposes input into grammatically valid components.
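The paper's tokenizer code is not reproduced here; the snippet below is a minimal sketch of the common regex-based approach to SMILES tokenization, which keeps multi-character atoms (e.g., Cl, Br), bracketed atoms, bonds, and ring-closure digits as single, grammatically valid tokens.

```python
import re

# Regex-based SMILES tokenizer (a widely used pattern in the chemistry-NLP
# literature); this is an illustrative sketch, not the authors' tokenizer.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into grammatically valid units
    (e.g., 'Cl' stays one token instead of 'C' + 'l')."""
    return SMILES_PATTERN.findall(smiles)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1',
#  'C', '(', '=', 'O', ')', 'O']
```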
MolT5's pretraining with the custom tokenizer is conducted on the ZINC dataset, followed by evaluation on the DrugBank and ChEMBL datasets. The experiments encompass fine-tuning, evaluating subsets without fine-tuning, and evaluating the entire datasets, aiming to assess model performance under different conditions.
MolT5 Model Evaluation
The study employed evaluation metrics such as bilingual evaluation understudy (BLEU), recall-oriented understudy for gisting evaluation (ROUGE), and metric for evaluation of translation with explicit ordering (METEOR) to assess the performance of the MolT5 models in translating between drug indications and SMILES strings. These metrics provided quantitative measures of the quality and accuracy of the model outputs, aiding the comparison of different model configurations and training strategies.
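As a concrete illustration of these metrics, the sketch below scores a toy generated indication against a reference using the Hugging Face evaluate library; the example strings are invented for demonstration and are not outputs from the study.

```python
# Scoring a toy generated indication against a reference with BLEU, ROUGE,
# and METEOR via the Hugging Face `evaluate` library (illustrative only).
import evaluate

predictions = ["used for the treatment of pain and fever"]        # model output (toy)
references = ["used for the treatment of mild to moderate pain"]  # ground truth (toy)

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")  # downloads required NLTK data on first use

# BLEU supports multiple references per prediction, hence the nested list.
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
```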
The authors provided examples of inputs and model outputs for both tasks using the large MolT5 model and ChEMBL data. The results indicated that the model can produce valid molecules and meaningful indications, albeit with some misspellings attributed to the model's limited size.
The study observed that larger models outperformed smaller ones across all metrics, with the best results obtained on the 20% subsets of the DrugBank and ChEMBL data. Fine-tuning experiments yielded inferior results, possibly due to noise introduced during training on indications.
In the indication-to-drug experiments, larger models again exhibited superior performance, while fine-tuning on new data worsened results, which the study attributed to noise introduced during fine-tuning. The evaluation of MolT5 pre-trained with the custom tokenizer generally showed better drug-to-indication performance on DrugBank data, with mixed results observed for the indication-to-drug task.
Fine-tuning did not consistently affect performance, with some metrics showing improvements. Overall, the study underscored the importance of model size and training strategies in achieving optimal performance in drug discovery-related tasks.
Conclusion
To sum up, the study evaluated the performance of MolT5 models in translating between drug indications and SMILES strings using metrics such as BLEU, ROUGE, and METEOR. Results showcased the models' capability to produce valid molecules and meaningful indications, with larger models outperforming smaller ones. However, fine-tuning experiments yielded mixed results, emphasizing the importance of careful model selection and training strategies in drug discovery tasks.