In a paper published in the journal PLOS ONE, researchers proposed LGN, a graph neural network (GNN)-based fusion model for predicting protein-ligand binding affinity. Recognizing the limitations of existing methods that treated protein and ligand structures uniformly, LGN integrated ligand feature extraction, capturing local and global features within the complex.
By combining interaction fingerprints with ligand-based features, LGN outperformed models relying solely on complex graph features on the PDBbind 2016 core set. The study supported the rationality and generalizability of LGN through comprehensive experiments and comparisons with state-of-the-art methods. Additionally, the researchers employed ensemble learning techniques to enhance robustness and to address the effect of data similarity.
Related Work
The prediction of binding affinity between proteins and ligands has long been a central focus in virtual drug discovery. Traditional methods, reliant on molecular force fields and manual feature engineering, have given way to more contemporary end-to-end machine learning (ML) algorithms. These algorithms, evolving from random forests and support vector machines to deep neural networks, leverage molecular fingerprints, protein sequences, and crystal structures for protein-ligand binding affinity prediction. Despite these advancements, existing models often pay too little attention to data heterogeneity and to the significant volume imbalance between proteins and ligands.
Graph-Based Affinity Prediction Study
Because three-dimensional crystal structures describe proteins and ligands accurately, the researchers focused on efficiently converting these structures into graph form for use with GNNs. The PDBbind database, specifically PDBbind v2016, served as a curated source of protein-ligand complexes with experimentally determined binding affinities.
The dataset includes a 'general set' comprising all structures and a 'refined set' representing a high-quality subset. The 'core set' from the Comparative Assessment of Scoring Functions 2016 (CASF-2016) benchmark, consisting of 285 diverse protein-ligand complexes, was used as the test set to evaluate model performance. Additionally, to check the validity of the results, the more recent PDBbind v2020 was employed in supplementary analyses.
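The article does not say which metrics were used on the core set. As a minimal sketch, assuming the Pearson correlation coefficient and root-mean-square error that are commonly reported for binding affinity benchmarks, the evaluation step could look like the following. The function names and the affinity values are made up for illustration.

```python
from math import sqrt
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error between measured and predicted affinities."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))


def pearson_r(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / sqrt(var_t * var_p)


# Toy values only; these are NOT results from the study.
measured = [4.0, 6.5, 7.2, 9.1]
predicted = [4.5, 6.0, 7.0, 8.6]
print(f"RMSE = {rmse(measured, predicted):.3f}")
print(f"Pearson R = {pearson_r(measured, predicted):.3f}")
```

A lower RMSE and a Pearson R closer to 1 would both indicate better agreement with the experimental affinities.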
The fusion model was designed to capture diverse information from protein-ligand complexes. Using two molecular graphs, the complex graph and the ligand graph, allowed the extraction of complementary features. The complex graph branch employed attention mechanisms and gated recurrent units (GRUs), while the ligand graph branch used the graph isomorphism network (GIN) architecture. The resulting complex and ligand features were then combined through a fusion framework, enhancing the overall predictive capability of the model.
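To make the ligand branch and the fusion step concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes a single GIN-style update with one linear layer plus ReLU in place of a full MLP, a sum readout, and fusion by simple concatenation. The complex-graph vector is a placeholder standing in for the attention and GRU branch, and all names and numbers are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def linear(v: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Apply a dense layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def gin_layer(features: Dict[int, Vector], neighbors: Dict[int, List[int]],
              weights: List[Vector], bias: Vector,
              eps: float = 0.0) -> Dict[int, Vector]:
    """GIN-style update: h_v' = f((1 + eps) * h_v + sum of neighbor features)."""
    updated = {}
    for node, h in features.items():
        agg = [(1.0 + eps) * x for x in h]
        for nb in neighbors[node]:
            agg = [a + x for a, x in zip(agg, features[nb])]
        updated[node] = relu(linear(agg, weights, bias))
    return updated


def sum_readout(features: Dict[int, Vector]) -> Vector:
    """Collapse node features into one graph-level vector by summation."""
    dim = len(next(iter(features.values())))
    return [sum(h[i] for h in features.values()) for i in range(dim)]


def fuse(complex_vec: Vector, ligand_vec: Vector) -> Vector:
    """Fusion by concatenation (an assumption; other schemes are possible)."""
    return complex_vec + ligand_vec


# Toy ligand: a three-atom chain 0-1-2 with two-dimensional atom features.
atom_features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0]}
bonds = {0: [1], 1: [0, 2], 2: [1]}
identity = [[1.0, 0.0], [0.0, 1.0]]

ligand_vec = sum_readout(gin_layer(atom_features, bonds, identity, [0.0, 0.0]))
complex_vec = [0.5, -0.2]  # placeholder for the complex-graph branch output
fused = fuse(complex_vec, ligand_vec)
print(fused)
```

In a trained model the fused vector would feed a regression head that outputs the predicted affinity; the weights here are fixed to the identity only to keep the arithmetic easy to follow.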
Molecular fingerprints, including simple ligand–receptor interaction descriptor (SIFP), extended connectivity interaction features (ECIF), and circular fingerprints (CFP), were incorporated to provide additional information. Based on biochemical theories, these descriptors aimed to capture specific properties of ligands and protein-ligand interactions. Despite the sparse nature of fingerprints, the GNN’s embeddings offered a condensed, information-rich alternative.
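The sparsity mentioned above is easy to see in a toy interaction fingerprint. The sketch below counts protein-ligand atom-type pairs that lie within a distance cutoff, which is the general idea behind ECIF-style descriptors. It is a simplification: element symbols stand in for the richer atom typing of real fingerprints, the 6.0 Å cutoff is an assumed parameter, and the coordinates are invented.

```python
from collections import Counter
from math import dist
from typing import List, Tuple

Atom = Tuple[str, Tuple[float, float, float]]  # (atom type, xyz in angstroms)


def pair_count_fingerprint(protein: List[Atom], ligand: List[Atom],
                           cutoff: float = 6.0) -> Counter:
    """Count (protein type, ligand type) pairs closer than `cutoff`."""
    counts: Counter = Counter()
    for p_type, p_xyz in protein:
        for l_type, l_xyz in ligand:
            if dist(p_xyz, l_xyz) <= cutoff:
                counts[(p_type, l_type)] += 1
    return counts


# Invented coordinates for illustration only.
protein_atoms = [("N", (0.0, 0.0, 0.0)),
                 ("O", (12.0, 0.0, 0.0)),
                 ("C", (3.0, 0.0, 0.0))]
ligand_atoms = [("C", (1.0, 0.0, 0.0)),
                ("O", (4.0, 0.0, 0.0))]

fp = pair_count_fingerprint(protein_atoms, ligand_atoms)

# Lay the counts out over a fixed vocabulary of type pairs.
types = ("C", "N", "O")
vocabulary = [(p, l) for p in types for l in types]
vector = [fp[pair] for pair in vocabulary]
print(vector)
```

Even in this tiny example most entries of the vector are zero, and the effect grows as the vocabulary of atom types expands. A learned GNN embedding, by contrast, is dense by construction.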
Three datasets with varying sizes were derived from the PDBbind general set to investigate the impact of training set size and distribution. Ten-fold cross-validation was performed on these sets, emphasizing the importance of training set quality for ML methods. The assessment of similarities, considering protein sequences, ligand molecular fingerprints, and interaction fingerprints separately, provided insights into the model's sensitivity to data similarity. The study considered dissimilar and similar datasets, shedding light on the potential influence of data mining and hidden biases in established datasets like PDBbind.
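For readers unfamiliar with the procedure, ten-fold cross-validation can be sketched as follows. This is a generic illustration, not the study's code: the helper name, the random seed, and the sample count are all hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_samples=25, k=10))
for train, validation in splits[:2]:
    print(len(train), len(validation))
```

Each complex appears in exactly one validation fold, so every sample is used for validation once and for training nine times.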
In-Depth Fusion Model Analysis
The researchers conducted an ablation study to assess the significance of incorporating extra ligand information through the GIN. The study evaluated the effect of adding each fingerprint individually and in combination. Ablation studies are particularly important for fusion models because they show how each component contributes to overall performance.
The fusion model's performance was analyzed using both the complex and ligand graphs, with the SIFP, ECIF, and CFP fingerprints introduced to enrich the features. The study found that incorporating complementary ligand information was essential for predicting binding affinity. Visualizations of the different fingerprints' effectiveness showed that specific combinations outperformed individual molecular fingerprints.
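An ablation over three optional fingerprints amounts to evaluating every subset of them. A minimal sketch of how such configurations might be enumerated is shown below; the function name is hypothetical, and the training and scoring of each configuration is left out because the article does not describe it.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

FINGERPRINTS = ("SIFP", "ECIF", "CFP")


def ablation_configs(fingerprints: Sequence[str]) -> List[Tuple[str, ...]]:
    """List every subset of fingerprints, starting with the empty baseline."""
    configs: List[Tuple[str, ...]] = [()]
    for size in range(1, len(fingerprints) + 1):
        configs.extend(combinations(fingerprints, size))
    return configs


configs = ablation_configs(FINGERPRINTS)
for config in configs:
    label = "+".join(config) if config else "graphs only"
    print(label)
```

With three fingerprints this yields eight configurations, from the graphs-only baseline to all three fingerprints combined.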
The analysis compared several candidate models to identify the most promising one. Based on stability and performance, the researchers selected a fusion model designated F_SE. They also investigated the impact of dataset size and distribution, finding consistent improvements with larger datasets. Finally, they examined the model's stability with respect to dataset similarity, concluding that continued data accumulation is needed to overcome potential inaccuracies.
The study compared the proposed model with existing methods and reported that it outperformed both classical machine learning and newer deep learning techniques. The researchers further tested its generalization using the PDBbind v2020 dataset. The discussion highlighted potential future directions, such as combining structure-based affinity prediction with semantic-based drug-target prediction for more efficient drug discovery. The authors also acknowledged the broader applicability of GNNs in drug discovery and computational biology, including pocket prediction, protein-protein interactions, and interaction prediction in other biological contexts.
Conclusion
In summary, this study significantly advanced protein-ligand binding affinity prediction through a fusion model employing GNN. The research underscored the crucial role of incorporating ligand information, demonstrated the effectiveness of specific molecular fingerprint combinations, and carefully evaluated the model's stability and performance.
Notably, the fusion model outperformed existing methods, including both classical ML and newer deep learning techniques. The study's insights into the effects of dataset size, distribution, and similarity can help make predictive models more robust. Overall, these findings pave the way for more efficient drug discovery strategies and highlight the broader applications of GNNs in computational biology.