In resource-limited settings, streamlined, data-driven drug discovery remains a challenge. In a recent paper published in the journal Nature Communications, researchers proposed a solution: ZairaChem, a fully automated artificial intelligence (AI) and machine learning (ML) tool for quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) modeling.
Background
The cost of advancing new medicines from research to practical use has steadily increased over the decades. To expedite drug discovery and reduce failures, the industry has turned to AI and ML, and investment in these areas has grown significantly in recent years. Integrating AI aims to transform the process from slow and risky to efficient and integrated, and to speed up the development of clinical candidates.
This potential extends to infectious diseases, which are often neglected in drug discovery. These diseases disproportionately affect low- and middle-income countries (LMICs), especially in Africa. Historically, international funding agencies have driven drug discovery efforts in LMICs. AI and ML can revitalize and accelerate projects in low-resource settings, but a lack of expertise and resources hinders progress.
The ZairaChem Framework
The authors proposed ZairaChem, an AutoML tool for robust, autonomous QSAR and QSPR model training. The framework involves the following steps.
Data Collection: Bioactivity data for the assays were sourced from the curated database of the Holistic Drug Discovery and Development (H3D) Centre, covering 2010 to November 2021. H3D's chemical and assay data are collected via Dotmatics software. The researchers cleaned each dataset to remove ambiguous biological results and excluded compounds whose replicate measurements differed widely. For the remaining compounds, replicate values were averaged regardless of experimental date or site.
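The replicate handling described above can be sketched in a few lines of Python. This is only an illustration: the article does not state how large a difference between replicates led to exclusion, so the fold-difference cutoff, the function name, and the example records below are all assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cutoff (assumption): the largest allowed ratio between the
# highest and lowest replicate value for one compound.
MAX_FOLD_DIFFERENCE = 10.0


def aggregate_replicates(records, max_fold=MAX_FOLD_DIFFERENCE):
    """Average replicate values per compound, dropping inconsistent compounds.

    records: iterable of (compound_id, value) pairs with positive values.
    Returns a dict mapping compound_id to the mean of its replicates.
    """
    by_compound = defaultdict(list)
    for compound_id, value in records:
        by_compound[compound_id].append(value)

    averaged = {}
    for compound_id, values in by_compound.items():
        # Skip compounds whose replicates disagree by more than the cutoff.
        if max(values) / min(values) > max_fold:
            continue
        # Replicates are pooled regardless of experimental date or site.
        averaged[compound_id] = mean(values)
    return averaged


# Made-up example records: (compound identifier, measured value).
records = [
    ("cpd-1", 1.0), ("cpd-1", 3.0),
    ("cpd-2", 0.1), ("cpd-2", 50.0),
    ("cpd-3", 4.0),
]
result = aggregate_replicates(records)
```

Here `cpd-2` is excluded because its two replicates differ by far more than the assumed cutoff, while the other compounds are reduced to a single averaged value.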
Data on the inhibition of human cytochrome P450 (CYP) enzymes were gathered from the PubChem and ChEMBL databases. PubChem BioAssay data were already binary, while ChEMBL compounds were labeled active if their bioactivity value was below 10 μM and inactive otherwise. Data not meeting these criteria were discarded.
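As a minimal sketch, the 10 μM labeling rule for ChEMBL records might look like the following. The function name is hypothetical, and treating missing or non-positive values as unusable is an assumed reading of the statement that some data were discarded.

```python
# Activity cutoff in micromolar, as stated in the text.
ACTIVITY_CUTOFF_UM = 10.0


def label_chembl_record(value_um, cutoff_um=ACTIVITY_CUTOFF_UM):
    """Return 1 (active), 0 (inactive), or None (record discarded).

    value_um: bioactivity value in micromolar, or None if missing.
    """
    # Assumption: records without a usable numeric value are discarded.
    if value_um is None or value_um <= 0:
        return None
    # Active if strictly below the cutoff, inactive otherwise.
    return 1 if value_um < cutoff_um else 0


# Made-up example values in micromolar.
labels = [label_chembl_record(v) for v in (0.5, 10.0, 25.0, None)]
```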
The ZairaChem Pipeline: ZairaChem was used to build a model for each H3D assay. Models were trained with the default settings, and the training was repeated twice a year for all assays in the virtual screening cascade.
Data Pre-processing: Pre-processing involved multiple steps, starting with identifying the relevant columns and retaining the chemical structures (represented as SMILES strings) and the outcome columns. Various descriptors were calculated for each molecule, and quantile normalization was applied. Missing data were imputed using a nearest-neighbor approach. Invariant columns were removed, and different dimensionality reduction techniques were applied.
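Two of these steps, removing invariant columns and nearest-neighbor imputation, can be illustrated with plain Python. This is a simplified sketch and not ZairaChem's code: it uses a single nearest neighbor and a root-mean-square distance over shared columns, both of which are assumptions, and it leaves out quantile normalization and dimensionality reduction.

```python
import math


def drop_invariant_columns(rows):
    """Remove columns whose observed (non-missing) values are all identical."""
    keep = []
    for j in range(len(rows[0])):
        observed = {row[j] for row in rows if row[j] is not None}
        if len(observed) > 1:
            keep.append(j)
    return [[row[j] for j in keep] for row in rows]


def impute_nearest_neighbor(rows):
    """Fill each missing value (None) from the closest row that has it."""

    def distance(a, b):
        # Root-mean-square difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows with column j observed.
            donors = [r for k, r in enumerate(rows)
                      if k != i and r[j] is not None]
            if donors:
                filled[i][j] = min(donors, key=lambda r: distance(row, r))[j]
    return filled


# Made-up descriptor matrix: one row per molecule, None marks a missing value.
descriptors = [
    [1.0, 0.0, 5.0],
    [1.1, 0.0, None],
    [9.0, 0.0, 20.0],
]
cleaned = impute_nearest_neighbor(drop_invariant_columns(descriptors))
```

The constant middle column is dropped, and the missing value in the second row is copied from the first row, which is its closest neighbor on the remaining observed descriptor.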
Small-Molecule Descriptors: ZairaChem accesses pre-trained ML models from the Ersilia Model Hub, generating numerical vectors for molecules and capturing their topological and physicochemical characteristics. Descriptors include Mordred parameters, 2D structural fingerprints (ECFP), Chemical Checker signatures, graph-based embeddings (GROVER), and chemical language model (ChemGPT) embeddings. Other descriptors from the Ersilia Model Hub can be specified as needed.
AutoML Methods: ZairaChem incorporates five automated machine learning (AutoML) methods, each enhancing specific pipeline features. These methods combine deep learning with tree-based and classical ML approaches. The first module identifies appropriate descriptor types for the task, using an AutoML framework (FLAML) for hyperparameter search. The second module focuses on the visual interpretation of the chemical space. The third module leverages the GROVER embedding with fine-tuning. The fourth module uses an image-based representation of molecules, MolMaps. The fifth module introduces a pre-trained transformer network, the TabPFN classifier, for small tabular classification tasks.
ZairaChem aggregates predictions from individual models into a consensus prediction using a weighted average, with weights based on estimated model performance. Prediction scores are transformed before aggregation. At the end of the pipeline, ZairaChem generates performance reports that include common validation metrics. A primary results spreadsheet contains the prediction output and performance metrics, complemented by plots such as the receiver operating characteristic curve and its area under the curve (AUROC).
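The consensus step can be illustrated with a small weighted-average function. This is a sketch under stated assumptions: the scores and weights are made up, the weights stand in for per-model performance estimates, and the score transformation applied before aggregation is not specified in the article, so it is omitted here.

```python
def consensus_score(scores, weights):
    """Weighted average of per-model prediction scores for one compound.

    scores: one score per model (assumed already on a common scale).
    weights: one non-negative weight per model, e.g. an estimated performance.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Made-up scores from three models and their assumed performance weights.
model_scores = [0.9, 0.6, 0.3]
model_weights = [0.8, 0.7, 0.5]
consensus = consensus_score(model_scores, model_weights)  # about 0.645
```

Models with a higher estimated performance pull the consensus toward their own score, so a single weak model has limited influence on the final prediction.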
Web-Based Predictions: A web application interacts with light versions of H3D models, preserving over 95 percent of the original performance. The app provides classification scores and percentiles.
Results and Analysis
ZairaChem employs various chemical descriptors and an ensemble of AI/ML algorithms to achieve state-of-the-art performance without the need for manual intervention. This tool is pivotal for continuous integration and deployment in environments like H3D, where data science resources are limited. ZairaChem has systematically enhanced H3D's drug discovery pipeline, generating 15 production-ready models for key assays related to antimalarial and antitubercular screening cascades. These models consistently display excellent performance, with most AUROC scores surpassing 0.75.
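For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive one. The sketch below computes it by pairwise comparison on made-up labels and scores; it is an illustration of the metric, not the evaluation code used in the paper.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for active, 0 for inactive.
    scores: predicted scores, higher meaning more likely active.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute AUROC")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # active ranked above inactive
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up example: three actives and three inactives.
example_labels = [1, 1, 0, 0, 1, 0]
example_scores = [0.9, 0.7, 0.8, 0.2, 0.6, 0.4]
example_auroc = auroc(example_labels, example_scores)  # 7/9, about 0.78
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the 0.75 level mentioned above.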
The real-world applicability of ZairaChem is exemplified through its utilization in two active medicinal chemistry programs, facilitating the identification of lead-like compounds. The virtual screening cascade demonstrates the ability to efficiently prioritize novel compounds, leading to faster elucidation of structure-activity relationships in a resource-constrained environment. The models have displayed good agreement with experimental values, underscoring their utility in accelerating the discovery of promising compounds across various chemical series and disease areas.
Conclusion
In summary, a fully open-source AI-based QSAR and QSPR virtual screening pipeline has been developed and deployed. This pioneering effort to create a virtual screening cascade with African-produced data showcases the potential of AI/ML tools to support drug discovery in resource-constrained settings. Moreover, ZairaChem offers a competitive and user-friendly software solution for modeling small-molecule bioactivity data, making it accessible even to those without extensive data science skills.
Journal reference:
- Turon, G., Hlozek, J., Woodland, J.G., Kumar, A., Chibale, K., and Duran-Frigola, M. (2023). First fully-automated AI/ML virtual screening cascade implemented at a drug discovery centre in Africa. Nature Communications 14, 5736. DOI: https://doi.org/10.1038/s41467-023-41512-2, https://www.nature.com/articles/s41467-023-41512-2