AI Models Break Barriers in Protein Discovery and Supercharge Precision Medicine


Technical University of Denmark, Apr 1 2025

With InstaNovo and InstaNovo+, scientists can now decode unknown proteins faster and more accurately—reshaping the future of diagnostics, immunotherapy, and personalised treatments.

Research: InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale proteomics experiments. Image Credit: S. Singha / Shutterstock

Researchers have developed new AI models that can vastly improve accuracy and discovery in protein science. Potentially, the models will assist the medical sciences in overcoming present challenges in personalised medicine, drug discovery, and diagnostics.

Most technical and natural science fields are advancing rapidly due to broadly available AI tools. This is particularly true in biotechnology, where AI models power breakthroughs in drug discovery, precision medicine, gene editing, food security, and many other research areas.

One sub-field is proteomics – the study of proteins on a large scale – where vast amounts of protein data are gathered in databases against which a sample can be compared. These databases enable scientists to discern which proteins are present in a sample and, thereby, in microorganisms. They allow a doctor to diagnose diseases, monitor the effectiveness of a treatment, or identify pathogens present in a patient's sample.

Although these tools are very useful and effective, there are limits to what they can do, says Timothy Patrick Jenkins, an Associate Professor at DTU Bioengineering and corresponding author:

"First off, no database includes everything, so you need to know which databases are relevant to your particular needs. Then deep searches are very time-consuming and demand a lot of computer power. And, finally, it's nearly impossible to identify proteins that haven't been registered yet."

For this reason, some groups have worked on so-called 'de novo sequencing algorithms' that improve accuracy and lower computational costs with increasing database size. Still, according to Jenkins and colleagues from DTU, Delft University in the Netherlands, and the British AI company InstaDeep, their performance remained "underwhelming."

a, ProteomeTools datasets and their PRIDE repository identifiers. Each dataset covers a unique set of synthetic peptides, derived from human protein sequences, which have been measured with MS. b, Overview of data extraction and preprocessing steps. Raw data were matched with the results of a database search with target-decoy FDR estimation (controlled at 1%) to create the training dataset of our models. c, InstaNovo (IN) model architecture. The model takes a mass spectrum as input, which is transformed to a latent embedding representation using multi-scale sinusoidal embeddings that encode the intensity and m/z vectors. This is passed through L transformer encoder layers, each with multiple heads to derive a cross-attention representation of the peaks in the spectrum. Additional precursor information is included and concatenated to form the encoder output, which is cross-attended by L decoder layers. The precursor information may alternatively be encoded as the start-of-sequence token in the decoder. The decoder takes in an embedding of the partially decoded peptide sequence, and is responsible for predicting the next residue of the peptide. A knapsack beam search decoding is applied to ensure the model outputs a confident prediction that matches the precursor mass and charge. d, Overview of the iterative refinement model, InstaNovo+ (IN+). The model features the IN encoder and a diffusion decoder, which iterates over sequence predictions in a series of timesteps, denoising and refining predictions using a multinomial probability distribution for discrete sequence prediction. t is the denoising timestep, x_t is the noised sequence at timestep t, x₀ is the denoised sequence where t = 0. p is the posterior distribution over x_{t−1} given x_t.
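The caption says decoding is constrained so that the predicted peptide matches the precursor mass and charge. The sketch below shows the basic arithmetic behind such a check; it is not the knapsack beam search itself and is not taken from the InstaNovo code. The function names, the 20 ppm tolerance, and the restriction to unmodified residues are assumptions made for the example. The residue masses are the standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (standard values, unmodified residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # one water is added for the peptide termini
PROTON = 1.00728   # mass of each charge-carrying proton


def peptide_mz(sequence, charge):
    """Theoretical m/z of an unmodified peptide at the given charge state."""
    neutral_mass = sum(RESIDUE_MASS[r] for r in sequence) + WATER
    return (neutral_mass + charge * PROTON) / charge


def matches_precursor(sequence, precursor_mz, charge, tol_ppm=20.0):
    """True if the candidate's m/z is within tol_ppm of the measured precursor.

    tol_ppm is an illustrative default, not a value from the paper.
    """
    theoretical = peptide_mz(sequence, charge)
    error_ppm = abs(theoretical - precursor_mz) / precursor_mz * 1e6
    return error_ppm <= tol_ppm
```

For instance, `peptide_mz("PEPTIDE", 2)` comes out at roughly 400.687, so a candidate such as `"PEPTIDA"` would be rejected for a precursor measured at that m/z. Leucine and isoleucine have the same mass, which is why a mass check alone cannot tell them apart.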

Exceeding state-of-the-art

In a new paper in Nature Machine Intelligence, they propose two novel AI models to assist researchers, medical practitioners, and commercial entities in finding precisely the necessary information in the vast amounts of data. These are called InstaNovo and InstaNovo+ and are available to researchers through the InstaDeep website (see fact box).

"Seen together, our models exceed state-of-the-art and are significantly more precise than currently available tools. Furthermore, as we show in the paper, our models are not specific to a particular research area. Instead, these tools could propel significant advances in all fields involving proteomics," says Kevin Michael Eloff, a research engineer at InstaDeep and co-first author of the paper.

To assess the usefulness of their models, the researchers have trained and tested them on several specific tasks within major areas of interest.

One investigation was performed on wound fluid from venous leg ulcer patients. Since venous leg ulcers are notoriously difficult to treat and often become chronic, knowing which microorganisms, such as bacteria, are present is crucial to treatment. The models could map ten times as many sequences as a database search, including sequences from E. coli and Pseudomonas aeruginosa, the latter being a multidrug-resistant bacterium.

Another case study examined small pieces of protein, called peptides, displayed on the surface of cells. These help the immune system recognize infections and diseases such as cancer. The InstaNovo models identified thousands of new peptides that were not found using traditional methods. In personalised cancer treatments that empower the immune system (immunotherapy for short), these peptides are all potential attack points.

"In combination, our tests of the model on complex cases, where, for example, unknown proteins are present, or where we have no prior knowledge of the organisms involved, show that they are suitable to improve our understanding significantly. That this bodes well for biomedicine is a given, since it can directly improve identification of our microbiome, as well as improve our efforts within personalised medicine and cancer immunology," says Konstantinos Kalogeropoulos, co-first author and Assistant Professor at DTU Bioengineering.

The paper provides six additional cases demonstrating how these models improve therapeutic sequencing, discover novel peptides, detect unreported organisms, and significantly enhance proteomics searches. The implications of their results extend far beyond the medical sciences, says Timothy Patrick Jenkins:

"Looking at it from a purely technical, scientific perspective, it is also true that with these tools, we can improve our understanding of the biological world as a whole, not only in terms of healthcare but also in industry and academia. Within every field using proteomics - be it plant science, veterinary science, industrial biotech, environmental monitoring, or archaeology - we can gain insights into protein landscapes that have been inaccessible until now."

What Are InstaNovo and InstaNovo+?

InstaNovo is a transformer-based model designed for de novo peptide sequencing. Developed in collaboration between InstaDeep and the Department of Biotechnology and Biomedicine at the Technical University of Denmark (DTU), it translates fragment ion peaks from mass spectrometry data into peptide sequences with unprecedented precision.

Unlike traditional methods that rely on preexisting databases, InstaNovo identifies peptides that have never been documented before, expanding the landscape of proteomic discovery.

A key innovation of the InstaNovo models is InstaNovo+, a diffusion-based iterative refinement model that enhances sequence accuracy by mimicking how researchers manually refine peptide predictions. InstaNovo+ begins with an initial sequence, either derived from InstaNovo or generated at random, and improves it step by step.

When paired with InstaNovo, InstaNovo+ significantly reduces false discovery rates (FDR) and improves sequence accuracy by refining predictions and exploring a broader range of potential peptide sequences.

Unlike autoregressive models such as InstaNovo and others, which predict peptide sequences one amino acid at a time, InstaNovo+ processes entire sequences holistically, enabling greater accuracy and higher detection rates.
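To give a feel for what "diffusion over whole sequences" means, here is a minimal sketch of the forward (noising) half of a discrete diffusion process. It assumes the common multinomial formulation, in which each residue is kept with probability 1 − β and otherwise resampled uniformly from the alphabet. The source does not spell out InstaNovo+'s noise schedule, so the function names, the alphabet, and the β values are illustrative assumptions. The learned denoiser, which runs this process in reverse while conditioning on the spectrum, is not shown.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids (assumed vocabulary)


def noise_step(sequence, beta, rng):
    """One forward step applied to every position of the sequence at once.

    Each residue is kept with probability 1 - beta; otherwise it is
    replaced by a residue drawn uniformly from ALPHABET.
    """
    out = []
    for residue in sequence:
        if rng.random() < beta:
            out.append(rng.choice(ALPHABET))
        else:
            out.append(residue)
    return "".join(out)


def noise(sequence, betas, seed=0):
    """Apply one noising step per entry in betas; return x_0, x_1, ..., x_T."""
    rng = random.Random(seed)
    x = sequence
    trajectory = [x]
    for beta in betas:
        x = noise_step(x, beta, rng)
        trajectory.append(x)
    return trajectory
```

Calling `noise("PEPTIDE", [0.1, 0.2, 0.4, 0.8])` yields a list of progressively more corrupted sequences. A refinement model of the kind described above would be trained to walk such a trajectory backwards, updating all positions at each timestep rather than emitting one amino acid at a time.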

InstaNovo and InstaNovo+ enhance de novo peptide sequencing, striking a balance between precision and exploration to accelerate biological discovery.

Source:

Technical University of Denmark and InstaDeep

Journal reference:

Eloff, K., Kalogeropoulos, K., Mabona, A., Morell, O., Catzel, R., Berg Jespersen, J., Williams, W., Van Beljouw, S. P., Skwark, M. J., Laustsen, A. H., Brouns, S. J., Ljungars, A., Schoof, E. M., Van Goey, J., Beguir, K., Lopez Carranza, N., & Jenkins, T. P. (2025). InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale proteomics experiments. Nature Machine Intelligence, 1-15. DOI: 10.1038/s42256-025-01019-5, https://www.nature.com/articles/s42256-025-01019-5
