Deep Learning Boosts Proteomics Data Accuracy

In a paper published in the journal Nature Communications, researchers demonstrated deep learning (DL) techniques, namely collaborative filtering (CF), denoising autoencoders (DAE), and variational autoencoders (VAE), for imputing missing values in mass spectrometry (MS)-based proteomics data. They introduced the proteomics imputation modeling mass spectrometry (PIMMS) method and applied it to a cohort with alcohol-related liver disease (ALD).

Study: Deep Learning Boosts Proteomics Data Accuracy. Image Credit: vectorfusionart/Shutterstock

After removing 20% of the intensities, the PIMMS variational autoencoder (PIMMS-VAE) recovered 15 of 17 significantly abundant protein groups and, when analyzing the full dataset, identified 30 additional proteins (+13.2%). These proteins predicted ALD progression in machine learning (ML) models. The study recommends DL for imputing missing values in large MS-based proteomics datasets.
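For orientation, the sketch below shows how imputed protein intensities could feed a downstream disease-stage classifier; the data, feature names, and model choice are placeholders of ours, not the study's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholders standing in for imputed protein-group log2 intensities
# and binary disease-stage labels (not data from the study).
rng = np.random.default_rng(0)
X_imputed = rng.normal(size=(200, 50))
y_stage = rng.integers(0, 2, size=200)

# Standardize features, then fit a regularized logistic regression with cross-validation.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X_imputed, y_stage, cv=5, scoring="roc_auc")
print(f"cross-validated ROC AUC: {scores.mean():.2f} +/- {scores.std():.2f}")
```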

Background

Past work on proteomics has focused on identifying and quantifying proteins, but missing values in MS data pose challenges for accurate analysis. The semi-stochastic selection of precursors during MS acquisition produces a large number of missing values. Traditional imputation methods assume that missing values arise from low protein abundance, which can lead to biased results when that assumption does not hold. Because the missing mechanism varies, a single imputation strategy can produce incorrect values and compromise the accuracy of downstream analyses.
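To make the two missing-value mechanisms concrete, the following sketch (an illustration of ours, not code from the study) simulates log2 intensities, removes values either completely at random (MCAR) or preferentially at low abundance (MNAR), and shows how replacing missing entries with a fixed low value biases the mean when the data are actually MCAR.

```python
import numpy as np

rng = np.random.default_rng(0)
log2_intensity = rng.normal(loc=25.0, scale=2.0, size=10_000)  # simulated log2 intensities

# MCAR: every value has the same chance of being missing.
mcar_mask = rng.random(log2_intensity.size) < 0.2

# MNAR: low-abundance values are far more likely to be missing.
detection_prob = 1 / (1 + np.exp(-(log2_intensity - 22.0)))  # logistic detection curve
mnar_mask = rng.random(log2_intensity.size) > detection_prob

def impute_with_low_constant(values, missing_mask):
    """Replace missing entries with a low constant, as abundance-based imputation assumes."""
    imputed = values.copy()
    imputed[missing_mask] = np.percentile(values[~missing_mask], 1)  # shift to the low tail
    return imputed

print("true mean:               ", log2_intensity.mean().round(2))
print("low-constant under MNAR: ", impute_with_low_constant(log2_intensity, mnar_mask).mean().round(2))
print("low-constant under MCAR: ", impute_with_low_constant(log2_intensity, mcar_mask).mean().round(2))
# Under MCAR the low-constant strategy drags the mean down noticeably, illustrating
# the bias that motivates model-based imputation.
```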

Proteomics Dataset Overview

The Henrietta Lacks (HeLa) cell lines were measured repeatedly for maintenance (MNT) and quality control (QC) at the Novo Nordisk Foundation Center for Protein Research (NNF CPR) and the Max Planck Institute of Biochemistry. After instrument cleaning, the samples were run as QC during cohort measurements or as MNT runs using various column lengths and liquid chromatography methods.

Different lysis protocols were used, typically involving trypsin digestion, with injection volumes ranging from one to seven microliters. This dataset, obtained using data-dependent acquisition (DDA) label-free quantification, helps explore the applicability of self-supervised learning to proteomics data.

The analysts processed 564 raw files from HeLa cell lines using a Snakemake workflow with MaxQuant (v.1.6.12). The analysis used the 2019_05 release of the UniProt human reference proteome database for DDA, controlling contaminants with the default contaminant FASTA shipped with MaxQuant. They extracted precursor quantifications from evidence.txt, aggregated peptides from peptides.txt, and protein groups from proteinGroups.txt. Detailed pre-processing steps are available in a Data Descriptor.
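As a rough illustration of how such MaxQuant output tables are typically read into a Python workflow, one might proceed as below; the file paths and column names follow common MaxQuant conventions and are assumptions rather than the study's exact pipeline.

```python
import numpy as np
import pandas as pd

# Paths and column names follow common MaxQuant output conventions and are
# placeholders rather than the study's actual configuration.
evidence = pd.read_csv("txt/evidence.txt", sep="\t", low_memory=False)        # precursor level
peptides = pd.read_csv("txt/peptides.txt", sep="\t", low_memory=False)        # aggregated peptides
protein_groups = pd.read_csv("txt/proteinGroups.txt", sep="\t", low_memory=False)

# Drop reverse hits and potential contaminants flagged by MaxQuant.
for flag in ("Reverse", "Potential contaminant"):
    if flag in protein_groups.columns:
        protein_groups = protein_groups[protein_groups[flag].fillna("") != "+"]

# Collect per-sample intensity columns into a samples x protein-groups matrix.
intensity_cols = [c for c in protein_groups.columns if c.startswith("Intensity ")]
intensities = protein_groups.set_index("Protein IDs")[intensity_cols].T
intensities = intensities.replace(0, np.nan)  # MaxQuant reports non-detections as 0
print(intensities.shape)
```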

A two-step feature and sample selection procedure was used, applying a 25% feature prevalence cutoff and a 50% sample completeness threshold. The dataset was split into training (90%), validation (5%), and test (5%) sets, with simulated missing values (75% missing completely at random (MCAR) and 25% missing not at random (MNAR)). Validation data were used for early stopping, and performance on the validation and test data was confirmed to be similar. This strategy ensured comprehensive evaluation across intensity ranges.
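A minimal sketch of such a split, assuming a samples-by-features intensity matrix and using a simple low-intensity-quantile rule to approximate MNAR (a simplification of ours, not the paper's exact procedure), could look like this:

```python
import numpy as np
import pandas as pd

def split_with_simulated_missing(X: pd.DataFrame, frac_holdout=0.10, frac_mnar=0.25, seed=42):
    """Hold out observed intensities from a samples x features matrix as
    validation/test targets: 75% sampled completely at random (MCAR) and 25%
    preferentially from each feature's low-intensity tail (a simple MNAR proxy)."""
    rng = np.random.default_rng(seed)
    observed = X.notna().to_numpy()
    n_holdout = int(observed.sum() * frac_holdout)
    n_mnar = int(n_holdout * frac_mnar)

    # MNAR part: sample among observed values below each feature's 25th percentile.
    low = X.lt(X.quantile(0.25)).to_numpy() & observed
    low_idx = np.argwhere(low)
    mnar = low_idx[rng.choice(len(low_idx), size=min(n_mnar, len(low_idx)), replace=False)]

    # MCAR part: sample uniformly from the remaining observed values.
    taken = np.zeros_like(observed)
    taken[mnar[:, 0], mnar[:, 1]] = True
    rest = np.argwhere(observed & ~taken)
    mcar = rest[rng.choice(len(rest), size=n_holdout - len(mnar), replace=False)]

    holdout = np.vstack([mnar, mcar])
    rng.shuffle(holdout)                              # mix MNAR and MCAR positions
    val_idx, test_idx = np.array_split(holdout, 2)    # 5% validation, 5% test

    arr = X.to_numpy(dtype=float).copy()
    arr[holdout[:, 0], holdout[:, 1]] = np.nan        # mask held-out values in the training view
    X_train = pd.DataFrame(arr, index=X.index, columns=X.columns)
    return X_train, val_idx, test_idx
```

In this sketch, the held-out positions serve as reconstruction targets, with the validation half driving early stopping as described above.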

The clinical dataset included 457 plasma samples from liver disease patients, measured using data-independent acquisition (DIA) and processed with Spectronaut (v.15.4). Peptide and protein group quantifications were extracted, with feature selection following the HeLa data strategy. For differential abundance analysis, 348 complete clinical samples were used, employing a standardized workflow for comparison. Analysis of covariance (ANCOVA) was used for the differential analysis, controlling for various covariates and correcting for multiple testing with the Benjamini-Hochberg procedure.
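A hedged sketch of such a per-protein ANCOVA using statsmodels is shown below; the covariate and stage column names are illustrative placeholders rather than the study's exact model.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

def differential_abundance(intensities: pd.DataFrame, clinical: pd.DataFrame,
                           stage_col: str = "fibrosis_stage",
                           covariates=("age", "bmi", "sex")) -> pd.DataFrame:
    """Per-protein ANCOVA: regress log2 intensity on a numeric disease-stage
    variable plus covariates, then control the FDR with Benjamini-Hochberg.
    Column names are illustrative placeholders, not the study's exact model."""
    rhs = " + ".join([stage_col, *covariates])
    rows = []
    for protein in intensities.columns:
        data = clinical.join(intensities[protein].rename("y")).dropna()
        fit = smf.ols(f"y ~ {rhs}", data=data).fit()
        # stage_col is treated as numeric here; a categorical encoding would
        # change the parameter name looked up below.
        rows.append({"protein": protein, "pvalue": fit.pvalues[stage_col]})
    res = pd.DataFrame(rows)
    res["qvalue"] = multipletests(res["pvalue"], method="fdr_bh")[1]
    return res.sort_values("qvalue")
```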

Proteomics Imputation Study

The study evaluated three unsupervised DL models for imputing proteomics data: CF, a DAE, and a VAE with a stochastic latent space. Comparisons were made against traditional methods such as median imputation and more advanced techniques such as k-nearest neighbors (KNN) and random forest (RF). Of these, nine methods scaled effectively to the high-dimensional data. The analysis, conducted on a dataset of 564 HeLa runs, measured imputation performance using the mean absolute error (MAE) on log2-scaled intensities for simulated missing values.
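To give a flavor of the self-supervised setup, the following is a minimal denoising-autoencoder imputation sketch in PyTorch; it is an illustrative reimplementation of the general idea rather than the PIMMS code, and the layer sizes, corruption rate, and zero-filling of missing inputs are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Small DAE: corrupt observed intensities, reconstruct, score only observed entries."""
    def __init__(self, n_features: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_dae(X: torch.Tensor, observed: torch.Tensor, epochs: int = 200, corrupt: float = 0.1):
    """X: samples x features log2 intensities with missing entries set to 0.
    observed: boolean mask of measured values. Loss is computed only on observed entries."""
    model = DenoisingAutoencoder(X.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        noise_mask = (torch.rand_like(X) < corrupt) & observed   # randomly hide some observed values
        x_in = torch.where(noise_mask, torch.zeros_like(X), X)
        recon = model(x_in)
        loss = ((recon - X)[observed] ** 2).mean()                # reconstruction error on measured entries
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        imputed = torch.where(observed, X, model(X))              # keep measured values, fill the rest
    return imputed
```

Training hides a random fraction of the observed values, scores the reconstruction only on measured entries, and finally fills in the genuinely missing positions with the model's predictions.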

Results indicated that CF, DAE, and VAE achieved MAE values of 0.55, 0.54, and 0.58, respectively, outperforming median imputation (MAE = 1.24). Bayesian principal component analysis (BPCA) slightly surpassed other methods with an MAE of 0.53 on protein groups. Across different data aggregation levels (protein groups, aggregated peptides, and precursors), the DL models showed comparable performance to traditional methods, albeit with decreasing efficacy as the percentage of simulated missing values increased.

Self-supervised DL effectively imputes proteomics data, especially at lower levels of data aggregation, despite the challenges posed by high-dimensional data with many missing values. Future research should enhance the models for better accuracy and scalability in biological data analysis, advancing reliability in proteomics research.

Conclusion

To sum up, DL techniques such as CF, DAE, and VAE successfully replaced missing measurements in label-free MS-based proteomics quantification. Applied to an alcohol-related liver disease cohort of 358 individuals, the PIMMS method demonstrated significant recovery of abundant protein groups and identification of additional proteins associated with disease stages.

ML models further validated the predictive value of these proteins for ALD progression. The study underscores the efficacy of DL approaches in handling missing data in large-scale proteomics studies and offers robust workflows for future research.

Journal reference:
  • Webel, H., et al. (2024). Imputation of label-free quantitative mass spectrometry-based proteomics data using self-supervised deep learning. Nature Communications, 15:1, 5405. DOI:10.1038/s41467-024-48711-5, https://www.nature.com/articles/s41467-024-48711-5

Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.


