In a paper published in the journal Nature Communications, researchers demonstrated deep learning (DL) techniques like collaborative filtering (CF), denoising autoencoders (DAE), and variational autoencoders (VAE) for imputing missing values in mass spectrometry (MS) based proteomics data. They introduced the proteomics imputation modeling mass spectrometry (PIMMS) method and applied it to a cohort with alcohol-related liver disease (ALD).
After 20% of the intensities were removed, PIMMS-VAE imputation recovered 15 of 17 significantly differentially abundant protein groups. When the full dataset was analyzed, it identified 30 additional proteins (+13.2%). These proteins predicted ALD progression in machine learning (ML) models. The study recommends DL for imputing missing values in large MS-based proteomics datasets.
Background
Past work on proteomics has focused on identifying and quantifying proteins, but missing values in MS data pose challenges for accurate analysis. The semi-stochastic precursor selection in MS results in abundant missing values. Traditional imputation methods assume missing values are due to low protein abundance, which can bias results. Because the mechanisms behind missing values vary, such assumptions can produce incorrect imputations and reduce the accuracy of downstream analyses.
Proteomics Dataset Overview
The Henrietta Lacks (HeLa) cell lines were measured repeatedly for maintenance (MNT) and quality control (QC) at the Novo Nordisk Foundation Center for Protein Research (NNF CPR) and the Max Planck Institute of Biochemistry. After instrument cleaning, the samples were run as QC during cohort measurements or as MNT using various column lengths and liquid chromatography methods.
Different lysis protocols were used, typically involving trypsin digestion, with injection volumes ranging from one to seven microliters. This dataset, obtained using data-dependent acquisition (DDA) label-free quantification, helps explore the applicability of self-supervised learning to proteomics data.
The analysts processed 564 raw files from HeLa cell lines using a Snakemake workflow in MaxQuant 1.6.12. The analysis used the UniProt human reference proteome database 2019_05 release for DDA, controlling contaminants with the default contaminants FASTA file in MaxQuant. They extracted precursor quantifications from evidence.txt, aggregated peptides from peptides.txt, and protein groups from proteinGroups.txt. Detailed pre-processing steps are available in a Data Descriptor.
A two-step feature and sample selection procedure was used, applying a 25% feature prevalence cutoff and a 50% sample completeness threshold. The dataset was split into training (90%), validation (5%), and test (5%) sets, with simulated missing values (75% missing completely at random (MCAR) and 25% missing not at random (MNAR)). Validation data were used for early stopping, and performance on validation and test data was checked to be similar. This strategy ensured comprehensive evaluation across intensity ranges.
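The selection and split described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PIMMS implementation: the function names, the toy matrix, and the use of None for a missing intensity are all hypothetical, and the held-out cells are drawn uniformly at random, so the 75%/25% MCAR/MNAR mix from the study is not reproduced.

```python
import random

def select_features_and_samples(matrix, feature_prevalence=0.25, sample_completeness=0.50):
    """Two-step filter on a samples x features matrix (None marks a missing value)."""
    n_samples = len(matrix)
    n_features = len(matrix[0])
    # Step 1: keep features observed in at least `feature_prevalence` of all samples.
    kept_features = [
        j for j in range(n_features)
        if sum(row[j] is not None for row in matrix) / n_samples >= feature_prevalence
    ]
    # Step 2: keep samples that contain at least `sample_completeness` of the kept features.
    kept_samples = [
        i for i, row in enumerate(matrix)
        if sum(row[j] is not None for j in kept_features) >= sample_completeness * len(kept_features)
    ]
    return kept_samples, kept_features

def split_observed(matrix, samples, features, fractions=(0.90, 0.05, 0.05), seed=0):
    """Randomly assign observed (sample, feature) cells to train/validation/test."""
    cells = [(i, j) for i in samples for j in features if matrix[i][j] is not None]
    random.Random(seed).shuffle(cells)
    n_train = round(fractions[0] * len(cells))
    n_val = round(fractions[1] * len(cells))
    return cells[:n_train], cells[n_train:n_train + n_val], cells[n_train + n_val:]

# Hypothetical toy data: 5 samples x 4 features of log2 intensities.
toy = [
    [20.1, None, 18.3, None],
    [19.8, 21.0, None, None],
    [None, 20.5, 18.9, None],
    [20.3, 20.9, 18.1, None],
    [None, None, None, 15.0],
]
kept_samples, kept_features = select_features_and_samples(toy)
train, val, test = split_observed(toy, kept_samples, kept_features)
```

In the toy matrix, the last feature is seen in only one of five samples and falls below the 25% cutoff, and the last sample then contains none of the remaining features, so both are dropped before the split.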
The clinical dataset included 457 plasma samples from liver disease patients, measured using data-independent acquisition (DIA) and processed with Spectronaut v.15.4. Peptide and protein group quantifications were extracted, with feature selection following the HeLa data strategy. For differential abundance analysis, 348 complete clinical samples were used, employing a standardized workflow for comparison. Analysis of covariance (ANCOVA) was used for differential analysis, controlling for various covariates and correcting for multiple testing with the Benjamini-Hochberg procedure.
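The Benjamini-Hochberg step mentioned above can be illustrated with a short, self-contained Python sketch. The function name and the example p-values are hypothetical and are not taken from the study. The ANCOVA models that would produce such p-values are not shown.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical p-values, one per protein group.
p_example = [0.01, 0.04, 0.03, 0.005]
q_example = benjamini_hochberg(p_example)
significant = [q <= 0.05 for q in q_example]
```

Each adjusted value is the raw p-value scaled by the number of tests divided by its rank, capped so that a smaller p-value never receives a larger adjusted value than a larger one.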
Proteomics Imputation Study
The study evaluated three self-supervised DL models for imputing proteomics data: CF, a DAE, and a VAE with a stochastic latent space. Comparisons were made against traditional methods like median imputation and more advanced techniques such as k-nearest neighbors (KNN) and random forest (RF). Of the methods compared, nine scaled effectively to high-dimensional data. The analysis, conducted on a dataset of 564 HeLa runs, measured imputation performance using mean absolute error (MAE) on log2-scaled intensities for simulated missing values.
Results indicated that CF, DAE, and VAE achieved MAE values of 0.55, 0.54, and 0.58, respectively, outperforming median imputation (MAE = 1.24). Bayesian principal component analysis (BPCA) slightly surpassed other methods with an MAE of 0.53 on protein groups. Across different data aggregation levels (protein groups, aggregated peptides, and precursors), the DL models showed comparable performance to traditional methods, albeit with decreasing efficacy as the percentage of simulated missing values increased.
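The MAE comparison against a median baseline can be sketched as follows. This is a hedged illustration only: the helper names, the protein group labels, and all intensity values are invented for the example, and the DL models themselves are not shown.

```python
import math
import statistics

def to_log2(intensities):
    """Log2-transform raw intensities."""
    return [math.log2(x) for x in intensities]

def feature_medians(train):
    """Median of the observed training values for each feature."""
    return {feature: statistics.median(values) for feature, values in train.items()}

def mean_absolute_error(truth, predicted):
    """Average absolute difference between true and imputed values."""
    return sum(abs(t - p) for t, p in zip(truth, predicted)) / len(truth)

# Hypothetical training intensities per protein group, on the log2 scale.
train = {
    "PG1": to_log2([2**20, 2**21, 2**22]),
    "PG2": to_log2([2**15, 2**17]),
}
# Hypothetical held-out cells (simulated missing values) with their true log2 intensity.
held_out = [("PG1", 20.5), ("PG2", 18.0)]

medians = feature_medians(train)
truth = [value for _, value in held_out]
predicted = [medians[feature] for feature, _ in held_out]
mae = mean_absolute_error(truth, predicted)
```

Any other imputation method can be scored the same way by swapping its predictions in for the medians, which makes the reported MAE values directly comparable across methods.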
Self-supervised DL effectively imputes proteomics data, especially at lower data aggregation levels, despite the challenges posed by high-dimensional data with many missing values. Future research should improve the accuracy and scalability of these models to make proteomics analyses more reliable.
Conclusion
To sum up, DL-based imputation techniques such as CF, DAE, and VAE successfully replaced missing measurements in label-free quantitative MS-based proteomics data. Applied to an alcohol-related liver disease cohort of 358 individuals, PIMMS recovered most of the significantly abundant protein groups and identified additional proteins associated with disease stages.
ML models further validated the predictive value of these proteins in ALD progression. The study underscores the efficacy of DL approaches in handling missing data in large-scale proteomics studies, offering robust workflows for future research.
Journal reference:
- Webel, H., et al. (2024). Imputation of label-free quantitative mass spectrometry-based proteomics data using self-supervised deep learning. Nature Communications, 15:1, 5405. DOI:10.1038/s41467-024-48711-5, https://www.nature.com/articles/s41467-024-48711-5