Preparation of training data and deep learning of peptide fragment ion intensities were performed using the Prosit model architecture7 with keras (2.1.1), tensorflow (1.4.0), numpy (1.14.5) and scipy (1.1.0). The peptide encoder consists of three layers: a bi-directional recurrent neural network (BDN) with gated recurrent memory units (GRU), a recurrent GRU layer and an attention layer, all with dropout. The recurrent layers use 512 memory cells each and the latent space is 512-dimensional. The precursor charge and NCE encoder is a single dense layer with the same output size as the peptide encoder. The latent peptide vector is decorated with the precursor charge and NCE vector by element-wise multiplication. A one-layer, length-29 BDN with GRUs, dropout and attention acts as the decoder for fragment ion intensity. A keras model file was deposited on GitHub (www.github.com/kusterlab/prosit/) and Zenodo with DOI zenodo.472135334.

In brief, the publicly available ProteomeTools data (PRIDE datasets PXD004732 and PXD010595), as well as the data presented in this study (PRIDE dataset PXD021013), were used as training data. RAW data were searched with MaxQuant (version 1.5.3.30) using standard settings with a 1% FDR filter at the PSM, protein or site level. Unprocessed spectra for MaxQuant's rank 1 PSMs (from msms.txt) were extracted from the RAW files using Thermo Fisher's RawFileReader (http://planetorbitrap.com/rawfilereader), and b- and y-ions were annotated for fragment charges 1 to 3. The initial data included all PSMs for the same peptide and was restricted to peptide lengths of 7 to 30 amino acids, precursor charges <7 and Andromeda scores >40. NCE values of all runs were calibrated as described, and spectra were transformed into a tensor format compatible with the machine learning models.
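The encoder-decoder layout described above can be summarized with a short Keras sketch. This is a minimal illustration rather than the deposited model: the layer widths, latent size, element-wise multiplication and length-29 decoder follow the text, whereas the embedding size, dropout rates, input encoding, activations and the simple attention pooling are assumptions; the keras model file on GitHub/Zenodo remains the authoritative definition.

```python
# Minimal sketch of a Prosit-style encoder-decoder in tf.keras (assumed details noted inline).
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN = 30      # peptides of 7-30 amino acids, padded to length 30
VOCAB = 22        # assumed token vocabulary for amino acids and modifications
LATENT = 512      # size of the latent peptide vector

def attention_pool(seq):
    """Collapse a (batch, time, features) sequence to (batch, features) with learned weights."""
    scores = layers.Dense(1, activation="tanh")(seq)
    weights = layers.Softmax(axis=1)(scores)
    return layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([seq, weights])

# Peptide encoder: bi-directional GRU -> GRU -> attention, all with dropout
peptide = layers.Input(shape=(MAX_LEN,), name="peptide")
x = layers.Embedding(VOCAB, 32)(peptide)                 # embedding size is an assumption
x = layers.Bidirectional(layers.GRU(LATENT, return_sequences=True))(x)
x = layers.Dropout(0.3)(x)
x = layers.GRU(LATENT, return_sequences=True)(x)
x = layers.Dropout(0.3)(x)
peptide_latent = attention_pool(x)                       # (batch, 512)

# Precursor charge + NCE encoder: a single dense layer of the same output size
meta = layers.Input(shape=(7,), name="charge_and_nce")   # e.g. one-hot charge (6) + NCE
meta_latent = layers.Dense(LATENT, activation="relu")(meta)

# Decorate the latent peptide vector with charge/NCE by element-wise multiplication
decorated = layers.Multiply()([peptide_latent, meta_latent])

# Decoder: repeat to the 29 possible fragmentation positions, one-layer bi-directional
# GRU with dropout (the decoder attention described in the text is omitted in this sketch)
d = layers.RepeatVector(29)(decorated)
d = layers.Bidirectional(layers.GRU(LATENT, return_sequences=True))(d)
d = layers.Dropout(0.3)(d)
# 6 intensities per position: b- and y-ions at fragment charges 1-3
intensities = layers.TimeDistributed(layers.Dense(6))(d)
intensities = layers.Flatten(name="intensities")(intensities)  # 29 * 6 = 174 values

model = Model(inputs=[peptide, meta], outputs=intensities)
```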
HCD training data were split into three distinct sets with each peptide sequence included in only one of the three: Training (70%, ~593k modified peptides, ~9.9M PSMs), Test (20%, ~170k modified peptides, ~2.8M PSMs) and Holdout (10%, ~84k modified peptides, ~1.4M PSMs). CID training data were split in the same way: Training (70%, ~500k modified peptides, ~2.9M PSMs), Test (20%, ~142k modified peptides, ~0.8M PSMs) and Holdout (10%, ~71k modified peptides, ~0.4M PSMs). The model was trained and optimized on the Training set. The Test set was used to control for overfitting with early stopping. The Holdout set was used to evaluate the model's generalization and potential biases. Normalized spectral contrast loss7 was used as the loss function. We used the Adam optimizer with a cyclic learning rate (CLR) algorithm35. During training, the learning rate cycles between a constant lower limit (0.0000001) and an upper limit (initially 0.001; after restart for HCD, 0.0001), which is continuously scaled by a factor of 0.95 every 8 epochs. Models were trained on Nvidia Titan Xp GPUs with 512 samples per batch for 195 epochs. During training, out-of-range or out-of-charge fragment ions were masked and not considered in further analysis; for example, for a peptide of length 10, y- and b-ions 10 to 29, as well as fragment ions with a charge higher than that of their precursor, are masked. For training, the data was restricted to the top 3 highest-scoring spectra for each combination of peptide sequence, modification status, precursor charge, fragmentation method, fragmentation energy and mass analyzer.
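As an illustration of the training objective, the normalized spectral contrast loss can be written as a spectral angle between the L2-normalized observed and predicted intensity vectors, with masked fragment ions excluded from the comparison. The sketch below is an approximation under assumptions, not the published implementation: it assumes masked (out-of-range or out-of-charge) positions are encoded as -1 in the target vector, and the cyclic learning rate schedule is not shown; the exact loss ships with the public Prosit code.

```python
# Sketch of a masked normalized spectral contrast (spectral angle) loss, assuming
# masked fragment positions are encoded as -1 in y_true.
import math
import tensorflow as tf

def masked_spectral_distance(y_true, y_pred):
    epsilon = tf.keras.backend.epsilon()
    mask = tf.cast(y_true >= 0, y_pred.dtype)        # 1 for valid fragment ions, 0 for masked
    true_masked = y_true * mask
    pred_masked = y_pred * mask
    true_norm = tf.nn.l2_normalize(true_masked, axis=-1)
    pred_norm = tf.nn.l2_normalize(pred_masked, axis=-1)
    cos_sim = tf.reduce_sum(true_norm * pred_norm, axis=-1)
    arccos = tf.acos(tf.clip_by_value(cos_sim, -1.0 + epsilon, 1.0 - epsilon))
    return 2.0 * arccos / math.pi                    # 0 = identical spectra, 1 = orthogonal

# Example wiring to the Adam optimizer for the model sketched above; the cyclic
# learning rate described in the text would be supplied via a schedule or callback.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=masked_spectral_distance)
```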