Preparation of training data and deep learning of peptide fragment ion intensities were performed using the Prosit model architecture7 with keras (2.1.1), tensorflow (1.4.0), numpy (1.14.5) and scipy (1.1.0). The peptide encoder consists of three layers: a bi-directional recurrent neural network (BDN) with gated recurrent memory units (GRU), a recurrent GRU layer and an attention layer, all with dropout. The recurrent layers use 512 memory cells each and the latent space is 512-dimensional. The precursor charge and NCE encoder is a single dense layer with the same output size as the peptide encoder. The latent peptide vector is decorated with the precursor charge and NCE vector by element-wise multiplication. A one-layer, length-29 BDN with GRUs, dropout and attention acts as the decoder for fragment ion intensity. A keras model file was deposited on GitHub (www.github.com/kusterlab/prosit/) and Zenodo with DOI zenodo.472135334.

In brief, the publicly available ProteomeTools data (PRIDE datasets PXD004732 and PXD010595), as well as the data presented in this study (PRIDE dataset PXD021013), were used as training data. RAW data were searched with MaxQuant (version 1.5.3.30) using standard settings with a 1% FDR filter at the PSM, protein or site level. Unprocessed spectra for MaxQuant's rank 1 PSMs (from msms.txt) were extracted from the RAW files using Thermo Fisher's RawFileReader (http://planetorbitrap.com/rawfilereader), and b- and y-ions were annotated for fragment charges 1 to 3. The initial data included all PSMs for the same peptide and was restricted to peptide lengths of 7 to 30 amino acids, precursor charges <7 and Andromeda scores >40. NCE values of all runs were calibrated as described, and spectra were transformed into a tensor format compatible with the machine learning models.
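The encoder-decoder layout described above can be summarized with a short Keras sketch. This is a minimal illustration rather than the deposited model: the layer widths, latent size, element-wise multiplication and length-29 decoder follow the text, whereas the embedding size, dropout rates, input encoding, activations and the simple attention pooling are assumptions; the keras model file on GitHub/Zenodo remains the authoritative definition.

```python
# Minimal sketch of a Prosit-style encoder-decoder in tf.keras (assumed details noted inline).
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN = 30      # peptides of 7-30 amino acids, padded to length 30
VOCAB = 22        # assumed token vocabulary for amino acids and modifications
LATENT = 512      # size of the latent peptide vector

def attention_pool(seq):
    """Collapse a (batch, time, features) sequence to (batch, features) with learned weights."""
    scores = layers.Dense(1, activation="tanh")(seq)
    weights = layers.Softmax(axis=1)(scores)
    return layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([seq, weights])

# Peptide encoder: bi-directional GRU -> GRU -> attention, all with dropout
peptide = layers.Input(shape=(MAX_LEN,), name="peptide")
x = layers.Embedding(VOCAB, 32)(peptide)                 # embedding size is an assumption
x = layers.Bidirectional(layers.GRU(LATENT, return_sequences=True))(x)
x = layers.Dropout(0.3)(x)
x = layers.GRU(LATENT, return_sequences=True)(x)
x = layers.Dropout(0.3)(x)
peptide_latent = attention_pool(x)                       # (batch, 512)

# Precursor charge + NCE encoder: a single dense layer of the same output size
meta = layers.Input(shape=(7,), name="charge_and_nce")   # e.g. one-hot charge (6) + NCE
meta_latent = layers.Dense(LATENT, activation="relu")(meta)

# Decorate the latent peptide vector with charge/NCE by element-wise multiplication
decorated = layers.Multiply()([peptide_latent, meta_latent])

# Decoder: repeat to the 29 possible fragmentation positions, one-layer bi-directional
# GRU with dropout (the decoder attention described in the text is omitted in this sketch)
d = layers.RepeatVector(29)(decorated)
d = layers.Bidirectional(layers.GRU(LATENT, return_sequences=True))(d)
d = layers.Dropout(0.3)(d)
# 6 intensities per position: b- and y-ions at fragment charges 1-3
intensities = layers.TimeDistributed(layers.Dense(6))(d)
intensities = layers.Flatten(name="intensities")(intensities)  # 29 * 6 = 174 values

model = Model(inputs=[peptide, meta], outputs=intensities)
```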
HCD training data were split into three distinct sets with each peptide sequence included in only one of the three: Training (70%, ~593k modified peptides, ~9.9M PSMs), Test (20%, ~170k modified peptides, ~2.8M PSMs) and Holdout (10%, ~84k modified peptides, ~1.4M PSMs). CID training data were split in the same way: Training (70%, ~500k modified peptides, ~2.9M PSMs), Test (20%, ~142k modified peptides, ~0.8M PSMs) and Holdout (10%, ~71k modified peptides, ~0.4M PSMs). The model was trained and optimized on the Training set. The Test set was used to control for overfitting with early stopping. The Holdout set was used to evaluate the model's generalization and potential biases. Normalized spectral contrast loss7 was used as the loss function. We used the Adam optimizer with a cyclic learning rate (CLR) algorithm35. During training, the learning rate cycles between a constant lower limit (0.0000001) and an upper limit (initially 0.001; after restart for HCD, 0.0001), which is continuously scaled by a factor of 0.95 every 8 epochs. Models were trained on Nvidia Titan Xp GPUs with 512 samples per batch for 195 epochs. During training, out-of-range or out-of-charge fragment ions were masked and not considered in further analysis; for example, for a peptide of length 10, y- and b-ions 10 to 29, as well as fragment ions with a charge higher than that of their precursor, are masked. For training, the data was restricted to the top 3 highest-scoring spectra for each combination of peptide sequence, modification status, precursor charge, fragmentation method, fragmentation energy and mass analyzer.
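As an illustration of the training objective, the normalized spectral contrast loss can be written as a spectral angle between the L2-normalized observed and predicted intensity vectors, with masked fragment ions excluded from the comparison. The sketch below is an approximation under assumptions, not the published implementation: it assumes masked (out-of-range or out-of-charge) positions are encoded as -1 in the target vector, and the cyclic learning rate schedule is not shown; the exact loss ships with the public Prosit code.

```python
# Sketch of a masked normalized spectral contrast (spectral angle) loss, assuming
# masked fragment positions are encoded as -1 in y_true.
import math
import tensorflow as tf

def masked_spectral_distance(y_true, y_pred):
    epsilon = tf.keras.backend.epsilon()
    mask = tf.cast(y_true >= 0, y_pred.dtype)        # 1 for valid fragment ions, 0 for masked
    true_masked = y_true * mask
    pred_masked = y_pred * mask
    true_norm = tf.nn.l2_normalize(true_masked, axis=-1)
    pred_norm = tf.nn.l2_normalize(pred_masked, axis=-1)
    cos_sim = tf.reduce_sum(true_norm * pred_norm, axis=-1)
    arccos = tf.acos(tf.clip_by_value(cos_sim, -1.0 + epsilon, 1.0 - epsilon))
    return 2.0 * arccos / math.pi                    # 0 = identical spectra, 1 = orthogonal

# Example wiring to the Adam optimizer for the model sketched above; the cyclic
# learning rate described in the text would be supplied via a schedule or callback.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=masked_spectral_distance)
```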