The TargetP 2.0 model is described in Fig 2. The model consists of two key components, a BiRNN with LSTM cells and a multi-attention mechanism (Lin et al, 2017 Preprint), which are used to predict both the type of peptide and the position of the CS.
The input to the model is the first 200 amino acids of a protein. This threshold was chosen based on the maximum length of known transit peptides, which is 162 amino acids (Stefely et al, 2015). The amino acids in the protein are encoded using the BLOSUM62 substitution matrix.
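The input encoding can be sketched as follows in Python with Biopython and NumPy. This is a minimal illustration, not the authors' preprocessing code: the residue ordering, the 20-dimensional BLOSUM62 rows, and the zero-padding of short sequences are assumptions made here for clarity.

```python
import numpy as np
from Bio.Align import substitution_matrices  # Biopython

BLOSUM62 = substitution_matrices.load("BLOSUM62")
AA = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids (illustrative ordering)

def encode(seq: str, max_len: int = 200) -> np.ndarray:
    """Encode a protein as a (200, 20) matrix of BLOSUM62 rows, truncated/zero-padded to 200 residues."""
    x = np.zeros((max_len, len(AA)), dtype=np.float32)
    for i, aa in enumerate(seq[:max_len]):
        if aa in AA:
            x[i] = [BLOSUM62[aa, b] for b in AA]
    return x

features = encode("MKTAYIAKQR")  # hypothetical N-terminal fragment -> shape (200, 20)
```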
We first describe the model at a high level and then give more details on each of the layers. The first layer of the model is a fully connected layer with 32 hidden units that performs a feature transformation of each amino acid input feature. The following layer is the BiLSTM with 256 hidden units in both the forward and backward directions. The first hidden state of the BiLSTM is a vector containing the group information, which denotes whether the protein is a plant or non-plant protein. The 512-dimensional concatenated output of the BiLSTM is then used to calculate the multi-attention matrix, similar to the attention mechanisms applied in machine translation (Bahdanau et al, 2014 Preprint; Luong et al, 2015 Preprint). The attention size is 144 units, and the attention matrix has 13 output vectors. Of these 13 attention vectors, four are used to predict the CS positions for SP, mTP, cTP, and luTP. The attention matrix is further utilised to encode the whole sequence into a context matrix. This context matrix of size 512 × 13 is processed by a fully connected layer with 256 units to summarise it into a vector. Finally, this vector is fed to the output layer with 5 units and softmax activation.
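As a rough guide to the layer dimensions listed above, the following PyTorch sketch defines layers with the stated sizes. The class, layer names, and 20-dimensional input are assumptions for illustration, not the authors' implementation, and the plant/non-plant group vector used to initialise the BiLSTM hidden state is omitted for brevity. The forward computations of the individual layers are sketched after the corresponding equations below.

```python
import torch.nn as nn

class TargetP2Sketch(nn.Module):
    """Illustrative layer definitions matching the dimensions described in the text."""
    def __init__(self, n_feat: int = 20, n_classes: int = 5):
        super().__init__()
        self.embed = nn.Linear(n_feat, 32)            # per-residue feature transformation, 32 hidden units
        self.bilstm = nn.LSTM(32, 256, bidirectional=True,
                              batch_first=True)       # 256 units per direction -> 512-dim outputs
        self.att_a = nn.Linear(512, 144)              # attention size 144 (W_a, b_a)
        self.att_b = nn.Linear(144, 13, bias=False)   # 13 attention vectors (W_b)
        self.summarise = nn.Linear(13 * 512, 256)     # summarises the 512 x 13 context matrix into a vector
        self.out = nn.Linear(256, n_classes)          # 5 peptide classes, softmax applied at the output

model = TargetP2Sketch()
```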
We train a model f with learnable parameters θ that predicts the type of peptide and the position of the corresponding CS,

(y, y′) = f_θ(X)

where y is the predicted type of peptide, y′ the predicted CS position, and X the protein sequence. Here, y is a vector of size equal to the number of classes C, five in this case, and y′ is a vector of size equal to the length of the sequence L, which can be up to 200. The parameters θ are optimised using Adam, an extension of stochastic gradient descent, with a cross-entropy loss for both the peptide-type and the CS prediction. The two losses are then averaged. The only regularisation technique used is dropout between the different layers.
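A minimal sketch of this training objective, using hypothetical logit tensors and PyTorch's cross-entropy; the equal weighting of the two losses follows the averaging described above, everything else (batch size, tensor names) is placeholder.

```python
import torch
import torch.nn.functional as F

# Hypothetical outputs for a batch of 8 proteins: peptide-type scores and CS-position scores.
type_logits = torch.randn(8, 5, requires_grad=True)     # 5 peptide classes
cs_logits = torch.randn(8, 200, requires_grad=True)     # up to 200 possible CS positions
type_true = torch.randint(0, 5, (8,))
cs_true = torch.randint(0, 200, (8,))

loss = 0.5 * (F.cross_entropy(type_logits, type_true) + F.cross_entropy(cs_logits, cs_true))
loss.backward()   # the parameters would then be updated with torch.optim.Adam
# Dropout (nn.Dropout) between layers is the only regularisation mentioned in the text.
```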
The network has three main types of layers: fully connected, RNN with LSTM cell, and multi-attention layer. The first fully connected layer c applies a feature transformation:

c_t = x_t·W + b

where x_t is an amino acid at position t in the sequence and W and b are the learnable weights and biases. The first layer is followed by a BiRNN that utilises an LSTM cell to capture the context around each amino acid in the sequence. The RNN applies the same set of weights to each position t:

h_t^→ = LSTM(c_t, h_{t−1}^→),  h_t^← = LSTM(c_t, h_{t+1}^←)

where h_t^→ and h_t^← are the hidden states of the RNN at position t for the forward and backward directions, respectively. The hidden states are concatenated into h_t = [h_t^→; h_t^←].
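The bidirectional pass can be illustrated with PyTorch's nn.LSTM, which returns the concatenated forward and backward hidden states at every position. This is a shape-level sketch with placeholder tensors, not the authors' code.

```python
import torch
import torch.nn as nn

L, d_in = 200, 32                      # sequence length, feature size after the first dense layer
c = torch.randn(1, L, d_in)            # one protein, batch-first
bilstm = nn.LSTM(input_size=d_in, hidden_size=256, bidirectional=True, batch_first=True)
h, _ = bilstm(c)                       # h: (1, 200, 512) = [forward ; backward] states at each position
```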
The last part of the network is a multi-attention mechanism. Here, we calculate multiple attention vectors A from the LSTM hidden states, instead of just one single attention vector a. The attention matrix is then used to create multiple fixed-size representations of the input sequence, each with a different focus on the relevant parts of the sequence. The attention matrix is calculated as follows:

A = softmax(W_b·tanh(W_a·H^T + b_a))

where H is the L × 512 matrix of concatenated hidden states, W_a and W_b are weight matrices, and b_a is the bias of the attention function. The advantage of having multiple attention vectors is that some of them can be used to predict the position of the CS, as they are vectors of size equal to the sequence length L summing to 1. Therefore, 4 of the 13 attention vectors that the model uses are used in the prediction of the SP, mTP, cTP, and luTP CS:

y′_k = A_k,  k ∈ {SP, mTP, cTP, luTP}

where A_k denotes the attention vector assigned to peptide type k.
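A shape-level sketch of this attention computation and of reading CS distributions off four of the attention vectors. The softmax axis, the assignment of the first four vectors to SP, mTP, cTP, and luTP, and the argmax read-out are assumptions for illustration.

```python
import torch
import torch.nn as nn

L, d, r, k = 200, 512, 144, 13          # sequence length, hidden size, attention size, attention vectors
H = torch.randn(1, L, d)                # BiLSTM hidden states
W_a = nn.Linear(d, r)                   # W_a and b_a in the text
W_b = nn.Linear(r, k, bias=False)       # W_b in the text

A = torch.softmax(W_b(torch.tanh(W_a(H))).transpose(1, 2), dim=-1)   # (1, 13, 200); each row sums to 1 over positions
cs_distributions = A[:, :4, :]          # interpreted as CS position distributions for SP, mTP, cTP, luTP
cs_positions = cs_distributions.argmax(dim=-1)                        # most likely CS position per peptide type
```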
To encode the sequence of hidden states into a fixed-size matrix, the hidden states are multiplied by the attention matrix and summed up:

E = A·H

where E is the encoded representation of the protein sequence. E holds a total of 13 different representations of the protein sequence; therefore, this matrix needs to be summarised into a vector. This is done by a final feed-forward layer, which converts E into a representation vector e. This vector is then used to calculate the output layer of the network, which predicts the type of peptide y:

y = softmax(W_o·e + b_o)

where W_o and b_o are the weights and bias of the output layer.
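The encoding and classification steps, sketched with hypothetical tensors. Flattening E before the summarising dense layer and the ReLU activation are plausible readings of the text, not confirmed details.

```python
import torch
import torch.nn as nn

L, d, k = 200, 512, 13
A = torch.softmax(torch.randn(1, k, L), dim=-1)    # attention matrix (stand-in values)
H = torch.randn(1, L, d)                           # BiLSTM hidden states
E = torch.bmm(A, H)                                # (1, 13, 512): 13 fixed-size representations of the sequence

summarise = nn.Linear(k * d, 256)                  # feed-forward layer converting E into the vector e
out = nn.Linear(256, 5)                            # output layer over the 5 peptide classes
e = torch.relu(summarise(E.flatten(1)))            # representation vector e (activation is an assumption)
y = torch.softmax(out(e), dim=-1)                  # peptide-type probabilities
```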
Both outputs of the network, y and y′, are trained together. The exception is proteins belonging to the negative set, that is, noTPs, which lack a CS; for these, there is no CS error to back-propagate.
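One common way to realise this in PyTorch is to mark negative-set proteins with an ignore index so that their CS term contributes no gradient. This is an illustrative sketch, not necessarily how the authors implemented the masking.

```python
import torch
import torch.nn.functional as F

cs_logits = torch.randn(4, 200, requires_grad=True)
cs_true = torch.tensor([37, 12, -100, 55])          # -100 marks a noTP protein without a CS
cs_loss = F.cross_entropy(cs_logits, cs_true, ignore_index=-100)   # masked example adds no error
cs_loss.backward()
```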
The model was trained and optimised using fivefold nested cross-validation. The four inner subsets were used to optimise the model: three for training and one for validation, to identify the best set of hyper-parameters. After optimisation, the fifth subset, which was kept out of the optimisation, was used to assess test set performance. This procedure was repeated so that each of the five subsets served as the test set once. The advantage of this approach is that we obtain an unbiased test set performance on the whole dataset, at the expense of having to train 5 × 4 = 20 models.
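The nested cross-validation loop can be sketched with scikit-learn's KFold. The data are placeholders, and homology-aware partitioning and the actual training calls are omitted; this only illustrates how the 5 × 4 = 20 train/validation/test combinations arise.

```python
import numpy as np
from sklearn.model_selection import KFold

proteins = np.arange(1000)                              # placeholder protein indices
outer = KFold(n_splits=5, shuffle=True, random_state=0)

for trainval_idx, test_idx in outer.split(proteins):    # each of the 5 subsets serves as test set once
    inner = KFold(n_splits=4, shuffle=True, random_state=0)
    for tr, va in inner.split(trainval_idx):            # 4 inner splits: 3 subsets train, 1 validation
        train_idx, val_idx = trainval_idx[tr], trainval_idx[va]
        # train a model on train_idx, tune hyper-parameters on val_idx
    # evaluate the selected models on test_idx -> 5 x 4 = 20 trained models in total
```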
Different hyper-parameters were tested to find the best model, such as the number of hidden units in the LSTM, attention, and fully connected layers; the number of attention vectors; the learning rate; and the dropout rate. We also experimented with a convolutional neural network as the initial layer, but the best results were achieved using a filter size of 1, which is equivalent to a fully connected layer along the feature dimension.
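The equivalence between a convolution with filter size 1 and a position-wise fully connected layer can be checked directly; this small PyTorch demonstration uses arbitrary dimensions.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 200, 20)                               # (batch, length, features)
dense = nn.Linear(20, 32)
conv = nn.Conv1d(20, 32, kernel_size=1)
conv.weight.data.copy_(dense.weight.data.unsqueeze(-1))   # share the same parameters
conv.bias.data.copy_(dense.bias.data)

same = torch.allclose(dense(x), conv(x.transpose(1, 2)).transpose(1, 2), atol=1e-6)
print(same)   # True: a size-1 filter acts along the feature dimension only, like a dense layer
```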