Training and testing

Shuangxi Ji; Tuğçe Oruç; Liam Mead; Muhammad Fayyaz Rehman; Christopher Morton Thomas; Sam Butterworth; Peter James Winn

This method is extracted from research article: PLoS One, Jan 2019

DeepCDpred: Inter-residue distance and contact prediction for improved prediction of protein structure

DOI: 10.1371/journal.pone.0205214

The main test set consists of 108 of the 150 proteins in the MetaPSICOV test set, so we can be sure that the test proteins are not in the training set of either MetaPSICOV or DeepCDpred. Additionally, these 108 proteins are not listed in the training set of RaptorX [11]. A chain was removed from the MetaPSICOV test set when a sequence with >25% identity to it was found in the training set of SPIDER2 [21], because we used SPIDER2 for secondary structure prediction, which is included in our feature vector and used in subsequent structural modelling. This gave 108 protein chains, ranging from 52 to 266 amino acids in length, with 25% or less sequence identity to each other. Based on annotation in the PDB, 87 chains are monomers in the biological unit and 21 are from multimeric complexes; one chain is a membrane protein. The PDB IDs of these 108 protein chains are listed in S3 Table.
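The SPIDER2 exclusion step can be illustrated with a minimal Python sketch. It assumes that the highest percent identity between each test chain and any SPIDER2 training sequence has already been computed by an alignment tool. The function name, the input dictionary and the chain labels are hypothetical and are not taken from the original study.

```python
from typing import Dict, List


def filter_test_chains(
    max_identity_to_training: Dict[str, float],
    threshold: float = 25.0,
) -> List[str]:
    """Return chains whose best identity to any training sequence is <= threshold.

    `max_identity_to_training` maps a chain label to its highest percent
    sequence identity against the training set (assumed precomputed).
    A chain is removed only when the identity is strictly greater than
    the threshold, matching the ">25%" rule in the text.
    """
    return sorted(
        chain
        for chain, identity in max_identity_to_training.items()
        if identity <= threshold
    )


# Hypothetical identities, for illustration only.
example = {"chainA": 12.5, "chainB": 25.0, "chainC": 31.0}
kept = filter_test_chains(example)
```

With these made-up values, `chainC` is the only chain removed, because its identity exceeds 25%.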

Even though the maximum sequence identity between the training and test sets is 25%, some of the proteins in our test set share topology classes (and homologous superfamily classes) with training set proteins, based on the CATH classification [22]. To test whether our trained model is biased towards predicting contacts and distances for structures with training set topologies, we generated a second test set, as described in the supplementary material. It contains 50 proteins, listed in S4 Table, none of which has the same topology as any training set protein of DeepCDpred, RaptorX or MetaPSICOV.
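The idea of a topologically independent test set can be sketched in Python as follows. The actual procedure is given in the supplementary material and may differ. This sketch assumes that each protein is annotated with four-level CATH codes (class.architecture.topology.homologous superfamily), so that the topology is the first three fields. The function names, protein labels and example codes are hypothetical.

```python
from typing import Dict, List, Set


def topology(cath_code: str) -> str:
    """Truncate a CATH code 'C.A.T.H' to its topology level 'C.A.T'."""
    return ".".join(cath_code.split(".")[:3])


def topologically_independent(
    candidates: Dict[str, Set[str]],
    training_codes: Set[str],
) -> List[str]:
    """Return candidates that share no CATH topology with the training set.

    `candidates` maps a protein label to the CATH codes of its domains;
    `training_codes` is the pooled set of CATH codes of all training proteins.
    """
    training_topologies = {topology(code) for code in training_codes}
    return sorted(
        name
        for name, codes in candidates.items()
        if not ({topology(code) for code in codes} & training_topologies)
    )


# Hypothetical annotations, for illustration only.
training = {"3.40.50.720"}
candidate_set = {"protX": {"3.40.50.300"}, "protY": {"1.10.8.10"}}
independent = topologically_independent(candidate_set, training)
```

Here `protX` is rejected because it shares topology 3.40.50 with the training set even though its homologous superfamily differs, while `protY` is retained.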

The training set was chosen from the PISCES set [23], downloaded in November 2016. The selected training set protein chains and the 50 topologically independent test protein structures all have a resolution of 2 Å or better, a maximum R value of 0.25, no more than 25% pairwise sequence identity to each other or to the test set, and fewer than 400 amino acids. Of the structures meeting these criteria, 1701 chains were arbitrarily selected.
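The structure-quality criteria above can be expressed as a simple filter, sketched here in Python. The `ChainRecord` type, its field names and the example values are hypothetical. The sketch covers only the resolution, R value and length thresholds; the 25% pairwise identity culling and the arbitrary choice of 1701 chains are not reproduced.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ChainRecord:
    chain_id: str      # hypothetical label
    resolution: float  # in angstroms; lower is better
    r_value: float     # crystallographic R value
    length: int        # number of amino acids


def passes_quality_filters(
    record: ChainRecord,
    max_resolution: float = 2.0,
    max_r_value: float = 0.25,
    max_length: int = 400,
) -> bool:
    """Apply the thresholds from the text: <= 2 A, R <= 0.25, < 400 residues."""
    return (
        record.resolution <= max_resolution
        and record.r_value <= max_r_value
        and record.length < max_length
    )


def select_candidates(records: Iterable[ChainRecord]) -> List[ChainRecord]:
    """Keep only the chains that satisfy all three thresholds."""
    return [record for record in records if passes_quality_filters(record)]


# Hypothetical records, for illustration only.
records = [
    ChainRecord("good", 1.8, 0.20, 250),
    ChainRecord("low_res", 2.5, 0.20, 250),
    ChainRecord("too_long", 1.5, 0.18, 400),
]
selected_ids = [r.chain_id for r in select_candidates(records)]
```

Only the first record passes: the second fails the resolution threshold and the third is not shorter than 400 residues.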

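The neural network training protocols are described in the supplementary methods. The accuracy of the test set predictions was calculated as

$$\text{Accuracy} = \frac{TP}{TP + FP}$$

where TP and FP are the numbers of true positive and false positive predictions, respectively.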

Because we make no negative predictions, i.e. we only predict residues to be in the distance bin, true negatives and false negatives are both zero, and the standard formulae for accuracy and precision (PPV) become identical.
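The equivalence of the two measures can be checked numerically with a short Python sketch. It uses the standard definitions, accuracy = (TP + TN) / (TP + TN + FP + FN) and precision = TP / (TP + FP). The function names and the example counts are illustrative and do not come from the study.

```python
def standard_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Standard accuracy: correct predictions over all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions were made")
    return (tp + tn) / total


def precision(tp: int, fp: int) -> float:
    """Precision (PPV): true positives over all positive predictions."""
    if tp + fp == 0:
        raise ValueError("no positive predictions were made")
    return tp / (tp + fp)


# Illustrative counts: 30 correct and 10 incorrect positive predictions.
# With no negative predictions, TN and FN are both zero.
acc = standard_accuracy(tp=30, tn=0, fp=10, fn=0)
ppv = precision(tp=30, fp=10)
```

With TN and FN set to zero, both functions return the same value for any TP and FP, which is why the two terms can be used interchangeably here.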

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
