For each sequence in each dataset, we constructed the pseudo-amino acid composition [38] with λ = 55 and weight = 0.7, using established hydrophobicity [39], hydrophilicity [40] and side-chain mass [41] values. We also determined the 400-character dipeptide composition of each sequence in the dataset. The dipeptide composition [f(x,y)] of any combination of two amino acid residues, represented as x and y, was computed for each sequence as
$$f(x,y) = \frac{1}{n-1} \sum_{i=1}^{n-1} \delta(a_i, x)\,\delta(a_{i+1}, y)$$
where n is the length of the sequence, a_i represents the amino acid residue at position i, and δ(a, b) equals 1 if a and b are the same residue and 0 otherwise. In total, each sequence was represented as a vector of 475 features, comprising the amino acid composition (20 characters), the additional components of Chou’s pseudo-amino acid composition (λ = 55) and the dipeptide frequency (400 characters).
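The composition features described above can be illustrated with a minimal Python sketch (the authors worked in R; this is not their code). It assumes the 20 standard residues in alphabetical one-letter order, that the dipeptide fraction is taken over the n − 1 overlapping residue pairs, and that non-standard residues are simply skipped. The names `AMINO_ACIDS`, `amino_acid_composition` and `dipeptide_composition` are hypothetical.

```python
from itertools import product

# Assumption: the 20 standard residues, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs (x, y).
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Return the 20 residue frequencies of a protein sequence."""
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    return [seq.count(residue) / n for residue in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return the 400 dipeptide frequencies f(x, y) of a protein sequence.

    Assumption: counts are normalised by the n - 1 overlapping pairs;
    pairs containing a non-standard residue are ignored.
    """
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [counts[d] / (n - 1) for d in DIPEPTIDES]
```

For a toy sequence such as `"ACAC"`, the three overlapping pairs are AC, CA and AC, so the AC entry is 2/3 and the CA entry is 1/3.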
The SVMs were constructed using the LIBSVM [42] interface provided by the e1071 package [44] in R v.3.2.0 [43]. For comparative purposes, five models were constructed using a radial basis function kernel, each with a different set of features and with kernel parameters tuned by five-fold cross-validation. The first model, named ‘Amino’, was built using the 20 amino acid frequencies as features. The second model, called ‘Chemistry’, was built using the 55 features based on hydrophobicity, hydrophilicity and side-chain mass. The third model, ‘Chou’, was built using Chou’s pseudo-amino acid composition, which combines the 20 amino acid and 55 chemical information features. The fourth model, named ‘Dipeptide’, was built using the 400 dipeptide composition features. The last model, ‘Classifier’, was built using all 475 features.
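As a rough illustration of how the five feature sets relate to one another, the sketch below slices a 475-feature vector and generates five-fold splits in plain Python. It is not the authors' R code. The column order (20 amino acid, then 55 chemistry, then 400 dipeptide features) is an assumption, and `FEATURE_SETS`, `select_features` and `k_fold_splits` are hypothetical names; the SVM fitting and parameter tuning themselves are not shown.

```python
import random

# Assumption: features are stored as 20 amino acid, 55 chemistry,
# then 400 dipeptide columns. The source does not state the order.
FEATURE_SETS = {
    "Amino": range(0, 20),
    "Chemistry": range(20, 75),
    "Chou": range(0, 75),          # amino acid + chemistry features
    "Dipeptide": range(75, 475),
    "Classifier": range(0, 475),   # all features
}


def select_features(vector, model_name):
    """Return the subset of a 475-feature vector used by one model."""
    if len(vector) != 475:
        raise ValueError("expected a vector of 475 features")
    return [vector[i] for i in FEATURE_SETS[model_name]]


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

Each model would then be trained on `select_features(...)` of the training rows in a split and scored on the held-out fold.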
The classification models were validated using five-fold cross-validation and assessed against the classifications of known ion channel and aquaporin sequences encoded in the human and C. elegans genomes. Receiver operating characteristic (ROC) analysis [45] was conducted to evaluate the performance of each model. For comparative purposes, we also assessed the test dataset using other probabilistic classification methods, including random forest, classification via logistic regression and a prior classifier, applied using established methods [46–48]. Confusion matrices were then constructed for the best-performing classification models to further evaluate and compare their performance. For the final model, the average classification probability value was computed for each subfamily in the test dataset; these probability values were used to classify the ion channels predicted from the parasite dataset.
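The confusion matrix and the per-subfamily average probability could be computed along the following lines. This is a minimal Python sketch rather than the authors' procedure. It assumes that each test sequence has a true label, a predicted label and the probability of the predicted label, and that probabilities are averaged over sequences grouped by predicted subfamily (the source does not say which grouping was used). The function names are hypothetical.

```python
from collections import defaultdict


def confusion_matrix(true_labels, predicted_labels):
    """Return (labels, matrix) with rows = true class, columns = predicted."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, predicted in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[predicted]] += 1
    return labels, matrix


def mean_probability_by_subfamily(predicted_labels, probabilities):
    """Average the classification probability within each subfamily.

    Assumption: sequences are grouped by their predicted subfamily.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, probability in zip(predicted_labels, probabilities):
        totals[label] += probability
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}
```

The resulting dictionary maps each subfamily to one average value, which can be looked up when classifying new sequences.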
Proteins were assigned to categories based on their SVM probability values. Category A proteins had probability values greater than or equal to the subfamily probability threshold. Category B proteins had probability values between 50 % of the subfamily probability threshold and the threshold itself. Category C proteins had probability values less than 50 % of the subfamily probability threshold. A confidence ranking was then given to each ion channel classification. High confidence classifications included channels in Category A (Groups 1 to 4) and Category B (Groups 1 and 2), which were annotated by SVM subfamily classification. Medium confidence classifications included channels in Category B (Groups 3 and 4), which were annotated by SVM subfamily classification and designated with the suffix “-like” (e.g. GABA-like ion channel). Low confidence classifications included all proteins in Category C (Groups 1 to 4), which represented ion channel-like proteins that could not be confidently assigned to a particular family or subfamily.
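The categorisation rules above can be written as a short Python sketch. It assumes that a probability of exactly 50 % of the threshold falls in Category B, and that the group number (1 to 4) of each protein is already known from elsewhere in the workflow. The function names `assign_category` and `confidence_ranking` are hypothetical.

```python
def assign_category(probability, threshold):
    """Assign Category A, B or C from an SVM probability value."""
    if probability >= threshold:
        return "A"
    # Assumption: the lower boundary (exactly 50 % of the threshold)
    # is included in Category B.
    if probability >= 0.5 * threshold:
        return "B"
    return "C"


def confidence_ranking(category, group):
    """Map a category (A to C) and group (1 to 4) to a confidence level."""
    if group not in (1, 2, 3, 4):
        raise ValueError("group must be 1, 2, 3 or 4")
    if category == "A":
        return "high"
    if category == "B":
        # Groups 1 and 2 are high confidence; Groups 3 and 4 are
        # medium confidence and receive the "-like" suffix.
        return "high" if group in (1, 2) else "medium"
    if category == "C":
        return "low"
    raise ValueError("category must be 'A', 'B' or 'C'")
```

For example, with a subfamily threshold of 0.5, a protein with probability 0.375 falls in Category B; in Group 3 it would be ranked medium confidence and annotated with the “-like” suffix.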