Construction and testing of support vector machines (SVMs)

Bahiyah Nor; Neil D. Young; Pasi K. Korhonen; Ross S. Hall; Patrick Tan; Andrew Lonie; Robin B. Gasser

This method is extracted from research article: Parasit Vectors, Mar 2016

Pipeline for the identification and classification of ion channels in parasitic flatworms

DOI: 10.1186/s13071-016-1428-2

For each sequence in each dataset, we constructed the pseudo-amino acid composition [38] with λ = 55 and weight = 0.7, using established hydrophobicity values [39], hydrophilicity values [40] and side-chain mass values [41]. We also determined the 400-character dipeptide composition of each sequence in the dataset. For each sequence, the dipeptide composition [f(x,y)] of any ordered pair of amino acid residues, x and y, was computed as

$$f(x,y) = \frac{\left|\{\, i : 1 \le i \le n-1,\ a_i = x,\ a_{i+1} = y \,\}\right|}{n-1}$$

where n is the length of the sequence and a_i represents the amino acid residue at position i. In total, each sequence was represented as a vector of 475 features, comprising the amino acid composition (20 characters), the 55 additional components of Chou’s pseudo-amino acid composition (λ = 55) and the dipeptide frequencies (400 characters).
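The dipeptide composition described above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: it assumes that each ordered pair is counted over the n − 1 adjacent positions of the sequence and divided by n − 1, that the 20 standard residues are ordered alphabetically by one-letter code, and that pairs containing non-standard residues are skipped. The names `AMINO_ACIDS`, `DIPEPTIDES` and `dipeptide_composition`, and the example sequence, are hypothetical.

```python
from itertools import product

# 20 standard residues, alphabetical by one-letter code (ordering is an assumption)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, e.g. "AA", "AC", ..., "YY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400 dipeptide frequencies of a protein sequence.

    Each frequency is the number of positions i where residues i and i+1
    form the pair, divided by n - 1 (assumed normalisation).
    """
    seq = sequence.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs with non-standard residues such as X
            counts[pair] += 1
    return [counts[d] / (n - 1) for d in DIPEPTIDES]


# Toy sequence, for illustration only
features = dipeptide_composition("MKVLAAGIVKV")
kv_frequency = features[DIPEPTIDES.index("KV")]  # "KV" occurs twice in 10 pairs
```

For a sequence made up only of standard residues, the 400 frequencies sum to 1, which is a convenient sanity check before the vector is combined with the other feature blocks.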

The SVMs were constructed using the LIBSVM [42] interface provided by the e1071 package [44] in R v.3.2.0 [43]. For comparative purposes, five models were constructed using a radial basis kernel, each with a different set of features and with kernel parameters that were tuned using five-fold cross-validation. The first model, named ‘Amino’, was built using 20 amino acid frequencies as features; the second model, called ‘Chemistry’, was built using 55 features based on hydrophobicity, hydrophilicity and side-chain mass. The third model, ‘Chou’, was built using Chou’s pseudo-amino acid composition by combining the 20 amino acid features and the 55 chemical information features. The fourth model, named ‘Dipeptide’, was built using 400 dipeptide composition features. The last model, ‘Classifier’, was built using all 475 features.
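To make the five feature sets concrete, the following Python sketch slices a 475-feature vector into the subsets used by each model and generates five-fold train/validation index splits. It does not train an SVM and is not the authors' R code. The column order (amino acid, then chemistry, then dipeptide) is an assumption, since the text gives the block sizes but not their order, and the unshuffled, unstratified fold assignment is a simplification. All function and variable names are hypothetical.

```python
# Block sizes taken from the text; their order in the vector is assumed.
N_AMINO, N_CHEMISTRY, N_DIPEPTIDE = 20, 55, 400
N_TOTAL = N_AMINO + N_CHEMISTRY + N_DIPEPTIDE  # 475

FEATURE_SETS = {
    "Amino": slice(0, N_AMINO),
    "Chemistry": slice(N_AMINO, N_AMINO + N_CHEMISTRY),
    "Chou": slice(0, N_AMINO + N_CHEMISTRY),
    "Dipeptide": slice(N_AMINO + N_CHEMISTRY, N_TOTAL),
    "Classifier": slice(0, N_TOTAL),
}


def select_features(vector, model_name):
    """Return the columns of a 475-feature vector used by the named model."""
    if len(vector) != N_TOTAL:
        raise ValueError(f"expected {N_TOTAL} features, got {len(vector)}")
    return vector[FEATURE_SETS[model_name]]


def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation.

    Samples are dealt to folds in turn; no shuffling or stratification.
    """
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for held_out in range(k):
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, folds[held_out]


# Placeholder vector whose values are just column indices
example_vector = list(range(N_TOTAL))
subset_sizes = {name: len(select_features(example_vector, name)) for name in FEATURE_SETS}
splits = list(k_fold_indices(12, k=5))
```

In a tuning loop, each candidate pair of radial basis kernel parameters would be scored on the five validation folds and the best-scoring pair retained; in practice the folds would usually be shuffled and stratified by class.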

The classification models were validated using five-fold cross-validation and assessed against the classifications of known ion channel and aquaporin sequences encoded in the human and C. elegans genomes. Receiver operating characteristic (ROC) analysis [45] was conducted to evaluate the performance of each model. For comparative purposes, we also assessed the test dataset using other probabilistic classification methods, including random forest, classification via logistic regression and a prior classifier, conducted using established methods [46–48]. Confusion matrices were then constructed for the best-performing classification models to further evaluate and compare their performance. For the final model, the average classification probability values for individual subfamilies in the test dataset were computed; these probability values were used to classify the ion channels predicted from the parasite dataset.
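The two evaluation tools mentioned above can be illustrated with a short Python sketch. It is not the authors' implementation: the area under the ROC curve is computed here through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half), and the subfamily labels and scores are invented toy values.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Return (labels, matrix) with rows = true class, columns = predicted class."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


def roc_auc(is_positive, scores):
    """Area under the ROC curve for one class versus the rest."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data with hypothetical subfamily labels
true = ["K", "Na", "K", "Ca", "Na", "K"]
pred = ["K", "Na", "Na", "Ca", "Na", "K"]
labels, matrix = confusion_matrix(true, pred)

# Toy probabilities for one class against the rest
auc = roc_auc([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
```

The diagonal of the matrix holds the correctly classified sequences for each subfamily, so off-diagonal cells show which subfamilies a model tends to confuse.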

Proteins were assigned to categories based on their SVM probability values. Category A proteins had probability values greater than or equal to the subfamily probability threshold. Category B proteins had probability values of at least 50 % of the subfamily probability threshold but below the threshold itself. Category C proteins had probability values of less than 50 % of the subfamily probability threshold. A confidence ranking was then given to our ion channel classifications. High confidence classifications included channels in Category A (Groups 1 to 4) and Category B (Groups 1 and 2), which were annotated by SVM subfamily classification. Medium confidence classifications included channels in Category B (Groups 3 and 4), which were annotated by SVM subfamily classification and designated with the suffix “-like” (e.g. GABA-like ion channel). Low confidence classifications included all proteins in Category C (Groups 1 to 4), which represented ion channel-like proteins that could not be confidently assigned to a particular family or subfamily.
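The category and confidence rules above translate directly into a small decision function. The Python sketch below is illustrative: it treats a probability exactly equal to 50 % of the threshold as Category B (the text defines Category C as strictly below that value), and it takes the group number as a given input because the definition of Groups 1 to 4 is not part of this section. The function names and the threshold value in the example are hypothetical.

```python
def assign_category(probability, subfamily_threshold):
    """Map an SVM probability to Category A, B or C for its subfamily."""
    if probability >= subfamily_threshold:
        return "A"
    if probability >= 0.5 * subfamily_threshold:
        return "B"
    return "C"


def confidence(category, group):
    """Return the confidence ranking for a category and group (1 to 4)."""
    if group not in (1, 2, 3, 4):
        raise ValueError("group must be 1, 2, 3 or 4")
    if category == "A":
        return "high"
    if category == "B":
        return "high" if group in (1, 2) else "medium"
    if category == "C":
        return "low"
    raise ValueError("category must be 'A', 'B' or 'C'")


def annotate(subfamily, category, group):
    """Build an annotation label following the confidence ranking."""
    level = confidence(category, group)
    if level == "high":
        return f"{subfamily} ion channel"
    if level == "medium":
        return f"{subfamily}-like ion channel"
    return "ion channel-like protein"  # no family or subfamily assigned


# Hypothetical threshold of 0.8 for a GABA subfamily
category = assign_category(0.55, 0.8)   # between 0.4 and 0.8
label = annotate("GABA", category, group=3)
```

Keeping the threshold per subfamily as an argument reflects the text, where each subfamily has its own probability threshold derived from the test dataset.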

Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
