Using stratified random sampling, we divided the microarray matrix into a training set and a validation set; no sample appears in both sets. As predictors for the model, we initially selected the 55 MS susceptibility locus genes identified by IMSGC and WTCCC2 and excluded all other genes in the microarray matrix to reduce the computational cost. Because GPL570 has no probe annotated to TNFRSF6B, only 54 genes remain in the training and validation sets. After importing the data into R, we converted the “status” variable into a categorical variable (factor) and confirmed that the data contained no missing values. We then further screened the predictors with Random Forests, using the R package “randomForest.” Once the forest reaches about 100 trees, the Random Forests classification tends to stabilize (as shown in Figure 2). At this point, we obtained the Gini index of each predictor and selected the top 8 genes, each with a Gini index greater than 1.
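As a minimal sketch of the split and screening logic described above, the following Python code illustrates a stratified split into disjoint sets and a threshold-based selection of predictors. It is a stand-in for illustration only, not the R workflow used in this protocol. The function names `stratified_split` and `select_by_importance`, the 70/30 split ratio, the random seed, the sample labels, and the importance scores are all assumptions; the text does not state the split ratio, and in the actual analysis the importance values would come from the fitted Random Forests model.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices into disjoint training and validation sets,
    keeping the class proportions of `labels` as close as possible.

    `train_fraction` is an assumed value; the protocol does not state it.
    """
    rng = random.Random(seed)

    # Group sample indices by class label (the strata).
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, validation = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # random sampling within each stratum
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        validation.extend(indices[cut:])
    return sorted(train), sorted(validation)


def select_by_importance(importance, threshold=1.0):
    """Return predictors whose importance exceeds `threshold`, highest first."""
    kept = [(gene, score) for gene, score in importance.items() if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [gene for gene, _ in kept]


# Hypothetical labels: 10 cases and 10 controls (not the study's data).
status = ["MS"] * 10 + ["control"] * 10
train_idx, valid_idx = stratified_split(status, train_fraction=0.7, seed=42)

# Hypothetical importance scores, made up for illustration only.
toy_importance = {"GENE_A": 2.4, "GENE_B": 0.6, "GENE_C": 1.3}
selected = select_by_importance(toy_importance, threshold=1.0)
```

Shuffling within each class before cutting is what makes the split stratified: each set keeps roughly the same case-to-control ratio as the full matrix, and because every index is assigned exactly once, the two sets cannot share a sample.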