We clustered samples by their gene-expression profiles using unsupervised hierarchical clustering with the “complete” agglomeration method and the Euclidean distance between samples. Heat maps were drawn using the iheatmapr package (v0.5.1) in R (31). The diagnosis groups in the clustering were MBOT, low-stage (I/II) MOC, advanced-stage (III/IV) MOC, pancreas, gastric, and lower GI (colorectal and appendiceal combined).

We used random forest analysis with stratified bootstrapping (32) to assess how well the gene-expression profiles predict the disease class (diagnosis group) of each sample. The cohort was divided into independent training and testing sets by stratified random subsampling, maintaining a balanced proportion of samples from each disease class. A random forest classifier (the randomForest package in R, version 4.6-14) was trained on the training set using default parameters and benchmarked against the test set to obtain an error rate (Supplementary Methods). We repeated this procedure 100 times to obtain a distribution of error rates, the mean overall error rate, and the mean and standard deviation of each element of the confusion matrix, which tabulates the number of samples by actual and predicted class.
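The repeated train/test procedure described above can be illustrated with a minimal Python sketch. It is not the authors' R code, and several details are assumptions made only for illustration: the function names (`stratified_split`, `train_nearest_centroid`, `repeated_evaluation`) are hypothetical, the 70% training fraction is not stated in the text, and a simple nearest-centroid rule stands in for the random forest so that the example needs only the standard library. Any classifier exposed as a function that takes training data and returns a prediction function could be substituted.

```python
import math
import random
import statistics
from collections import defaultdict


def stratified_split(labels, train_fraction, rng):
    """Split sample indices into training and test sets, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        # Assumes every class has at least two samples, so that both
        # sides of the split receive at least one sample of the class.
        n_train = min(len(members) - 1, max(1, round(train_fraction * len(members))))
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test


def train_nearest_centroid(X, y):
    """Stand-in classifier (NOT a random forest): closest class mean wins."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        previous = sums.get(label, [0.0] * len(row))
        sums[label] = [s + v for s, v in zip(previous, row)]
        counts[label] += 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(row):
        # Euclidean distance to each class centroid; ties broken by class name.
        return min(sorted(centroids), key=lambda c: math.dist(row, centroids[c]))

    return predict


def repeated_evaluation(X, y, train_fn, n_repeats=100, train_fraction=0.7, seed=0):
    """Repeat split/train/test and summarise error rates and confusion cells.

    train_fraction=0.7 is an assumed value; n_repeats must be at least 2
    because a sample standard deviation is computed across repeats.
    """
    rng = random.Random(seed)
    classes = sorted(set(y))
    error_rates = []
    cells = defaultdict(list)  # (actual, predicted) -> one count per repeat
    for _ in range(n_repeats):
        train, test = stratified_split(y, train_fraction, rng)
        predict = train_fn([X[i] for i in train], [y[i] for i in train])
        counts = {(a, p): 0 for a in classes for p in classes}
        errors = 0
        for i in test:
            predicted = predict(X[i])
            counts[(y[i], predicted)] += 1
            errors += predicted != y[i]
        error_rates.append(errors / len(test))
        for cell, n in counts.items():
            cells[cell].append(n)
    confusion = {
        cell: (statistics.mean(v), statistics.stdev(v)) for cell, v in cells.items()
    }
    return error_rates, statistics.mean(error_rates), confusion
```

Calling `repeated_evaluation(X, y, train_nearest_centroid)` with a list of expression profiles `X` and matching class labels `y` returns the per-repeat error rates, their mean, and a dictionary mapping each (actual, predicted) pair to the mean and standard deviation of its count across repeats.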