We constructed a Pearson nearest-centroid classifier for NMIBC based on the recently published classifier for the MIBC consensus subtypes (ref. 27). Only samples with positive silhouette scores were used for feature selection (n = 505). We filtered the expression matrix to include genes with a median expression > 0 in at least one of the four classes and used a step-wise ANOVA approach to identify genes with significantly different expression levels across classes. ANOVA between all four classes resulted in 13,650 significant genes (BH-adjusted p-values < 0.05). Genes highly expressed in class 2b dominated the list, so we removed class 2b samples and previously significant genes from the dataset and performed a second round of ANOVA on the remaining classes. This analysis added only four significant genes to the feature list (BH-adjusted p-values < 0.05). Next, class 2a samples were removed and one last round of ANOVA between class 1 and class 3 was performed (corresponding to a t-test), resulting in 109 significant genes (BH-adjusted p-values < 0.05). In total, 13,762 genes were thus suggested to be differentially expressed between classes. The step-wise ANOVA approach was chosen instead of multiple pairwise t-tests to reduce the number of statistical tests while still assessing differences between all classes. We computed the AUC associated with each gene for prediction of the four classes and kept genes with an AUC > 0.6 (n = 10,149). An additional filtering step kept only genes with a mean expression > 0 across all samples. Overall, the initial selection of features resulted in a list of 9,451 genes.
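
The step-wise selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the data layout (a dict mapping each gene to its per-sample values) and the pluggable `pvalue_fn` are all hypothetical. The per-gene test is left as a parameter because the Python standard library has no F-distribution; in practice a one-way ANOVA p-value could be supplied instead.

```python
from typing import Callable, Dict, List, Sequence


def bh_adjust(pvals: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def stepwise_selection(
    expr: Dict[str, List[float]],      # gene -> expression value per sample
    labels: List[str],                 # class label per sample
    drop_order: Sequence[str],         # classes removed after each round
    pvalue_fn: Callable[[List[List[float]]], float],  # hypothetical test hook
    alpha: float = 0.05,
) -> List[str]:
    """Round 1 tests all classes; each later round drops one class and
    re-tests only the genes that were not significant before."""
    selected: List[str] = []
    keep = list(range(len(labels)))
    remaining = list(expr)
    for dropped in [None] + list(drop_order):
        if dropped is not None:
            keep = [i for i in keep if labels[i] != dropped]
        classes = sorted({labels[i] for i in keep})
        pvals = []
        for gene in remaining:
            groups = [[expr[gene][i] for i in keep if labels[i] == c]
                      for c in classes]
            pvals.append(pvalue_fn(groups))
        adjusted = bh_adjust(pvals)
        hits = [g for g, q in zip(remaining, adjusted) if q < alpha]
        hit_set = set(hits)
        selected.extend(hits)
        remaining = [g for g in remaining if g not in hit_set]
    return selected
```

With `drop_order=["2b", "2a"]` this runs three rounds, mirroring the order described in the text. If SciPy is available, `pvalue_fn=lambda groups: scipy.stats.f_oneway(*groups).pvalue` would be one way to obtain the ANOVA p-value; the AUC and expression filters that follow are not covered by this sketch.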

We used leave-one-out cross-validation (LOOCV) to assess the classification performance associated with different subsets of the 9,451 features. In each LOOCV run, we computed the mean fold-change associated with each gene for each class versus the others. Genes were ordered by their mean fold-change within each class and the four gene lists were used to generate several gene subsets. The N top upregulated and N top downregulated genes within each class, with N varying from 50 to 800, were selected and used as feature input for the classifier. We obtained the lowest LOOCV error rate when selecting the 368 top upregulated and 368 top downregulated genes within each class (1,964 unique genes in total). Finally, genes appearing in > 80% of the LOOCV runs were selected and used to build the final classifier (n = 1,942). We computed four centroids corresponding to the four NMIBC classes (i.e., the mean gene expression profile of the 1,942 chosen feature genes for each class). Class labels are then assigned to single NMIBC samples based on the Pearson correlation between a sample’s expression profile and each of the four class centroids. The NMIBC classifier is available as a web application at http://nmibc-class.dk, as an R package at https://github.com/sialindskrog/classifyNMIBC or in Supplementary Software 1.
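
The per-run feature selection and the nearest-centroid assignment can be illustrated with the short sketch below. It is not the classifyNMIBC implementation; all function names are hypothetical. It also assumes expression values are on a log scale, so that the difference between the class mean and the mean of all other samples stands in for the mean fold-change. Each sample is a list of expression values in a fixed gene order.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_features(samples: List[List[float]], labels: List[str], n: int) -> List[int]:
    """Indices of the n most up- and n most down-regulated genes per class
    (class mean minus mean of all other samples; log scale assumed)."""
    n_genes = len(samples[0])
    chosen = set()
    for lab in set(labels):
        inside = [s for s, l in zip(samples, labels) if l == lab]
        outside = [s for s, l in zip(samples, labels) if l != lab]
        diff = [sum(s[g] for s in inside) / len(inside)
                - sum(s[g] for s in outside) / len(outside)
                for g in range(n_genes)]
        ranked = sorted(range(n_genes), key=lambda g: diff[g])
        chosen.update(ranked[:n])             # most down-regulated
        chosen.update(ranked[n_genes - n:])   # most up-regulated
    return sorted(chosen)


def build_centroids(samples: List[List[float]], labels: List[str]) -> Dict[str, List[float]]:
    """Mean expression profile per class."""
    by_class: Dict[str, List[List[float]]] = {}
    for s, lab in zip(samples, labels):
        by_class.setdefault(lab, []).append(s)
    return {lab: [sum(col) / len(rows) for col in zip(*rows)]
            for lab, rows in by_class.items()}


def classify(sample: List[float], centroids: Dict[str, List[float]]) -> str:
    """Assign the class whose centroid correlates best with the sample."""
    return max(centroids, key=lambda lab: pearson(sample, centroids[lab]))


def loocv_error(samples: List[List[float]], labels: List[str], n: int) -> float:
    """Leave-one-out error rate, re-selecting features in every run."""
    errors = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        feats = select_features(train, train_labels, n)
        centroids = build_centroids([[s[g] for g in feats] for s in train],
                                    train_labels)
        held_out = [samples[i][g] for g in feats]
        if classify(held_out, centroids) != labels[i]:
            errors += 1
    return errors / len(samples)
```

Calling `loocv_error` for a range of `n` values and keeping the value with the lowest error corresponds to the scan over N described above. The later step of retaining genes that appear in more than 80% of the LOOCV runs is not shown here.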

A similar approach was used to construct a Pearson nearest-centroid classifier for the T1HG subtypes, resulting in 883 chosen feature genes.
