Distributed gene representations generated by G2Vec were used to group genes and to compute gene scores for the identification of prognostic biomarkers. The gene vectors formed two groups, one associated with good outcome and the other with poor outcome (Fig. 1). These groups, named L-groups, could be detected with the K-means clustering algorithm (Supplementary Fig. S1). We then selected prognostic biomarkers from each L-group using gene scores. The gene score of a gene was defined as the mean of its d-score and its t-score. The d-score is the Euclidean distance between the gene vector and the center of the initial gene vectors (the zero vector). The t-score is the absolute value of the t-statistic measuring the difference in gene expression levels between good and poor outcomes. Both scores were normalized to the range 0 to 1 by min-max transformation. We selected the 50 highest-scoring genes from each L-group, resulting in 100 biomarkers. For each fold, the 100 biomarkers were identified using the training data and validated on the test data with a random forest classifier.
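The gene score computation described above can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`min_max`, `welch_t`, `gene_scores`, `top_genes`) and the dictionary-based inputs are hypothetical. Two points are assumptions, because the text does not state them: Welch's form of the t-statistic is used, and min-max normalization is applied across all genes rather than within each L-group.

```python
import math
from statistics import mean, variance


def min_max(values):
    """Rescale a list of numbers to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def welch_t(a, b):
    """Welch's t-statistic for two samples.

    Assumption: the text does not say which t-test variant was used.
    Both samples together must have nonzero variance, otherwise this divides by zero.
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def gene_scores(vectors, expr_good, expr_poor):
    """Return {gene: score}, the mean of the normalized d-score and t-score."""
    genes = list(vectors)
    # d-score: Euclidean distance from the zero vector, i.e. the vector norm.
    d = min_max([math.sqrt(sum(x * x for x in vectors[g])) for g in genes])
    # t-score: absolute t-statistic between good- and poor-outcome expression.
    t = min_max([abs(welch_t(expr_good[g], expr_poor[g])) for g in genes])
    return {g: (di + ti) / 2 for g, di, ti in zip(genes, d, t)}


def top_genes(scores, l_group, k=50):
    """Pick the k highest-scoring genes within one L-group."""
    return sorted(l_group, key=lambda g: scores[g], reverse=True)[:k]


# Toy data (made up for illustration): three genes with 2-dimensional vectors.
vectors = {"A": [3.0, 4.0], "B": [0.0, 0.0], "C": [0.6, 0.8]}
expr_good = {"A": [1, 2, 3], "B": [1, 2, 3], "C": [1, 2, 3]}
expr_poor = {"A": [4, 5, 6], "B": [1, 2, 3], "C": [2, 3, 4]}

scores = gene_scores(vectors, expr_good, expr_poor)
selected = top_genes(scores, ["A", "B", "C"], k=2)
```

In the procedure described in the text, `top_genes` would be called once per L-group with `k=50`, and the two resulting lists would be combined into the 100 biomarkers.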