mRNA expression data from primary tumors and the associated clinical data of 452 patients in The Cancer Genome Atlas Colon Adenocarcinoma (TCGA-COAD) project were obtained from cBioPortal as the discovery set, and gene expression data from adjacent normal tissues of 41 TCGA-COAD patients were obtained from UCSC Xena as the reference set [22]. The mRNA sequencing data of the discovery and reference sets were generated on the Illumina HiSeq 2000 platform and processed with the RNAseqV2 pipeline, which uses RNA-Seq by Expectation Maximization with upper-quartile normalization (RSEM-UQ) for quantification. To validate the prognostic performance of the identified pathway-based factors, an independent dataset of 106 colon cancer patients from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) was obtained from LinkedOmics as the validation set [23]; it provided the same clinical variables and primary-tumor mRNA expression data generated with a similar pipeline. The mRNA sequencing data of the validation set were generated on the Illumina HiSeq 4000 platform and processed with the RNAseqV2 pipeline using RSEM-UQ for quantification. Both datasets support integrated analysis of clinical and omics data.

Patients in the discovery and validation sets whose primary tumors had both clinical data and gene expression data were included in this study. All data were cleaned and checked after acquisition. The clinical data comprised the T, N, and M stages and overall survival information. Other clinical prognostic factors, such as age and tumor location, were not included because this study focused on supplementing the clinical TNM staging system. The T stage was categorized as T1, T2, T3, or T4 (coded 1, 2, 3, and 4, respectively, in subsequent analyses); the N stage as N0, N1, or N2 (coded 0, 1, and 2); and the M stage as M0 or M1 (coded 0 and 1). All gene expression values were log-transformed as log2(value + 1) for subsequent analysis.
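The stage coding and log transformation described above can be sketched in Python with pandas as follows. This is a minimal illustration and not the authors' code: the column names `T_stage`, `N_stage`, and `M_stage`, the function names, and the genes-by-samples layout of the expression matrix are all assumptions made for the example. Stage labels outside the listed categories are left as missing values and are not guessed.

```python
import numpy as np
import pandas as pd

# Numeric codes as stated in the text.
T_CODES = {"T1": 1, "T2": 2, "T3": 3, "T4": 4}
N_CODES = {"N0": 0, "N1": 1, "N2": 2}
M_CODES = {"M0": 0, "M1": 1}


def encode_stages(clinical: pd.DataFrame) -> pd.DataFrame:
    """Add numeric T, N, and M columns to a clinical table.

    Assumes hypothetical columns 'T_stage', 'N_stage', and 'M_stage'
    holding labels such as 'T3' or 'N0'. Labels not in the code tables
    (for example 'Tis' or 'MX') become NaN.
    """
    out = clinical.copy()
    out["T"] = out["T_stage"].map(T_CODES)
    out["N"] = out["N_stage"].map(N_CODES)
    out["M"] = out["M_stage"].map(M_CODES)
    return out


def log2_transform(expr: pd.DataFrame) -> pd.DataFrame:
    """Apply log2(value + 1) to every entry of an expression matrix."""
    return np.log2(expr + 1)
```

Keeping the code tables as explicit dictionaries makes the mapping easy to audit against the staging categories, and unmapped labels surface as missing values that a later exclusion step can handle.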

Samples were excluded if they met any of the following criteria: a stage of Tis, N1c, or MX; an unclear T, N, or M stage; or invalid survival information. In the gene expression data, genes that could not be mapped to valid HUGO Gene Nomenclature Committee (HGNC) symbols in the discovery, validation, and reference sets were removed. In addition, genes with missing or zero expression values were removed.
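A minimal sketch of these exclusion rules in Python with pandas is given below. It is not the authors' code and rests on several assumptions: the column names (`T_stage`, `N_stage`, `M_stage`, `os_time`, `os_event`) and function names are hypothetical; "invalid survival information" is read here as a missing or non-positive follow-up time or an event indicator other than 0 or 1; and "missing or zero expression values" is read as removing a gene if any sample has a missing or zero value. The text does not specify these details, so other readings are possible.

```python
import pandas as pd

# Stage labels retained for analysis; anything else (Tis, N1c, MX,
# missing, or otherwise unclear) leads to exclusion of the sample.
VALID_T = {"T1", "T2", "T3", "T4"}
VALID_N = {"N0", "N1", "N2"}
VALID_M = {"M0", "M1"}


def filter_samples(clinical: pd.DataFrame) -> pd.DataFrame:
    """Keep samples with clear TNM stages and usable survival data."""
    valid_stage = (
        clinical["T_stage"].isin(VALID_T)
        & clinical["N_stage"].isin(VALID_N)
        & clinical["M_stage"].isin(VALID_M)
    )
    # Assumed definition of valid survival information (see lead-in).
    valid_survival = (clinical["os_time"] > 0) & clinical["os_event"].isin([0, 1])
    return clinical.loc[valid_stage & valid_survival]


def filter_genes(expr: pd.DataFrame, valid_symbols: set) -> pd.DataFrame:
    """Keep genes (rows) that have a valid symbol and no missing or zero values.

    `expr` is assumed to be a genes-by-samples matrix indexed by gene
    symbol; `valid_symbols` is a caller-supplied set of accepted HGNC
    symbols.
    """
    complete = expr.notna().all(axis=1)
    nonzero = (expr != 0).all(axis=1)
    has_symbol = expr.index.isin(valid_symbols)
    return expr.loc[complete & nonzero & has_symbol]
```

Applying `filter_genes` separately to the discovery, validation, and reference matrices and then intersecting the surviving gene indices would give a common gene set across the three datasets, if that is the intended outcome.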

