Data and Statistical Analyses

Ryan J. King; Fang Qiu; Fang Yu; Pankaj K. Singh

This method is extracted from research article: Front Cell Dev Biol, Jul 2021

Metabolic and Immunological Subtypes of Esophageal Cancer Reveal Potential Therapeutic Opportunities

DOI: 10.3389/fcell.2021.667852

ActiveState Perl5 version 5.24.1¹ was used to gather and organize data; to perform Student’s t-tests, Benjamini–Hochberg corrections, and quartile quantifications; to feed commands to GSEA through Java; to generate and execute R scripts; and to record the output. Bar graphs were plotted and analyzed in GraphPad Prism 5 (GraphPad Software Inc., San Diego, CA, United States). Machine learning was conducted solely in R using the R package randomForest and H2O v3.32.0.1 (Liaw and Wiener, 2001; H2O.ai, 2016). Cohorts were randomly assigned with a seed of 123: 80% of the data were used for training in the non-tuned randomForest, while 40% of the data were used for training in H2O. Twenty percent of the data were used for validation and testing in H2O, with a maximum of 200 models generated for hyperparameter tuning, when applicable, for distributed random forest, gradient boosting machine, deep learning, and generalized linear model. In all machine learning cases, a seed of 123 was set prior to the run.
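As an illustration of the Benjamini–Hochberg correction mentioned above, the following is a minimal Python sketch rather than the authors' Perl code, which is not reproduced here. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.20]  # made-up p-values, for illustration only
    print(benjamini_hochberg(raw))
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values monotone.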

Partial least-squares discriminant analysis (PLS-DA) was performed with the R package “mixOmics” (Le Cao et al., 2009; Gonzalez et al., 2012; Rohart et al., 2017). mixOmics v6.12.2 was used for feature selection through sparse partial least-squares discriminant analysis (sPLS-DA), subsequent tuning, and the resulting performance assessment. Upper quartile-normalized RSEM was converted to log₂(RSEM + 1) before scaling. All randomization events were preset with a seed of 123. The feature selection tuning grid evaluated performance when including 1–10 genes, 20–200 genes in steps of 10, and 300 genes. For EAC vs. ESCC sPLS-DA discrimination, 80% of the cohorts went into training, with rounding applied after randomization. For EAC vs. ESCC vs. normal tissue, 75% of the data went to training, because an 80% split left the testing cohort with only two normal adjacent tissue samples. Leave-one-out (LOO) cross-validation was used. The area under the curve (AUC) was analyzed using the R package “pROC” v1.16.2 (Robin et al., 2011).
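The preprocessing steps in this paragraph can be sketched in Python as follows. This is an illustrative sketch only: the original analysis was run in R with mixOmics, the function names are hypothetical, and Python's random number generator differs from R's, so a seed of 123 here will not reproduce the authors' actual partitions. Python's built-in round() also rounds exact halves to the nearest even integer, which may differ from the rounding the authors applied.

```python
import math
import random


def tuning_grid():
    """Candidate gene counts: 1-10, 20-200 in steps of 10, and 300."""
    return list(range(1, 11)) + list(range(20, 201, 10)) + [300]


def log2_rsem(values):
    """Apply log2(RSEM + 1) to upper quartile-normalized RSEM values."""
    return [math.log2(v + 1) for v in values]


def split_train_test(sample_ids, train_fraction, seed=123):
    """Shuffle with a fixed seed, then round the training size to whole samples."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    print(len(tuning_grid()))          # number of grid points evaluated
    print(log2_rsem([0, 1, 3]))        # transformed expression values
    print(split_train_test(range(10), 0.8))
```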

R versions 3.3.2 and 3.5.1² were used for the remaining analyses, including heatmaps generated with the R package “gplots” (Warnes et al., 2020). For Supplementary Figure 5, hierarchical cluster analysis was performed using Ward’s minimum-variance method and applied to data with greater variability using the “factoextra” package in R, while heatmaps were generated using Genesis 1.8.1 (Graz University of Technology, Graz, Austria). Overall survival in Supplementary Figure 5 was plotted using the Kaplan–Meier method and compared between cluster groups using log-rank tests in SAS version 9.4 (SAS Institute, Cary, NC, United States). Survival analyses for the genes in the Supplementary Tables were performed with the function “survdiff” from the R package “survival” using the Mantel–Haenszel log-rank test (Grambsch and Therneau, 2000; Therneau, 2020). When Kaplan–Meier curves were presented, p-values were from GraphPad Prism 5, using the Mantel–Cox log-rank test for significance. The Mann–Whitney U test was conducted in R with the function “wilcox.test,” and Spearman’s correlations were calculated with the R package “Hmisc” (Harrell and Dupont, 2020).
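To make the log-rank comparison concrete, here is a minimal two-group sketch in Python. It is not the authors' code (they used “survdiff” in R, SAS, and GraphPad Prism), and the function name and argument layout are hypothetical. It accumulates observed and expected events in one group at each distinct event time, uses the hypergeometric variance, and refers the statistic to a chi-square distribution with one degree of freedom.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; events are 1 for an event, 0 for censoring.

    Returns (chi-square statistic, p-value on 1 degree of freedom).
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Subjects still at risk just before time t (overall and in group A).
        n = sum(1 for ti, _, _ in data if ti >= t)
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)
        # Events occurring exactly at time t (overall and in group A).
        d = sum(1 for ti, e, _ in data if ti == t and e == 1)
        d_a = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        raise ValueError("log-rank variance is zero; groups cannot be compared")
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

For example, `logrank_test([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])` compares a group whose events all occur early with a group whose events all occur late; the survival times here are made up for illustration.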

GraphPad Prism 5 was also used to calculate the Mann–Whitney U test or Student’s t-test when two groups were compared, and the Kruskal–Wallis H test or one-way ANOVA with Bonferroni’s multiple comparison test when more than two groups were compared. Error bars represent the standard error of the mean. For categorical variables, Prism calculated Fisher’s exact test when there were two categories, and the chi-square test was used when there were more than two categories, except where mentioned.
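For the two-category case, Fisher’s exact test on a 2×2 table can be sketched as below. This is an illustrative Python version, not GraphPad Prism’s implementation. The function name is hypothetical, and the two-sided p-value is computed by summing the probabilities of all tables no more likely than the observed one, which is one common convention and is assumed here.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def table_prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = table_prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table as likely as, or less likely than, the observed one;
    # the small tolerance guards against floating-point ties being missed.
    return sum(
        table_prob(x)
        for x in range(lowest, highest + 1)
        if table_prob(x) <= p_observed * (1 + 1e-7)
    )


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(fisher_exact_two_sided(3, 1, 1, 3))
```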

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
