Generating a null distribution of ESs for a given gene set is important both for ranking gene sets of different sizes relative to one another and for assigning significance levels. In the GSEA algorithm for pre-ranked gene lists, null distributions are generated by sampling a random gene set Gk′ containing the same number of members in the ranked list as the original set Gk and recalculating the ES. This implicitly defines a null hypothesis of no association between genes, which, for large gene sets, can result in highly sensitive estimates of significance at the expense of specificity. Therefore, by default, GSPA generates null distributions by first resampling the original gene set to create Gk′, then creating a null set of proximal genes Pk′ as in the original ES calculation for GSPA (Fig. 1B). A null ES is calculated from Pk′, and this procedure is repeated a fixed number of times (100 by default). Alternatively, users can test a less stringent null hypothesis by directly resampling Pk itself. Both methods constitute hybrid null hypotheses (i.e. that relative gene expression patterns do not differ between the gene set and background genes) that reduce precisely to the original GSEA prerank algorithm as r decreases to zero, but the former (default) method directly accounts for known correlations between genes (Maleki et al., 2020). Once the ES and null ES distribution have been calculated, the normalized ES (NES; a transformation of ES that accounts for gene set size), P-value and false discovery rate (FDR) are calculated as in GSEA (Subramanian et al., 2005).
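The two resampling schemes described above can be illustrated with a minimal Python sketch. It is not the GSPA implementation: the function names, the use of Euclidean distance in a gene embedding to define proximal genes, the weighting exponent of 1 in the running sum, and the reading of "directly resampling Pk" as drawing a random set of size |Pk| are all assumptions made for illustration. Genes are indexed by their position in the ranked list, and row i of `embedding` is assumed to correspond to gene i.

```python
import numpy as np

def enrichment_score(ranked_scores, in_set):
    """GSEA-style weighted running-sum statistic (weight exponent 1, an assumption)."""
    in_set = np.asarray(in_set, dtype=bool)
    hits = np.abs(ranked_scores) * in_set
    n_hit, n_miss = hits.sum(), (~in_set).sum()
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running = np.cumsum(hits / n_hit - (~in_set) / n_miss)
    # ES is the running-sum value with the largest absolute deviation from zero.
    return float(running[np.argmax(np.abs(running))])

def proximal_set(members, embedding, r):
    """Mask of genes within distance r of any member (hypothetical proximity rule)."""
    diffs = embedding[:, None, :] - embedding[members][None, :, :]
    return (np.linalg.norm(diffs, axis=2) <= r).any(axis=1)

def null_es(ranked_scores, embedding, members, r, n_perm=100,
            resample_proximal=False, seed=None):
    """Null ES distribution under either resampling scheme.

    resample_proximal=False: draw Gk' of size |Gk|, then build Pk' around it.
    resample_proximal=True:  draw a random set of size |Pk| directly.
    """
    rng = np.random.default_rng(seed)
    n_genes = len(ranked_scores)
    if resample_proximal:
        size = int(proximal_set(members, embedding, r).sum())
    else:
        size = len(members)
    null = np.empty(n_perm)
    for i in range(n_perm):
        drawn = rng.choice(n_genes, size=size, replace=False)
        if resample_proximal:
            mask = np.zeros(n_genes, dtype=bool)
            mask[drawn] = True
        else:
            mask = proximal_set(drawn, embedding, r)
        null[i] = enrichment_score(ranked_scores, mask)
    return null

def nes_and_pvalue(es, null):
    """Normalize ES by the mean same-sign null ES; P-value from the same-sign tail."""
    same = null[null >= 0] if es >= 0 else null[null < 0]
    if same.size == 0 or np.abs(same).mean() == 0:
        return float("nan"), float("nan")
    nes = es / np.abs(same).mean()
    p = float((np.abs(same) >= abs(es)).mean())
    return nes, p
```

With r = 0 and distinct embedding points, `proximal_set` returns only the sampled genes themselves, so the default scheme in this sketch collapses to ordinary pre-ranked resampling, mirroring the limiting behaviour stated in the text. FDR correction across gene sets is omitted here.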