Enrichment score

Jonathan D. Rubin; Jacob T. Stanley; Rutendo F. Sigauke; Cecilia B. Levandowski; Zachary L. Maas; Jessica Westfall; Dylan J. Taatjes; Robin D. Dowell


This method is extracted from research article: Commun Biol, Jun 2021

Transcription factor enrichment analysis (TFEA) quantifies the activity of multiple transcription factors from a single experiment

DOI: 10.1038/s42003-021-02153-7


With the motif instances identified for each of the ranked ROIs, we now detail how TFEA calculates the enrichment score (the “E-score” in Fig. 1) for each TF. The procedure for calculating enrichment requires two inputs:

An ordered $N$-tuple $(\hat{\mu}_i)$: the genomic coordinates of the reference points, assumed to be the centers of all ROIs (e.g., consensus ROIs calculated by muMerge), ranked by DESeq p value (separated by the sign of the fold-change).

An ordered list $(m_i)$: the genomic coordinates of the max-scoring motif instance for each ROI (e.g., motif locations generated by scanning with FIMO).

We first calculate the motif distance $d_i$ for each ROI: the distance from each $\hat{\mu}_i$ to the highest-scoring motif instance $m_i$ within 1.5 kb of $\hat{\mu}_i$. If no $m_i$ exists within 1.5 kb, then $d_i$ is assigned a null value ($\oslash$) (Eq. (4)).
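Written out, the definition above corresponds to a piecewise expression along the following lines. This is a reconstruction from the prose, not a quotation of Eq. (4); measuring the distance as an absolute difference of coordinates and expressing the 1.5 kb window as 1500 bp are assumptions.

```latex
d_i =
\begin{cases}
  \lvert \hat{\mu}_i - m_i \rvert, & \text{if } \lvert \hat{\mu}_i - m_i \rvert \le 1500\ \text{bp} \\
  \oslash, & \text{otherwise}
\end{cases}
```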

We use the distribution of these distances to calculate a weighted contribution to the E-score for each motif instance. Previous work has observed that the distribution of motif positions relative to sites of RNA polymerase initiation decays rapidly with increasing distance²⁶. We have therefore chosen to model the motif weights with an exponential function whose decay length is determined independently for each TF from the background motif distribution.

To compute the weight model, we next calculate the background distribution of motif distances. We assume that the majority of the ROIs, namely those in the middle of the ranked list, experience no significant fold-change. Consequently, we calculate the mean background motif distance (Eq. (5)) for those ROIs whose rank is between the first and third quartiles of the ordered list of ROI positions, $(\hat{\mu}_i)$, as follows
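One way to write the mean described here, reconstructed from the prose and not quoted from Eq. (5), is given below. The index set $B$ is a name introduced for this sketch, and excluding null distances from the average is an assumption (a null value cannot be averaged).

```latex
\bar{d} = \frac{1}{\lvert B \rvert} \sum_{i \in B} d_i,
\qquad
B = \{\, i : Q_1 \le i \le Q_3,\ d_i \ne \oslash \,\}
```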

where $Q_1$ and $Q_3$ are the first and third quartiles, respectively. Our assumption is that the interquartile range of the ordered list $(\hat{\mu}_i)$ (between indices $Q_1$ and $Q_3$) represents the background distribution of motif distances for the given TF, and therefore defines the weighting scale for significant ROIs in our enrichment calculation. We found this to be essential because the background distribution varies between TFs. This variation can be attributed to the chance similarity of a given motif to the base content surrounding the centers of ROIs. For example, in the case of RNA polymerase loading regions identified in nascent transcription data (which show greater GC-content proximal to $\mu$ than the genomic background²⁶), GC-rich TF motifs were more likely to be found proximal to each ROI by chance, which resulted in a smaller $\bar{d}$ than would be the case for a non-GC-rich motif.
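The two steps described so far (motif distances, then the interquartile background mean) can be sketched in a few lines of Python. This is an illustrative sketch, not the TFEA implementation: the function names `motif_distances` and `background_mean_distance` are hypothetical, `None` stands in for the null value, and taking the quartile indices as `N // 4` and `3 * N // 4` is an assumption about how $Q_1$ and $Q_3$ are located.

```python
from statistics import mean

WINDOW_BP = 1500  # the 1.5 kb window stated in the text


def motif_distances(centers, motifs, window=WINDOW_BP):
    """Distance from each ROI center to its best motif, or None if out of range.

    centers[i] is the ROI center coordinate; motifs[i] is the coordinate of the
    highest-scoring motif for ROI i, or None when no motif was found.
    Both lists are assumed to be in ranked order already.
    """
    distances = []
    for mu, m in zip(centers, motifs):
        if m is None or abs(mu - m) > window:
            distances.append(None)  # null value: no motif within the window
        else:
            distances.append(abs(mu - m))
    return distances


def background_mean_distance(distances):
    """Mean non-null distance over the middle half of the ranked list.

    Assumption: Q1 and Q3 are taken as the indices N // 4 and 3 * N // 4.
    """
    n = len(distances)
    q1, q3 = n // 4, (3 * n) // 4
    middle = [d for d in distances[q1:q3] if d is not None]
    if not middle:
        raise ValueError("no motif hits between the first and third quartiles")
    return mean(middle)
```

For example, with eight ROIs all centered at coordinate 1000, only the non-null distances ranked third through sixth contribute to the background mean.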

Having calculated the mean background motif distance, we proceed to calculate the enrichment contribution (i.e., the weight, Eq. (6)) for each ROI in the ordered list (see “Weight Calculation” in Fig. 1).
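Given the exponential model with decay length $\bar{d}$ described earlier, a weight of the following form is consistent with the text. It is a reconstruction and not a quotation of Eq. (6); in particular, assigning zero weight to ROIs with a null distance is an assumption.

```latex
w_i =
\begin{cases}
  e^{-d_i / \bar{d}}, & \text{if } d_i \ne \oslash \\
  0, & \text{if } d_i = \oslash
\end{cases}
```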

In order to calculate the E-score, we first generate the enrichment curve for the given TF (solid line in “Enrichment Curve” in Fig. 1) and the background (uniform) enrichment curve (dashed line in “Enrichment Curve” in Fig. 1). We define the E-score as the integrated difference between these two (scaled by a factor of 2, for the purpose of normalization). The enrichment curve (Eq. (7)), which is the normalized running sum of the ROI weights, and the E-score (Eq. (8)) are calculated as follows:
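A form of these two quantities consistent with the description (normalized running sum, and twice the integrated difference from the uniform curve) is sketched below. It is reconstructed from the prose and is not a quotation of Eqs. (7) and (8); the simple Riemann sum with step $1/N$ is an assumption, and the published discretization may differ (for example, a trapezoidal rule).

```latex
e_i = \frac{\sum_{j=1}^{i} w_j}{\sum_{j=1}^{N} w_j},
\qquad
E = \frac{2}{N} \sum_{i=1}^{N} \left( e_i - \frac{i}{N} \right)
```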

where i is the index for the ROI rank and i/N represents the uniform, background enrichment value for the ith of N ROIs. The background enrichment assumes every ROI contributes an equal weight w_i, regardless of its ranking position. Therefore, the enrichment curve (Eq. (7)) will deviate significantly from the background if there is a correlation between the weight and ranked position of the ROIs. In this case, the E-score will significantly deviate from zero, with E > 0 indicating either the increased activity of an activator TF or decreased activity of a repressor TF. Likewise, E < 0 indicates either a decrease in an activator TF or an increase in a repressor TF. By definition, the range of the E-score is −1 to +1.
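The behavior described in this paragraph can be checked numerically with a minimal sketch. The function names are hypothetical, the zero weight for null distances (`None`) is an assumption, and the E-score uses a plain Riemann sum that may differ from the discretization in the published method.

```python
import math


def weights(distances, d_bar):
    """Exponential weight per ROI; ROIs with no motif (None) get 0 (assumption)."""
    if d_bar <= 0:
        raise ValueError("background mean distance must be positive")
    return [0.0 if d is None else math.exp(-d / d_bar) for d in distances]


def enrichment_curve(w):
    """Normalized running sum of the weights, e_i for ranks i = 1..N."""
    total = sum(w)
    if total == 0:
        raise ValueError("all weights are zero")
    curve, running = [], 0.0
    for w_i in w:
        running += w_i
        curve.append(running / total)
    return curve


def e_score(w):
    """Twice the mean gap between the enrichment curve and the uniform line i/N."""
    n = len(w)
    curve = enrichment_curve(w)
    return 2.0 / n * sum(e_i - i / n for i, e_i in enumerate(curve, start=1))
```

With equal weights the curve coincides with the uniform background and the score is zero; concentrating all weight at the top of the ranked list gives a positive score, and concentrating it at the bottom gives the negative of that score.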

Unlike GSEA, which uses a Kolmogorov–Smirnov-like statistic to calculate its enrichment score⁴⁴, the TFEA E-score is an area-based statistic. GSEA was designed to identify whether a predetermined, biologically related subset of genes is over-represented at the extremes of a ranked gene list. The KS-like statistic is therefore a logical choice for measuring how closely clustered the elements of the subset are, since it directly measures the point of greatest clustering and is otherwise insensitive to the ordering of the remaining elements. Conversely, because TFEA’s ranked list does not contain two categories of elements (the ROIs) and all elements can contribute to the E-score, we wanted a statistic that was sensitive to how all ROIs in the list were ranked; for this reason, we chose the area-based statistic. The null hypothesis for TFEA assumes all ROIs contribute equally to enrichment, regardless of their motif co-localization and rank, hence the uniform background curve to which the enrichment curve is compared.

To determine whether the calculated E-score (Eq. (8)) for a given TF is significant, we generate an E-score null distribution from random permutations of $(\hat{\mu}_i)$. Specifically, we generate a set of 1000 null E-scores $\{E'_i\}$, each calculated from an independent random permutation of the ranked ROIs, $(\hat{\mu}_i)$. Because our E-score statistic is zero-centered and symmetric, we assume $E'_i \sim N(E_0, \sigma_E^2)$. The final E-score for the TF is compared to this null distribution to determine the significance of the enrichment.
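The permutation procedure can be sketched as follows. This is an illustration under stated assumptions, not the TFEA code: the function names are hypothetical, the per-ROI weights are held fixed and only their rank order is shuffled (the published method may recompute quantities per permutation), and the E-score uses the simple Riemann-sum form.

```python
import random
from statistics import mean, stdev


def e_score(w):
    """Twice the mean gap between the normalized running sum and the line i/N."""
    n, total = len(w), sum(w)
    running, area = 0.0, 0.0
    for i, w_i in enumerate(w, start=1):
        running += w_i
        area += running / total - i / n
    return 2.0 * area / n


def null_e_scores(w, n_perm=1000, seed=0):
    """E-scores of n_perm random rank orders of the same weights."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    shuffled = list(w)
    scores = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # a fresh uniform permutation each time
        scores.append(e_score(shuffled))
    return scores


def null_parameters(scores):
    """Estimates of E_0 and sigma_E for the assumed normal null."""
    return mean(scores), stdev(scores)
```

The returned pair plays the role of $E_0$ and $\sigma_E$ in the normal approximation; 1000 permutations matches the number stated in the text.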

Prior to calculating the E-score p value, we apply a correction to the E-score based on the GC-content of the motif relative to that of all other motifs to be tested (user-configurable). This correction was derived from the observation that motifs at the extremes of the GC-content spectrum were more likely to be called significant across a variety of perturbations. We calculate the E-scores for the full set of TFs as well as the GC-content of each motif, $\{(g_i, E_i)\}$. We then calculate a simple linear regression for the relationship between the two
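For reference, the ordinary least-squares fit of E-score on GC-content takes the standard form below. The symbol $\beta$ for the slope is introduced for this sketch, and the exact layout and numbering of the published regression equations are not reproduced here.

```latex
\beta = \frac{\sum_i (g_i - \bar{g})(E_i - \bar{E})}{\sum_i (g_i - \bar{g})^2},
\qquad
E_{GC}(g) = \bar{E} + \beta \, (g - \bar{g})
```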

where $\bar{E}$ and $\bar{g}$ are the average E-score and average GC-content. $E_{GC}(g)$ is the amount of the E-score attributed to the GC-bias for a motif with GC-content $g$. Thus the final E-score for the TF is given by $E_{TF} = E - E_{GC}(g_{TF})$, the difference between Eqs. (8) and (10). If GC-content correction is not performed, then Eq. (8) is taken to be the final E-score. The p value for the final TF E-score is then calculated from the Z-score, $Z_{TF} = (E_{TF} - E_0)/\sigma_E$.
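The correction and the final significance step can be sketched together. The function names are hypothetical, the fit is a plain least-squares line, and converting the Z-score to a two-sided normal p value is an assumption; the text states only that the p value is calculated from the Z-score.

```python
import math
from statistics import mean


def fit_gc_line(gc, e):
    """Least-squares slope and intercept of E-score against motif GC-content."""
    g_bar, e_bar = mean(gc), mean(e)
    sxx = sum((g - g_bar) ** 2 for g in gc)
    if sxx == 0:
        raise ValueError("all motifs have the same GC-content")
    slope = sum((g - g_bar) * (x - e_bar) for g, x in zip(gc, e)) / sxx
    intercept = e_bar - slope * g_bar
    return slope, intercept


def gc_corrected(e_tf, g_tf, slope, intercept):
    """E_TF = E - E_GC(g_TF), with E_GC read off the fitted line."""
    return e_tf - (intercept + slope * g_tf)


def z_and_p(e_tf, e0, sigma):
    """Z-score against the null, and a two-sided normal p value (assumption)."""
    z = (e_tf - e0) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p
```

A TF whose E-score lies exactly on the fitted line has a corrected score of zero, so any apparent enrichment is attributed entirely to GC-bias.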

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
