With the motif instances identified for each of the ranked ROIs, we now detail how TFEA calculates the enrichment score (“E-score” in Fig. 1) for each TF. The procedure for calculating enrichment requires two inputs:
N-tuple ordered list $(\mu_i)$: the genomic coordinates for reference points, assumed to be the centers of all ROIs (e.g., consensus ROIs calculated by muMerge), ranked by DESeq p value (separated by the sign of the fold-change).
Ordered list $(m_i)$: the genomic coordinates of each max-scoring motif instance (e.g., motif locations generated by scanning with FIMO), for each ROI.
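We first calculate the motif distance $d_i$ for each ROI: the distance from each $\mu_i$ to the highest-scoring motif instance $m_i$ within 1.5 kb of $\mu_i$. If no $m_i$ exists within 1.5 kb, then $d_i$ is assigned a null value ($\varnothing$) (Eq. (4)):

$$d_i = \begin{cases} |\mu_i - m_i| & \text{if } m_i \text{ exists within 1.5 kb of } \mu_i \\ \varnothing & \text{otherwise} \end{cases} \tag{4}$$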
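The distance step can be sketched in a few lines of Python. This is a minimal illustration, not the TFEA implementation: the function name `motif_distances`, the use of `None` for the null value, and the assumption that each ROI comes with at most one candidate motif coordinate are all choices made here for clarity.

```python
from typing import List, Optional, Sequence

WINDOW_BP = 1500  # 1.5 kb search window around each ROI center


def motif_distances(centers: Sequence[int],
                    motifs: Sequence[Optional[int]]) -> List[Optional[int]]:
    """Return |mu_i - m_i| per ROI, or None when no motif lies within the window."""
    distances: List[Optional[int]] = []
    for mu, m in zip(centers, motifs):
        if m is None or abs(mu - m) > WINDOW_BP:
            distances.append(None)  # null value: no usable motif instance
        else:
            distances.append(abs(mu - m))
    return distances


# Hypothetical coordinates: one nearby hit, one ROI with no hit, one hit too far away.
print(motif_distances([1000, 5000, 9000], [1040, None, 11000]))  # [40, None, None]
```

Using `None` rather than a sentinel number keeps null distances from being averaged into the background by accident.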
We use the distribution of these distances to calculate a weighted contribution to the E-score for each motif instance. Previous work has observed that the distribution of motif positions relative to sites of RNA polymerase initiation decays rapidly with increasing distance26. We have therefore chosen to model the motif weights with an exponential function whose decay length is determined independently for each TF from the background motif distribution. To compute the weight model, we next calculate the background distribution of motif distances. We assume that the majority of ROIs experience no significant fold-change, namely those ROIs in the middle of the ranked list. Consequently, we calculate the mean background motif distance $\bar{d}$ (Eq. (5)) for those ROIs whose rank is between the first and third quartiles of the ordered list of ROI positions, $(\mu_i)$, as follows:

$$\bar{d} = \operatorname{mean}\{\, d_j : Q_1 \le j \le Q_3,\ d_j \ne \varnothing \,\} \tag{5}$$
where $Q_1$ and $Q_3$ are the first and third quartiles, respectively. Our assumption is that the interquartile range of the ordered list $(\mu_i)$, between indices $Q_1$ and $Q_3$, represents the background distribution of motif distances for the given TF, and therefore defines the weighting scale for significant ROIs in our enrichment calculation. We found this to be essential since the background distribution varies between TFs. This variation in the background can be attributed to the random similarity of a given motif to the base content surrounding the center of ROIs. For example, in the case of RNA polymerase loading regions identified in nascent transcription data (which demonstrate a greater GC-content proximal to $\mu$ as compared to genomic background26), GC-rich TF motifs were more likely to be found proximal to each ROI by chance and thus resulted in a smaller $\bar{d}$ than would be the case for a non-GC-rich motif.
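Having calculated the mean background motif distance, we proceed to calculate the enrichment contribution (i.e., the weight, Eq. (6)) for each ROI in the ordered list (see “Weight Calculation” in Fig. 1):

$$w_i = \begin{cases} e^{-d_i/\bar{d}} & \text{if } d_i \ne \varnothing \\ 0 & \text{otherwise} \end{cases} \tag{6}$$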
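The background mean and the exponential weights can be sketched as follows. This is an illustrative reading of the text, with several assumptions: the weight is taken to be exp(−d/d̄) for ROIs with a motif and 0 for ROIs with a null distance, the quartiles are taken as simple integer cut points on the rank index, and the function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def background_mean_distance(distances: Sequence[Optional[float]]) -> float:
    """Mean of the non-null distances for ROIs ranked between Q1 and Q3."""
    n = len(distances)
    # Assumption: quartile indices are simple integer cut points on the rank.
    q1, q3 = n // 4, (3 * n) // 4
    middle = [d for d in distances[q1:q3] if d is not None]
    if not middle:
        raise ValueError("no motif hits in the interquartile range")
    return sum(middle) / len(middle)


def motif_weights(distances: Sequence[Optional[float]], d_bar: float) -> List[float]:
    """Exponentially decaying weight per ROI; null distances contribute zero."""
    if d_bar <= 0:
        raise ValueError("d_bar must be positive")
    return [0.0 if d is None else math.exp(-d / d_bar) for d in distances]


# Hypothetical distances for eight ranked ROIs (None marks a null distance).
distances = [10, None, 100, 300, 200, None, 50, 20]
d_bar = background_mean_distance(distances)  # mean of 100, 300, 200
weights = motif_weights(distances, d_bar)
print(d_bar)  # 200.0
```

Because the decay length is re-estimated per TF, a motif that lands near ROI centers by chance (small d̄) needs a much closer hit to earn the same weight as a motif with a broad background.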
To calculate the E-score, we first generate the enrichment curve for the given TF (solid line in “Enrichment Curve” in Fig. 1) and the background (uniform) enrichment curve (dashed line in “Enrichment Curve” in Fig. 1). We define the E-score as the integrated difference between these two (scaled by a factor of 2, for the purpose of normalization). The enrichment curve (Eq. (7)), which is the normalized running sum of the ROI weights, and the E-score (Eq. (8)) are calculated as follows:

$$e(i) = \frac{\sum_{j=1}^{i} w_j}{\sum_{j=1}^{N} w_j} \tag{7}$$

$$E = \frac{2}{N} \sum_{i=1}^{N} \left( e(i) - \frac{i}{N} \right) \tag{8}$$
where $i$ is the index for the ROI rank and $i/N$ represents the uniform, background enrichment value for the $i$-th of $N$ ROIs. The background enrichment assumes every ROI contributes an equal weight $w_i$, regardless of its ranking position. Therefore, the enrichment curve (Eq. (7)) will deviate significantly from the background if there is a correlation between the weight and ranked position of the ROIs. In this case, the E-score will significantly deviate from zero, with $E > 0$ indicating either the increased activity of an activator TF or decreased activity of a repressor TF. Likewise, $E < 0$ indicates either a decrease in an activator TF or an increase in a repressor TF. By definition, the range of the E-score is −1 to +1.
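As a worked illustration, the sketch below computes the enrichment curve as a normalized running sum and the E-score as twice the mean difference between that curve and the uniform background i/N. That discrete form is an assumption based on the description in the text, and the function names are hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence


def enrichment_curve(weights: Sequence[float]) -> List[float]:
    """Normalized running sum of the ROI weights, in rank order."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero; the curve is undefined")
    return [s / total for s in accumulate(weights)]


def e_score(weights: Sequence[float]) -> float:
    """Twice the mean difference between the enrichment curve and i/N."""
    n = len(weights)
    curve = enrichment_curve(weights)
    return (2.0 / n) * sum(e - i / n for i, e in enumerate(curve, start=1))


print(e_score([1, 1, 1, 1]))  # 0.0   (uniform weights match the background)
print(e_score([1, 0, 0, 0]))  # 0.75  (all weight at the top of the ranking)
print(e_score([0, 0, 0, 1]))  # -0.75 (all weight at the bottom of the ranking)
```

With all weight on the first ROI the score is (N − 1)/N, which approaches +1 as N grows, consistent with the stated range of −1 to +1.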
Unlike GSEA, which uses a Kolmogorov–Smirnov-like statistic to calculate its enrichment score44, the TFEA E-score is an area-based statistic. GSEA was designed to identify whether a predetermined, biologically related subset of genes is over-represented at the extremes of a ranked gene list. The KS-like statistic is therefore a logical choice for measuring how closely the elements of the subset are clustered, since it directly measures the point of greatest clustering and is otherwise insensitive to the ordering of the remaining elements. Conversely, because TFEA’s ranked list does not contain two categories of elements (the ROIs) and all elements can contribute to the E-score, we wanted a statistic that was sensitive to how all ROIs in the list were ranked; for this reason, we chose the area-based statistic. The null hypothesis for TFEA assumes all ROIs contribute equally to enrichment, regardless of their motif co-localization and rank, hence the uniform background curve to which the enrichment curve is compared.
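The difference between the two kinds of statistic can be seen on a toy example. The sketch below compares an area-based score with a simplified maximum-deviation score computed on the same curve. The maximum-deviation function is a deliberately reduced stand-in for a KS-like statistic, not GSEA's actual weighted running sum, and all names and weights are hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence


def deviations(weights: Sequence[float]) -> List[float]:
    """Difference between the normalized running sum and the uniform line i/N."""
    n, total = len(weights), sum(weights)
    return [s / total - i / n for i, s in enumerate(accumulate(weights), start=1)]


def area_score(weights: Sequence[float]) -> float:
    """Area-based score: uses the deviation at every rank."""
    return 2.0 * sum(deviations(weights)) / len(weights)


def max_deviation_score(weights: Sequence[float]) -> float:
    """Simplified KS-like score: only the single largest deviation counts."""
    return max(deviations(weights), key=abs)


# Equal weight at the very top and the very bottom of the ranking.
split = [4, 0, 0, 0, 0, 0, 0, 4]
print(max_deviation_score(split))  # 0.375
print(area_score(split))           # 0.0
```

The maximum-deviation score reports enrichment driven by the first extreme alone, while the area-based score also registers the equal weight at the other end of the list and cancels to zero.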
To determine whether the calculated E-score (Eq. (8)) for a given TF is significant, we generate an E-score null distribution from random permutations of $(\mu_i)$. We generate a set of 1000 null E-scores $\{E_k\}$, each calculated from an independent random permutation of the ranked ROIs, $(\mu_i)$. Our E-score statistic is zero-centered and symmetric, so we assume $\{E_k\} \sim \mathcal{N}(E_0, \sigma_E^2)$. The final E-score for the TF is compared to this null distribution to determine the significance of the enrichment.
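The permutation null can be sketched as below. Shuffling the ROI ranking is implemented here as shuffling the per-ROI weights, which is equivalent because each weight travels with its ROI. The E-score form, the fixed seed, and the function names are assumptions made for this illustration.

```python
import random
import statistics
from itertools import accumulate
from typing import List, Sequence, Tuple


def e_score(weights: Sequence[float]) -> float:
    """Assumed E-score: twice the mean difference between the curve and i/N."""
    n, total = len(weights), sum(weights)
    return (2.0 / n) * sum(
        s / total - i / n for i, s in enumerate(accumulate(weights), start=1)
    )


def null_e_scores(weights: Sequence[float], n_perm: int = 1000,
                  seed: int = 0) -> List[float]:
    """E-scores from independent random permutations of the ROI ranking."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    shuffled = list(weights)
    scores = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        scores.append(e_score(shuffled))
    return scores


def null_parameters(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard deviation of the assumed-normal null distribution."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical weights: 10 ROIs with strong motif hits, 30 with weak ones.
weights = [1.0] * 10 + [0.1] * 30
e0, sigma_e = null_parameters(null_e_scores(weights))
print(e0, sigma_e)
```

Under random permutation the expected running sum at rank i is exactly i/N of the total, so the null mean should sit close to zero, as the text states.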
Prior to calculating the E-score p value, we apply a correction to the E-score based on the GC-content of the motif relative to that of all other motifs to be tested (user-configurable). This correction was derived from the observation that motifs at the extremes of the GC-content spectrum were more likely to be called significant across a variety of perturbations. We calculate the E-scores for the full set of TFs as well as the GC-content of each motif, $\{(g_i, E_i)\}$. We then calculate a simple linear regression for the relationship between the two (written here with slope $\beta$ and intercept $\alpha$):

$$\beta = \frac{\sum_i (g_i - \bar{g})(E_i - \bar{E})}{\sum_i (g_i - \bar{g})^2}, \qquad \alpha = \bar{E} - \beta \bar{g} \tag{9}$$

$$E_{GC}(g) = \alpha + \beta g \tag{10}$$
where $\bar{E}$ and $\bar{g}$ are the average E-score and average GC-content. $E_{GC}(g)$ is the amount of the E-score attributed to the GC-bias for a motif with GC-content $g$. Thus the final E-score for the TF is given by $E_{TF} = E - E_{GC}(g_{TF})$, the difference between Eqs. (8) and (10). If GC-content correction is not performed, then Eq. (8) is taken to be the final E-score. The p value for the final TF E-score is then calculated from the Z-score, $Z_{TF} = (E_{TF} - E_0)/\sigma_E$.
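The correction and the final significance call can be sketched as follows. The regression is an ordinary least-squares fit of E-score on GC-content across all tested motifs. The two-sided normal p value is an assumption, since the text says only that the p value is calculated from the Z-score, and the function names are hypothetical.

```python
import math
from statistics import mean
from typing import List, Sequence, Tuple


def gc_corrected_scores(gc: Sequence[float], scores: Sequence[float]) -> List[float]:
    """Subtract the least-squares GC trend from each motif's E-score."""
    g_bar, e_bar = mean(gc), mean(scores)
    sxx = sum((g - g_bar) ** 2 for g in gc)
    if sxx == 0:
        raise ValueError("all motifs have the same GC-content")
    slope = sum((g - g_bar) * (e - e_bar) for g, e in zip(gc, scores)) / sxx
    intercept = e_bar - slope * g_bar
    return [e - (intercept + slope * g) for g, e in zip(gc, scores)]


def z_and_p(e_tf: float, e0: float, sigma_e: float) -> Tuple[float, float]:
    """Z-score against the null, with an assumed two-sided normal p value."""
    z = (e_tf - e0) / sigma_e
    return z, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical motifs whose E-scores rise with GC-content, plus one outlier.
gc = [0.2, 0.4, 0.6, 0.8, 0.5]
scores = [0.10, 0.20, 0.30, 0.40, 0.60]
corrected = gc_corrected_scores(gc, scores)
print(corrected)
print(z_and_p(corrected[-1], 0.0, 0.1))
```

Because the fitted intercept is removed along with the slope, the corrected scores average to zero across the tested motifs, so a TF stands out only relative to other motifs of similar GC-content.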