Motif enrichment of DAPs and DEG regulatory regions

BJ Beryl M Jones
VR Vikyath D Rao
TG Tim Gernat
TJ Tobias Jagla
AC Amy C Cash-Ahmed
BR Benjamin ER Rubin
TC Troy J Comi
SB Shounak Bhogale
SH Syed S Husain
CB Charles Blatti
MM Martin Middendorf
SS Saurabh Sinha
SC Sriram Chandrasekaran
GR Gene E Robinson
request Request a Protocol
ask Ask a question
Favorite

TF motif enrichment analysis in this study was performed similarly to the methods described in Whitney et al., 2014. The overall approach is as follows, with details below. For each TF motif, (1) genomic windows were scored for the presence of the motif, (2) window scores were combined into scores for genomic segments of interest, representing either gene regulatory regions or accessibility peaks, (3) a set of motif targets was created using a fixed cutoff on the segment scores, and (4) a statistical test for enrichment was performed between segments that were motif targets and those that were significant in differential analysis.

First, we divided the honey bee genome (version HAv3.1, NCBI accession GCA_003254395.2) into 500 bp windows with 250 bp shifts. We gathered a collection of 223 representative TFs (Kapheim et al., 2015) and downloaded their DNA binding specificities (motifs) characterized as position weight matrices (PWMs) from FlyFactorSurvey (Zhu et al., 2011). Separately for each TF motif, we ran the Stubb algorithm (Sinha et al., 2003) on all genomic windows to score them for the presence of that TF’s binding sites. Tandem repeats in the windows were masked using the Tandem Repeat Finder (Benson, 1999) before calculating the Stubb scores to avoid scoring the repeats as weak binding sites. Since the honey bee genome has significant local G/C heterogeneity (Sinha et al., 2006), we converted the raw Stubb scores for each window into G/C content-normalized empirical p-values. This was done by determining the rank of each window among all genomic windows of similar G/C content (when grouped into 20 G/C bins).

We defined two different collections of genomic segments (accessibility peaks and gene regulatory regions) to analyze with motif enrichment in this study. Since the genomic segments may overlap with a variable number of our genomic windows, we defined a length-adjusted motif score for each segment. This score was calculated using the score of the best scoring window in that segment for the given motif and the number of windows overlapping the segment, as follows: scseg = 1 – (1 - pvalbest)N where, scseg = length-adjusted motif score for the segment, N = number of windows that overlap with the scoring window, and pvalbest = best G/C normalized empirical p-value among the N overlapping windows.

TF enrichment was analyzed for two sets of regions: DAPs (Differentially Accessible Peaks) and DEGs (Differentially Expressed Genes) (Supplementary file 7).

For analysis of DAPs, the collection of genomic segments was defined as the combination of all DAPs and randomly selected non-accessible parts of genome that had the same distribution of lengths as those DAPs. The number of randomly selected genomic segments was set to 10 times the number of DAP segments. For each motif, the top 200 scoring segments from the collection were defined as the TF motif target set. Hypergeometric p-values were calculated for each motif-DAP set pair (Supplementary file 7) to quantify the significance of the overlap between the corresponding TF motif target set and DAP set.

For DEGs, the collection of genomic segments was the regulatory regions of all genes in the honey bee annotation. Each regulatory region was defined as 5 kb upstream to 2 kb downstream of the transcriptional start site of its gene (http://veda.cs.uiuc.edu/beeMotifScores/). The top 500 scoring segments from the gene universe were selected as the TF motif target set for each motif. Finally, the significance of the overlap for each motif-DEG set pair (Supplementary file 7) was calculated with the Hypergeometric p-value.

All p-values were then converted to q-values using the ‘qvalue’ function in the R software package qvalue (Storey et al., 2019) to control the false discovery rate from multiple hypothesis testing.

For motifs enriched both within DAPs and DEG upstream regions, CentriMo (Bailey and Machanick, 2012) from MEME Suite was used to calculate and plot the probability of motif binding across 2 kb windows centered on the peak summit for DAPs and 7 kb windows (5 kb upstream and 2 kb downstream of the transcriptional start site (TSS)) for DEGs. These probabilities are shown in Figure 3B-C.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A