Quantification and Statistical Analysis

XL Xiaowen Lyu
MR M. Jordan Rowley
VC Victor G. Corces
request Request a Protocol
ask Ask a question
Favorite

Paired-end reads from HiChIP experiments were aligned to the human hg38 reference genome using Juicer (Rao et al., 2014). After PCR duplicates and low-quality reads were removed, high-quality reads were assigned to MboI restriction fragments, filtered for valid interaction PETs, and used to generate binned contact matrix hic files. For visualization and further analysis of HiChIP interaction signals, Vanilla coverage square root (VCsqrt) normalized signal for the interaction matrices were derived using the Juicebox tools dump command. Mango (Phanstiel et al., 2015) was used to call significant interactions with default parameters and FDR cutoff at 0.01 and PETs >=10. HiCCUPS (Rao et al., 2014) was used to call CTCF loops in Rad21 HiChIP for the analysis shown in Figure S3. Mango loops were visualized with the WashU Epigenome browser to obtain arc views of significant loops.

To derive the HiChIP virtual 4C line plots, promoters were used as viewpoints, VCsqrt normalized observed over expected (O/E) contact matrices binned into 10 kb fragments were generated using Juicebox dump tools and used to compute contact frequencies between other genomic bins with each viewpoint. Viewpoints were ranked by distances between promoters and interacting dynamic enhancers from upstream to downstream. For clarity of the results, only interactions within 500 kb up/down-stream were plotted. Virtual 4C signals were analyzed with in house python scripts. Briefly, virtual 4C was performed by taking the HiChIP VCsqrt normalized observed over expected (O/E) contact frequency of each bin interacting with the anchor. To account for intensities at various resolutions, individual bins at high resolution were combined using a sliding window congruent to the resolution step, then the average signal across combined bins was calculated for each resolution step examined until reaching the lowest specified resolution. Final normalized enriched scores shown in Figure 4 were generated by calculating the average intensity of each bin across all resolutions from 10 kb to 500 kb, and plotted using java treeview.

Aggregate peak analysis (APA) metaplots and scores were generated as described (Rao et al., 2014) using 10 kb resolution contact matrices. To measure the enrichment of loops over the local background and normalize for different loop distances and protein occupancy bias, the VCsqrt normalized observed over expected (O/E) contact frequency of pixels of loops as well as the surrounding pixels up to 10 bins away in both x and y directions, i.e. 210 kb*210 kb local contact matrices, were collected. To generate the aggregate heatmaps, the median O/E for each position of all 210 kb*210 kb contact matrices for a set of loops were calculated and plotted using the heatmap.2 R package. APA scores were determined by dividing the center pixel value by the mean value of the 25 (5*5) pixels in the lower right section of the APA plot.

Multiple anchor metaplots shown in Figure 6 were obtained at 10 kb resolution and the distance between anchors was scaled to 10 equal bins. For three CTCF anchors, the anchors were oriented such that the stable anchors are the first ones on the left. To compare libraries from cells subjected to different treatments, the observed interaction matrices were normalized between samples by random picking to obtain equal numbers of PETs for each library. VCsqrt Normalization was then applied to all contact matrices. To compare HiChIP aggregate signal changes between different samples on distinct anchor sets, subtraction or log2 fold changes of treatment versus Ctrl were calculated for each loop separately and then summarized by taking the median values of all anchors for visualization.

For comparison between different HiChIP libraries, the following normalization steps were taken. 1) PETs of each library were randomly picked to match the size of the library with lowest numbers of PETs; 2) distance normalized by observed over expected; 3) VCsqrt normalized to balance protein occupancy bias. The normalized matrices were used to call differential loops between conditions by the following procedures. 1) Mango loops from both conditions were combined; 2) normalized contact frequencies in both conditions were combined pairwise on all resulting combined loops in step 1); 3) pairwise interaction frequencies from step 2) were used as input in the edgeR R package to identify condition-enriched significant differential loops (FDR cutoff < 0.1 and fold change >2).

For analyzing overlaps of ChIP-Seq peaks and loop anchors, Mango loop anchors ± 5kb were used to overlap with ChIP-Seq peaks using the bedtools intersect function. To obtain the expected overlapping ratio, ChIP-Seq peaks were shuffled 1000 times and the same analysis was repeated. Significant p values were derived by numbers of times when observed < expected happens divided by 1000.

All reads were mapped to unique genomic regions using Bowtie and the hg38 genome. PCR duplicates were removed manually. Bedtools genomeCoverage function was used to derive bedgraph files for further analysis. To compare changes in ChIP-Seq signals, libraries were normalized by random picking to obtain the same numbers of reads. Normalized reads were used to derive bedgraph files for comparison in IGV. MACS2 was used to call peaks using default parameters with IgG as control. Differential peaks of H3K27ac and RAD21 were found using the edgeR (Robinson et al., 2010) R package at p<0.05, fold change >=3. CTCF differential peaks were obtained using Manorm (Shao et al., 2012) at p<0.05.

Refseq genes TSS ± 1 kb were defined as promoters. HS and Ctrl enriched promoters correspond to promoters of HS-induced and HS-repressed genes (shown in Figure 1), respectively.

Enhancers were defined by using H3K4me1 peaks without H3K4me3 but overlapping ATAC-seq THSSs peaks, TSS±1b was excluded. Among these enhancers, those overlapping H3K27ac peaks were taken as active enhancers, those overlapping with H3K27me3 peaks were taken as poised enhancers, those overlapping with neither H3K27ac nor H3K27me3 peaks were considered primed enhancers.

Differential enhancers were found using the edgeR R package based on H3K27ac ChIP-Seq signals with a p value cutoff at 0.05 and more than 3-fold changes in either condition.

Super enhancers were identified using ROSE (Whyte et al., 2013) with H3K27ac ChIP-Seq signal as input using default parameters.

To identify Polycomb domains, 10 kb bins with a 1 kb sliding window were used to calculate windows with RPKM > 10 in RING1B ChIP-Seq. Windows selected were merged.

For deriving heatmaps of ChIP-Seq signal, anchors plus flanking regions were binned equally to get a blank matrix (anchors × bins). To compare between samples, reads from the same antibody analysis were normalized by random pick. Normalized read pairs were mapped to each genomic bin with the bedtools intersect function to obtain reads counts in each bin for the whole matrix. To normalize for sequencing depth, values in the matrix were divided by library sizes in millions to obtain reads per million per covered bin (RPMPCG or RPM), which was then visualized with Java treeview to derive heatmaps. To obtain average profiles of the ChIP-Seq data, mean values of bins at the same distances from anchors were calculated and plotted. Clustering of ChIP-Seq heatmaps was done using Cluster3 on center ± 3 bins signals of appropriate heatmaps. K-means clustering was used.

All reads were mapped with Bowtie2 and PCR duplicates were removed by Picard tools with in house python script pipeline. After this, reads were split into two ranges based on sizes: THSSs (transposases hypersensitive sites, bound by TFs, <115bp) and mono-nucleosomes (180~247bp). To obtain the exact positioning of nucleosomes, DANPOS (Chen et al., 2013) was used to derive the nucleosomal signals genome wide by dpos function using 180~247 bp fragments as input and 115 bp fragments as background using -p 1 -a 1 -jd 20 -u 0 -m 1. Reads were normalized between samples before running DANPOS. THSSs bedgraphs were made using the bedtools genomeCoverage function.

Heatmaps and average profile of ATAC-seq THSSs and nucleosomal signals were derived in the same way as ChIP-Seq data. Clustering of ATAC-seq heatmaps was done using Cluster3 with K-means clustering.

All reads were process as described for Bru-Seq (Paulsen et al., 2014). First, all sequence reads were mapped to the human ribosomal DNA complete repeating unit (U13369.1) using Bowtie to obtain non-rRNA reads, then those non-rRNA reads were mapped to the human reference genome assembly hg38 using TopHat2 and default parameters. Only uniquely mapped reads with up to two read segment mismatches were allowed to split between exons in RefSeq transcript annotation, but de novo splice junction calling was not performed, since nascent RNA reads are mainly intronic. Picard was used to remove PCR duplicates. Finally, fragments were normalized between Ctrl and HS for obtaining bedgraph files at fragments per million (FPM) for visualization in IGV. Cufflinks was used to calculate gene FPKM values and cuffdiff was used to calculate p-values for differentially expressed genes. Genes with more than 2-fold changes in FPKM values and p <0.01 in cuffdiff were considered Ctrl or HS enriched genes.

Significant motifs of ChIP-Seq peaks and differential loop anchors were found by MEME. Exact motifs sequences were scanned using FIMO and the JASPAR_CORE_2016_vertebrates database against a set of peaks or anchors to obtain the overlapping percentages.

To analyze the footprints of TFs in ATAC-seq data, motifs on a set of peaks or loop anchors were used as anchors for running dnase_average_profile.py scripts of the Wellington program in ATAC-seq mode. To compare between Ctrl and HS samples, read-normalized ATAC-seq THSSs (<115 bp) fragments were used as input to obtain the footprint average profiles. The footprint p values of all motifs on a set of peaks or anchors were derived using the wellington_footprints.py scripts of the Wellington program in ATAC-seq mode on read-normalized ATAC-seq THSSs (<115 bp) fragments.

CTCF motifs were downloaded from the JASPER database. To find the CTCF motif orientation at loop anchors, loops with only one CTCF binding site overlapping one CTCF motif or multiple CTCF motifs in the same orientation were kept for analysis. The percentage of all four possible pairwise combinations of motifs were calculated on the loop anchors.

For analysis of CTCF motif orientation at loop anchors, the number of forward motifs was divided by the number of reverse motifs and log2 transformed. For three CTCF anchors, since the anchors were oriented such that the stable anchors are the first ones on the left, the CTCF motif orientations were also inverted if the three anchors coordinates were in reverse decreasing order in the genome. Therefore, the CTCF orientation on the plot of three anchors are relative to the expansion or shrinking orientation of CTCF loops.

Changes of Eigenvector or contact frequencies were ranked by values and divided into ten groups. The ChIP-Seq signals changes in each group were then averaged and plotted using the heatmap.2 R package.

To measure the enrichment of contacts between genomic regions, pairwise anchors were random shuffled 100 times to generate 100 random pairwise anchors with comparable distance distribution to the experimental set. The significance p value was determined by counting the number of random control sets with more contacts than the experimental set and divided by the number of experiments (100).

To simulate the extrusion complex starting at CTCF sites, an 11*11 empty matrix was generated, where bin0 indicates F anchor and bin11 indicates R anchor. As the diagram in Figure S7C shows, we start assigning contact counts to the left anchor at bin0 and to the right anchor at bin1, and then keep assigning counts to the left anchor on bin0 and the right anchor on bin2, bin3…bin11, then a loop is formed using the forward CTCF as starting point. In the other direction, we performed the same analysis using the reverse CTCF motif as starting point, so that the right anchor was constantly bin11, and left anchors were gradually decreased from bin10 to bin0. This process was repeated 1000 times. After averaging these two types of matrices, the final simulation matrix was derived.

To simulate the extrusion complex starting at random sites between two convergent F-R CTCF motifs, an 11*11 empty matrix was generated, where bin0 indicates the F anchor and bin11 indicates the R anchor. As the diagram in Figure S7D shows, we generated a random bin number bin(n) as a start of extrusion and assigned this bin as i and j in the matrix[i,j]. Each extrusion step was simulated by matrix[i−1,j+1] with a read added at each step. When i reached a CTCF site on the left, it was left constant while j was increased by matrix[i,j+1]. We generated the random starting bin number n by 50 times iterations and all matrices are summed up. These processes are repeated 1000 times. After averaging 1000 matrices above, the final simulation matrix is derived.

Slopes of Hi-C contact enrichment for locus A-F in Figure 7F were determined by off the diagonal signal determined by iterating through matrix[i, o]/counts[i,i], where o ranges between 1 and 8 as shown in Figure 7G. This was done for the CTCF anchor A, as well as for anchors interior to the domain (B – F).

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

post Post a Question
0 Q&A