Data processing

Xiaowen Lyu; M. Jordan Rowley; Michael J. Kulik; Stephen Dalton; Victor G. Corces

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

Data processing

XL Xiaowen Lyu

MR M. Jordan Rowley

MK Michael J. Kulik

SD Stephen Dalton

VC Victor G. Corces

This method is extracted from research article: Nat Commun, Oct 2023

Regulation of CTCF loop formation during pancreatic cell differentiation

DOI: 10.1038/s41467-023-41964-6

Request a Protocol

Ask a question

Favorite

All reads were mapped to unique genomic regions using Bowtie2 and the hg38 human genome release. PCR duplicates were removed using Picard Tools (http://picard.sourceforge.net; https://broadinstitute.github.io/picard/). The Bedtools Genome Coverage function was used to derive bedgraph files for further analysis. To compare changes in ChIP-seq signals, libraries were normalized by random picking to obtain the same numbers of mapped reads. Normalized reads were used to derive bedgraph files for comparison in IGV. MACS2 was used to call peaks using default parameters with IgG ChIP-seq data as control. Differential peaks were found using the edgeR^⁸⁶ R package at p < 0.05, changes over 20% up/down. H3K9me3 changes during differentiation were evaluated in 25 kb bins where positive H3K9me3 bins had an IP / Input signal >= 4 and differential bins were identified by at least a four-fold change. Significant TF motifs present at ChIP-seq peaks and at differential loop anchors were found using MEME. Exact motif sequences were scanned using FIMO and the JASPAR_CORE_2016_vertebrates database against a set of peaks or anchors to obtain the overlapping percentages.

RNA-seq raw reads were aligned using HISAT2 v2.2.0 to the hg38 human genome with default parameters. Transcripts per million (TPM) counts for all annotated human genes and transcripts were calculated using StringTie v2.1.6. Differentially expressed genes were identified using the R package edgeR with a cut-off p-value ≤ 0.05 and fold change over 20% up/down.

Paired-end reads from Hi-C experiments were aligned to the human hg38 reference genome using HiC-Pro v2.10.0. After removal of PCR duplicates and low-quality reads, high-quality reads were assigned to DpnII restriction fragments, filtered for valid interaction contacts, and used to generate binned contact matrix hic files^{⁸⁷,⁸⁸}. For visualization and further analysis of Hi-C contact maps, Knight-Ruiz (KR) normalized signal for the interaction matrices were derived using the Juicebox tools dump command^⁸⁷. SIP v1.3.3 (https://github.com/PouletAxel/SIP/releases) was used to call CTCF loops in the Hi-C interaction matrix^⁵²,⁸⁸. Fit-Hi-C (https://github.com/ay-lab/fithic)^⁸⁹ was used to call significant interactions at 5, 10, and 25 kb resolution with a q value cutoff of > 0.001 and merged together for analyses.

Paired-end reads from HiChIP and MiChIP experiments were aligned to the human hg38 reference genome using HiC-Pro v2.10.0. After PCR duplicates and low-quality reads were removed, high-quality reads were assigned to DpnII restriction fragments, filtered for valid interaction contacts, and used to generate binned contact matrix hic files. For visualization and further analysis of HiChIP and MiChIP contact maps, vanilla coverage square root (VCsqrt) normalized signal for the interaction matrices were derived using the Juicebox tools dump command^⁸⁷. FitHiChIP (https://ay-lab.github.io/FitHiChIP/)^⁹⁰ was used to generate singleton reads resembling ChIP-seq data for finding genomic targets of specific proteins, and to call significant interactions with default parameters with an FDR cutoff at 0.05 for finding long-range contacts associated with specific proteins.

Hi-C and HiChIP contact matrices were processed using the Juicer pipeline^⁸⁸. For downstream analysis, matrices were distance normalized via the formula (observed−expected)/(expected + 1). Comparison of Hi-C or HiChIP was done on distance normalized reads from matrices randomly sampled to contact the same total Hi-C contacts between samples. Traditional A/B compartments were identified through the eigenvector of the Pearson correlation matrix at 25 kb resolution as described^⁹¹. Candidate CTCF loops in each sample were identified using SIP^⁵² at 5 kb and 10 kb resolution from which a total master list of potential loops in any sample was created. Loop calling parameters for SIP were as follows: -norm KR -min 2.0 -max 2.0 -mat 2000 -d 6 -res 5000 -sat 0.01 -t 2500 -nbZero 6 -factor 1 -fdr 0.05 -del true -cpu 48 -factor 4. For comparisons between different Hi-C libraries, the following normalization steps were taken, (1) valid contacts from each library were randomly picked to match the size of the library with the lowest numbers of contacts; (2) KR normalization was applied to obtain the balanced matrices. (3) Matrices were then distance normalized by the formula (observed−expected)/(expected + 1). The normalized matrices were then used to call differential loops of all stages using the following approach: (1) loops obtained using SIP from all stages were combined; (2) KR and distance normalized contact frequencies in all stages were combined pairwise on all resulting combined loops in step 1; (3) pairwise interaction frequencies from step 2 were used as input in the edgeR R package to identify significant differential loops for each stage (FDR cutoff <0.1, p-value < 0.05 and fold change ≥4). To further identify stage specific loops, the following steps were taken, (1) all differential loops between all stages were combined; (2) contact frequencies were calculated for all stages; (3) loops were ranked by the stages when their contact frequencies reach the maximum (for gained loops) or minimum value (for lost loops); (4) loops were allocated to each stage when they reach maximum changes based on step 3 and defined as stage specific loops. To find common loops, combined SIP loops with FDR cutoff ≥ 0.1, p-value≥ 0.05 or fold change < 4 in edgeR were excluded from stage specific differential loops and defined as common loops. Metaplots of loops and the surrounding 100 kb were calculated using SIPMeta with Manhattan distances^⁵². Meta scores were calculated by the intensity of the center bin divided by the median signal of the four bins in the top right corner, similar to APA analysis^⁵¹. Changes to interactions in the proximity of the loop were calculated by measuring differences in average signal in metaplots for the top left corner (category 1, inside-outside left of loop), the bottom right corner (category 2, inside-outside right of loop), and the top right corner (category 3, crossing over the loop). Motifs enriched in the anchors of increased loops were identified by MEME-ChIP using the summits of overlapping ATAC-seq peaks. Profiles across motifs were performed by randomly sampling reads to have the same number between samples and using ngsplot. Significant interactions were obtained via Fit-Hi-C for Hi-C data and FitHiChIP for HiChiP or MiChIP data in 10 kb bins.

Aggregate peak analysis (APA) metaplots and scores were generated as described^⁵¹ using 10 kb resolution contact matrices. To measure the enrichment of loops over the local background and normalize for different loop distances and protein occupancy bias, we collected the VCsqrt normalized observed over expected (O/E) contact frequency of pixels of loops as well as the surrounding pixels up to 10 bins away in both x and y directions i.e., 210 kb*210 kb local contact matrices. The median O/E for each position of all 210 kb*210 kb contact matrices for a set of loops were calculated and plotted using the heatmap.2R package to generate the aggregate heatmaps. APA scores were determined by dividing the center pixel value by the mean value of the 25 (5*5) pixels in the lower right section of the APA plot.

Multiple anchor metaplots were obtained at 10 kb resolution and the distance between anchors was scaled to 10 equal bins. For three CTCF anchors, the anchors were oriented such that the stable anchors are the first ones on the left. To compare libraries from cells subjected to different treatments, the observed interaction matrices were normalized between samples by random picking to obtain equal numbers of contacts for each library. VCsqrt normalization was then applied to all contact matrices. To compare HiChIP aggregate signal changes between different samples on distinct anchor sets, subtraction or log2 fold changes of treatment versus control were calculated for each loop separately and then summarized by taking the median values of all anchors for visualization.

For analyzing overlaps between ChIP-seq peaks and loop anchors, Mango^⁹² loop anchors ± 5 kb were used to overlap with ChIP-seq peaks using the bedtools intersect function. ChIP-seq peaks were shuffled 1000 times and the same analysis was repeated to obtain the expected overlapping ratio. Significant p-values were derived by numbers of times when observed <expected happens divided by 1000.

Enhancers were defined by using H3K4me1 peaks without H3K4me3 but overlapping ATAC-seq-TF peaks and excluding TSS ± 1 bp. Among these enhancers, those overlapping H3K27ac peaks were defined as active enhancers. Differential enhancers were found using the edgeR R package based on H3K27ac ChIP-seq signals with a p-value cutoff at 0.05 and more than 3-fold changes in either condition.

For deriving heatmaps of ChIP-seq signal, anchors plus flanking regions were binned equally to get a blank matrix (anchors × bins). To compare between samples, reads from the same antibody analysis were normalized by random pick. Normalized read pairs were mapped to each genomic bin with the bedtools intersect function to obtain read counts in each bin for the whole matrix. To normalize for sequencing depth, values in the matrix were divided by library sizes in millions to obtain reads per million per covered bin (RPMPCG or RPM), and the result was then visualized with Java treeview to derive heatmaps. Average profiles of the ChIP-seq or ATAC-seq data were calculated and plotted using mean values of bins at the same distances from specific anchors. K-means clustering of ChIP-seq heatmaps was done using Cluster3 on center ± 3 bins signals of the appropriate heatmaps.

ATAC-seq data was processed using an in-house pipeline. First, paired end reads were aligned to the hg38 human reference genome using Bowtie2 with default parameters except -X 2000 -m 1. PCR duplicates were removed using Picard Tools (http://picard.sourceforge.net; https://broadinstitute.github.io/picard/). To adjust for fragment sizes, reads mapped to + strands and -strands were offset by +4 bp and -5 bp respectively. For all ATAC-seq datasets, sub-nucleosome and mono-nucleosome reads were separated based on fragment sizes ATAC-TF 50-115 bp and ATAC-Nuc 180–247 bp. To obtain the exact positioning of nucleosomes, DANPOS^⁹³ was used to derive the nucleosomal signals genome wide using the dpos function and 180–247 bp fragments as input and 115 bp fragments as background using -p 1 -a 1 -jd 20 -u 0 -m 1. Reads were normalized between samples before running DANPOS. Bedgraphs were made using the bedtools genomeCoverage function. MACS2 was used to call peaks for ATAC-TF reads, which are bound by transcription factors. Heatmaps and average profiles of ATAC-TF and ATAC-Nuc signals were derived as described for ChIP-seq data. Clustering of ATAC-seq heatmaps was done using Cluster3 with K-means clustering. To analyze the footprints of TFs in ATAC-TF data, motifs on a set of peaks or loop anchors were used as anchors for running dnase_average_profile.py scripts of the Wellington program in ATAC-seq mode. To compare between samples, read-normalized ATAC-TF fragments were used as input to obtain the footprint average profiles. The footprint p-values of all motifs on a set of peaks or anchors were derived using the wellington_footprints.py scripts of the Wellington program in ATAC-seq mode on read-normalized ATAC-TF fragments.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol