The challenge in weak peak detection of ChIP-seq data lies in the ambiguity in differentiating weak signals of protein binding sites from noise signals produced by the background regions. In ChIP-seq data, signals from the amplified background DNAs can be as strong as true binding signals. ChIP-BIT2 shrinks the distance in read intensity distributions between strong and weak peaks using one global distribution and amplifies the difference between weak peaks and background regions using multiple local distributions. In this way, it brings more power for detecting protein binding sites with different strengths of ChIP-seq read enrichment (Fig. 1a).

ChIP-seq peak detection using a Gaussian mixture model. ChIP-BIT2 a converted read counts to read intensity and then b used a mixture of Gaussian distributions to differentiate (strong and weak) binding events from background signals

To enable flexible detection for narrow or wide peaks, a sliding window is used for peak screening. The window size is adjustable to meet different resolution needs. For example, most TFs have narrow and sharp ChIP-seq peaks. A narrow window size like 50 base pairs (bps) can help identify high-resolution peak boundaries. For HMs, their peaks can be as wide as several thousand bps. A wide window size like 500 bps can effectively smooth signal fluctuation in the wide genomic region of a whole peak.

Given a ChIP-seq profile for a specific protein, assuming there were N candidate genomic regions overlapping with at least two ChIP-seq reads at each, we partitioned the nth region into fixed-length windows and calculated read intensity sn,w for the window w. In the meanwhile, we calculated another read intensity rn,w using data from the matched input ChIP-seq profile. The relative distance of the window w to the nearest gene TSS or enhancer center was denoted by dn,w. ChIP-BIT2 estimated a probability for protein binding occurrence in the window w of the region n as [8]:

Depending on the binding or non-binding status in the variable bn,w (with a uniform prior on binding ‘bn,w= 1’ or non-binding ‘bn,w= 0’), we modeled sn,w a two-component Gaussian mixture distribution as:

If bn,w=1, we assumed the region bound by the protein and modeled read intensities at protein-bound regions using a global Gaussian distribution with mean μ1 and variance σ12, where model parameters μ1 and σ12 were unknown and needed to be estimated. If bn,w=0, we assumed it a background region and modeled the read intensity using a local Gaussian distribution with mean rn,w and variance σ02 (the variance of background signals was estimated using the input ChIP-seq data).

The second likelihood function Pdn,w|bn,w in Eq. (1) modeled the regulatory effects of the selected region on nearby genes. ChIP-seq data visualization around gene promoter regions (Additional file 1: Fig. S1A) and evidence from previous studies [8, 17] both suggest that: for protein binding sites, the ChIP-seq read intensity follows an exponential distribution towards the gene TSS; for background regions, the distribution is relatively uniform around the TSS. Therefore, we modeled dn,w a mixture distribution with two components as follows:

where λ represented the exponential distribution parameter, which was unknown and needed to be estimated. dP represented the length of a promoter region.

For enhancers, ChIP-seq data visualization (Additional file 1: Fig. S1B) shows that the distribution of ChIP-seq read intensity is uniform and does not correlate with the distance to the enhancer center or the nearest TSS. Therefore, specifically for peak calling at distal enhancers, we assumed uniform distributions on dn,w as:

where dE represented the length of an enhancer region.

ChIP-BIT2 used the Expectation–Maximization algorithm to iteratively estimate distribution parameters and the probability of binding occurrence in each window (Fig. 1b). Briefly, in the E-step, ChIP-BIT2 estimated the model parameters based on the inferred binding status variables (bn,w) of all regions; in the M-step, ChIP-BIT2 updated the posterior probability Pbn,w|sn,w,dn,w for each window using the estimated model parameters, and then updated the binding status in the variable bn,w accordingly. We iterated the E and M steps until the changes of parameter values were less than 5%. ChIP-BIT2 combined consecutive windows with probabilities higher than a cut-off threshold and output them together as one single peak. Depending on the protein feature and the window resolution, a sharp peak can take one or two windows and a wide peak can take more than ten windows.

Note: The content above has been extracted from a research article, so it may not display correctly.

Please log in to submit your questions online.
Your question will be posted on the Bio-101 website. We will send your questions to the authors of this protocol and Bio-protocol community members who are experienced with this method. you will be informed using the email address associated with your Bio-protocol account.

We use cookies on this site to enhance your user experience. By using our website, you are agreeing to allow the storage of cookies on your computer.