# Also in the Article

ChIP-BIT algorithm
This protocol is extracted from research article:
ChIP-BIT2: a software tool to detect weak binding events using a Bayesian integration approach
BMC Bioinformatics, Apr 15, 2021;

Procedure

The challenge in weak peak detection of ChIP-seq data lies in the ambiguity in differentiating weak signals of protein binding sites from noise signals produced by the background regions. In ChIP-seq data, signals from the amplified background DNAs can be as strong as true binding signals. ChIP-BIT2 shrinks the distance in read intensity distributions between strong and weak peaks using one global distribution and amplifies the difference between weak peaks and background regions using multiple local distributions. In this way, it brings more power for detecting protein binding sites with different strengths of ChIP-seq read enrichment (Fig. 1a).

ChIP-seq peak detection using a Gaussian mixture model. ChIP-BIT2 a converted read counts to read intensity and then b used a mixture of Gaussian distributions to differentiate (strong and weak) binding events from background signals

To enable flexible detection for narrow or wide peaks, a sliding window is used for peak screening. The window size is adjustable to meet different resolution needs. For example, most TFs have narrow and sharp ChIP-seq peaks. A narrow window size like 50 base pairs (bps) can help identify high-resolution peak boundaries. For HMs, their peaks can be as wide as several thousand bps. A wide window size like 500 bps can effectively smooth signal fluctuation in the wide genomic region of a whole peak.

Given a ChIP-seq profile for a specific protein, assuming there were $N$ candidate genomic regions overlapping with at least two ChIP-seq reads at each, we partitioned the $n$th region into fixed-length windows and calculated read intensity $sn,w$ for the window $w$. In the meanwhile, we calculated another read intensity $rn,w$ using data from the matched input ChIP-seq profile. The relative distance of the window $w$ to the nearest gene TSS or enhancer center was denoted by $dn,w$. ChIP-BIT2 estimated a probability for protein binding occurrence in the window $w$ of the region $n$ as [8]:

Depending on the binding or non-binding status in the variable $bn,w$ (with a uniform prior on binding ‘$bn,w=$ 1’ or non-binding ‘$bn,w=$ 0’), we modeled $sn,w$ a two-component Gaussian mixture distribution as:

If $bn,w=1$, we assumed the region bound by the protein and modeled read intensities at protein-bound regions using a global Gaussian distribution with mean $μ1$ and variance $σ12$, where model parameters $μ1$ and $σ12$ were unknown and needed to be estimated. If $bn,w=0$, we assumed it a background region and modeled the read intensity using a local Gaussian distribution with mean $rn,w$ and variance $σ02$ (the variance of background signals was estimated using the input ChIP-seq data).

The second likelihood function $Pdn,w|bn,w$ in Eq. (1) modeled the regulatory effects of the selected region on nearby genes. ChIP-seq data visualization around gene promoter regions (Additional file 1: Fig. S1A) and evidence from previous studies [8, 17] both suggest that: for protein binding sites, the ChIP-seq read intensity follows an exponential distribution towards the gene TSS; for background regions, the distribution is relatively uniform around the TSS. Therefore, we modeled $dn,w$ a mixture distribution with two components as follows:

where $λ$ represented the exponential distribution parameter, which was unknown and needed to be estimated. $dP$ represented the length of a promoter region.

For enhancers, ChIP-seq data visualization (Additional file 1: Fig. S1B) shows that the distribution of ChIP-seq read intensity is uniform and does not correlate with the distance to the enhancer center or the nearest TSS. Therefore, specifically for peak calling at distal enhancers, we assumed uniform distributions on $dn,w$ as:

where $dE$ represented the length of an enhancer region.

ChIP-BIT2 used the Expectation–Maximization algorithm to iteratively estimate distribution parameters and the probability of binding occurrence in each window (Fig. 1b). Briefly, in the E-step, ChIP-BIT2 estimated the model parameters based on the inferred binding status variables ($bn,w$) of all regions; in the M-step, ChIP-BIT2 updated the posterior probability $Pbn,w|sn,w,dn,w$ for each window using the estimated model parameters, and then updated the binding status in the variable $bn,w$ accordingly. We iterated the E and M steps until the changes of parameter values were less than 5%. ChIP-BIT2 combined consecutive windows with probabilities higher than a cut-off threshold and output them together as one single peak. Depending on the protein feature and the window resolution, a sharp peak can take one or two windows and a wide peak can take more than ten windows.

Note: The content above has been extracted from a research article, so it may not display correctly.

Q&A