Coolpup.py is a versatile tool that uses .cool files as the main input together with a bed (chrom, start, end) or pair bed ("bedpe": chrom1, start1, end1, chrom2, start2, end2) file to define the regions under investigation. The tool is implemented as a python package which parses all arguments via argparse, performs the computation and saves the output file(s). It leverages the scientific python environment, taking advantage of numpy (van der Walt et al., 2011), scipy (Virtanen et al., 2020) and pandas (McKinney, 2010). A separate CLI tool included in the package (plotpup.py) can be used to visualize the results and uses matplotlib (Hunter, 2007). The code is available on GitHub (https://github.com/Phlya/coolpuppy) and the package can be installed using pip, which then makes coolpup.py and plotpup.py available in the command line. Alternatively, all main functions can be accessed directly from python.
The overall procedure for piling up a lot of small regions is the following. To minimize the number of file reads (at the cost of required computer memory), a sparse representation of each chromosome’s Hi-C contact matrix is loaded into memory. Then, using an iterator, each required location (on- or off-diagonal) is individually retrieved to generate a corresponding submatrix from the data [with some specified padding around the centre of the region of interest (ROI)], and added to the matrix of the same shape, initialized with zeros, while keeping track of the number of summed-up regions. If specified, coverage of the window on each side is recorded. Similarly, if needed, the window (and the coverage) is rescaled to a required shape. This is done for all chromosomes (optionally, in parallel using multiprocessing), and then all of the results are summed and then divided by the total number of windows. If specified, coverage normalization is applied at this stage. Then, unless otherwise specified, a normalization to remove the distance-dependency of contact probability is applied. In most cases, the best and most efficient way is to use a (chromosome-wide) expected value for each diagonal of the matrix, which can be obtained for a cooler file using, for example, cooltools compute-expected. With the assumption that the probability of interactions only depends on distance, the whole-chromosome expected matrix is a diagonal-constant matrix A with diagonal values d (also known as a Toeplitz matrix), such as: The simplicity of this expected model allows trivial creation of a matrix containing expected values for an arbitrary region of the intra-chromosomal Hi-C map without generating the whole matrix to avoid high-memory requirements, which is done for each ROI. All expected matrices are averaged to generate a normalizing matrix. Alternatively, if the expected values are not available, for example, for single-cell Hi-C data, this normalization can be performed using randomly shifted control regions. In that case, to generate the normalizing matrix, the whole pile-up procedure is repeated, but the coordinates are randomly shifted. In the end, the resulting matrix of averaged ROIs is divided element-wise by the normalizing matrix to remove effects of distance.
If not specified, balanced data with chromosome-wide expected normalization were used when creating pile-ups, except for the zygote and single-cell Hi-C datasets, where randomly shifted controls and coverage normalization were used instead. For the single-cell Hi-C (Nagano et al., 2017) analysis, we only used pairs of convergent CTCF peaks within 100–800 kb of each other, since previous analysis (data not shown) indicated this as the distance range where CTCF-associated loops are most prominent; to reduce noise, RING1B-associated interactions were analysed for all distances above 100 kbp.
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.