We built gkm-SVM models following our previously established pipeline with minor modifications.26,27 For each high-quality sample, as determined by gkmQC, we defined the positive training set as follows. Starting from the top 100,000 open chromatin regions (ranked by their MACS2 p values obtained from our optimized pipeline described above), we removed peaks with >1% N bases or >70% repeats, as well as commonly open regions (defined as regions active in at least 30% of samples across all ENCODE datasets), as previously described.12,27 We further restricted the open chromatin regions to those overlapping H3K27ac peaks from the same tissue (Table S5). As a negative training set, we used an equal number of random genomic regions matched for length, GC content, and repeat fraction. To prevent potential bias caused by variable sequence length, we used fixed-length 600 bp regions for training, obtained by extending ±300 bp from the peak summits. We trained the models with the LS-GKM26 software using l = 11, k = 7, d = 3, and t = 4 (weighted gkm kernels). For each sample, we averaged ten models, each trained with a different random sampling of the negative training set. After training, we combined the models from different samples (i.e., biological replicates) to generate one model per tissue. Training sets and final models are provided in Data S5.
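The selection of the positive training set described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Peak` record, its field names, and the `build_positive_set` function are hypothetical, and the per-peak annotations (N-base fraction, repeat fraction, commonly open flag, H3K27ac overlap) are assumed to have been computed beforehand. The sketch also assumes that peaks are ranked by -log10(p value), so larger values are more significant, and that summits too close to a chromosome start are dropped so that every region is exactly 600 bp.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    """Hypothetical pre-annotated open chromatin peak (all fields are assumptions)."""
    chrom: str
    summit: int             # 0-based summit coordinate
    neg_log10_p: float      # -log10 of the MACS2 p value; larger is more significant
    n_frac: float           # fraction of N bases in the region
    repeat_frac: float      # fraction of repeat-masked bases in the region
    commonly_open: bool     # active in at least 30% of samples
    overlaps_h3k27ac: bool  # overlaps an H3K27ac peak from the same tissue


def build_positive_set(peaks, top_n=100_000, flank=300,
                       max_n_frac=0.01, max_repeat_frac=0.70):
    """Return fixed-length (chrom, start, end) regions for the positive set."""
    # Keep the top_n most significant peaks first, then apply the filters.
    ranked = sorted(peaks, key=lambda p: p.neg_log10_p, reverse=True)[:top_n]
    regions = []
    for p in ranked:
        # Remove peaks with >1% N bases or >70% repeats.
        if p.n_frac > max_n_frac or p.repeat_frac > max_repeat_frac:
            continue
        # Remove commonly open regions; require H3K27ac overlap.
        if p.commonly_open or not p.overlaps_h3k27ac:
            continue
        # Assumption: skip summits that cannot be extended by the full flank.
        if p.summit < flank:
            continue
        # Extend +/- flank from the summit: 2 * flank = 600 bp by default.
        regions.append((p.chrom, p.summit - flank, p.summit + flank))
    return regions
```

Each returned tuple is a half-open interval of length `2 * flank`, which a downstream step could use to extract sequences for training. Matching the negative set by length, GC content, and repeat fraction is not covered by this sketch.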