We excluded indels and multi-allelic variants, and categorized the remaining variants as common (allele frequency ≥ 0.005) or rare (allele frequency < 0.005), and as coding or noncoding based on protein-coding exons from Ensembl 94 (ref. 92). Variant counts were analysed across 2,739 non-empty (that is, with at least one variant) contiguous 1-Mb chromosomal segments, and counts in segments at the end of chromosomes with length L < 10⁶ bp were scaled up proportionally by the factor 10⁶ × L⁻¹. For each segment, the coding proportion, C, was calculated as the proportion of bases overlapping protein-coding exons. The distribution of C is fairly narrow, with 80% of segments having C ≤ 0.0195, 99% of segments having C ≤ 0.067 and only 3 segments having C ≥ 0.10. Owing to the significant negative correlation between C and the number of variants in a segment, and potential mapping effects, we used linear regression to adjust the variant counts per segment according to the model count = β × C + A + count_adj, where A is the proportion of segment bases overlapping the accessibility mask (Supplementary Information 1.5) and count_adj is the adjusted count. Unless otherwise noted, we present analyses and results that use these adjusted count values.
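The scaling and adjustment steps described above can be illustrated with a minimal Python sketch. It is not the authors' code. The function names `scale_count` and `adjust_counts` are hypothetical, and the regression is one possible reading of the compact model: it assumes that C and A each receive their own fitted coefficient, that an intercept is included, and that the adjusted count is the residual of the fit.

```python
import numpy as np

SEGMENT_BP = 1_000_000  # nominal segment length (1 Mb)


def scale_count(count, length_bp):
    """Scale a count from a short terminal segment to a 1-Mb equivalent."""
    if length_bp < SEGMENT_BP:
        # Factor 10^6 / L, applied only to segments shorter than 1 Mb.
        return count * SEGMENT_BP / length_bp
    return float(count)


def adjust_counts(counts, coding_prop, access_prop):
    """Return residuals of a least-squares fit of counts on C and A.

    Assumption: C and A each get their own coefficient and an intercept
    is included; the text gives only the compact form of the model.
    """
    counts = np.asarray(counts, dtype=float)
    design = np.column_stack([
        np.ones_like(counts),                   # intercept (assumed)
        np.asarray(coding_prop, dtype=float),   # C, coding proportion
        np.asarray(access_prop, dtype=float),   # A, accessible proportion
    ])
    coef, *_ = np.linalg.lstsq(design, counts, rcond=None)
    return counts - design @ coef               # count_adj as residuals
```

Under these assumptions the adjusted counts are centred on zero, so they describe how far a segment deviates from the count expected for its coding and accessible proportions rather than an absolute number of variants.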