To identify large copy number variants (CNVs), realigned bam files for each sample were filtered for proper-pairs and PCR or optical duplicates were removed using samtools view (RRID:SCR_002105, v1.3, Li et al., 2009). Coverage was then determined using bedtools genomecov (RRID:SCR_006646, v2.17.0) with parameters: ‘-d -split’ (Quinlan and Hall, 2010). Large duplications and deletion were identified using custom scripts in R (R Development Core Team, 2013): genome coverage was determined for 5 kb non-overlapping windows along the genome and each window was normalized by the haploid chromosome coverage of the respective chromosome and sample (i.e. median chromosome coverage divided by somy of the respective chromosome and sample). Large CNVs were identified through stretches of consecutive windows with a somy-normalized median coverage >= 0.5 or<=−0.5 for duplications and deletions, respectively, a minimum length of 25 kb and a median normalized coverage difference across windows >= 0.9 (Supplementary file 6). To identify large CNVs across samples at identical positions and variant type, we grouped CNVs across samples with identical start and end positions within <= 10 kb (i.e. up to two 5 kb windows difference) (Supplementary file 7). CNVs of individual genes were determined based on the filtered bam files (see genome coverages) with bedtools coverage (RRID:SCR_006646, v2.17.0) using parameters ‘-d -split’ (Quinlan and Hall, 2010) and analysing gene coverages in R (R Development Core Team, 2013). The coverage of each gene was approximated by its median coverage and normalized by the haploid coverage of the respective chromosome and sample (Supplementary file 9).
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.