We followed the method of Ruderfer et al.46 to estimate genic dosage sensitivity scores using counts of exon-altering deletions and duplications in a combined callset comprising the 14,623-sample pan-CCDG callset plus 3,172 non-redundant samples from the B37 callset. Build37 CNV calls were lifted over to build38 as BED intervals using CrossMap (v0.2.1)61. For each autosomal gene, we counted the deletions and duplications intersecting coding exons of its principal transcript. In Ruderfer et al.46, the expected number of CNVs per gene was modeled as a function of several genomic features (GC content, mean read depth, etc.), some of which were relevant to their exome read-depth CNV callset but not to our WGS-based, breakpoint-mapping lumpy/svtools callset. To select the features relevant for prediction, we started from the same set of gene-level annotations as in Ruderfer et al.46, restricted to the set of genes in which fewer than 1% of samples carried an exon-altering CNV, and used L1-regularized logistic regression (from the R glmnet package62, v2.0-13), with the penalty chosen by 10-fold cross-validation. The selected features (gene length, number of targets, and segmental duplications) were then used as covariates in a logistic regression-based calculation of per-gene intolerance to DEL and DUP, similar to that described in Ruderfer et al.46. To estimate the parameters of the logistic model for deletions (or duplications, respectively), we restricted to the set of genes with <1% of samples carrying a DEL (or DUP, respectively). We then applied the fitted model to the full set of genes and calculated genic CNV intolerance scores as the residuals of the logistic regression of CNV frequency on the genomic features, standardized as z-scores, with winsorization of the lower 5th percentile.
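The final scoring step (residuals, z-score standardization, winsorization of the lower 5th percentile) can be illustrated with a minimal Python sketch. This is not the authors' code, which used R. The function name `intolerance_z_scores`, its inputs, and the choice of observed minus expected as the residual are assumptions made for illustration; the per-gene expected frequencies are assumed to come from an already fitted logistic model, and the sign convention of the published scores is not specified here.

```python
import statistics


def intolerance_z_scores(observed, expected, lower_pct=5):
    """Standardize per-gene residuals and winsorize the lower tail.

    observed:  per-gene CNV carrier frequencies (hypothetical input).
    expected:  per-gene frequencies predicted by a fitted model
               (hypothetical input; the model fit is not shown here).
    lower_pct: integer percentile (1-99) below which z-scores are clamped.
    """
    # Residual = observed minus expected (sign convention is an assumption).
    residuals = [o - e for o, e in zip(observed, expected)]

    # Standardize residuals to z-scores using the population SD.
    mean = statistics.fmean(residuals)
    sd = statistics.pstdev(residuals)
    z = [(r - mean) / sd for r in residuals]

    # Winsorize: clamp everything below the chosen lower percentile
    # up to the value at that percentile.
    cut_points = statistics.quantiles(z, n=100, method="inclusive")
    floor = cut_points[lower_pct - 1]
    return [max(v, floor) for v in z]
```

Only the lower tail is clamped, matching the description above; the upper tail of the z-score distribution is left unchanged.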