Function 1: Structural variant frequency

Surajit Bhattacharya
Hayk Barseghyan
Emmanuèle C. Délot
Eric Vilain

Variant frequency is one of the most important filtration characteristics for identifying rare, possibly pathogenic, variants. Because OGM is not sequence-based, the average SV breakpoint uncertainty is 3.3 kbp [20]. As a result, frequency estimates for SVs are harder to compute than SNV frequencies, because breakpoints vary between the “same” structural variants identified by different techniques.

nanotatoR uses 3 external databases: Database of Genomic Variants (DGV) [32], Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources (DECIPHER) [33] and Bionano Genomics control database (BNDB). The respective functions are named: DGVfrequency, DECIPHERfrequency and BNDBfrequency. The 3 datasets are accessible through the nanotatoR GitHub repository (https://github.com/VilainLab/nanotatoRexternalDB).

BNDB is provided by Bionano Genomics as a set of 4 files, one per SV type (indels, duplications, inversions, and translocations), for each of two human reference genomes (GRCh37/hg19 and GRCh38/hg38). nanotatoR aggregates the variant files of the user-selected reference genome (hg19 or hg38) into a single file (e.g. TXT) used for frequency calculation. This action is performed as part of the function BNDBfrequency with the following input parameters: buildBNInternalDB = TRUE, InternalDBpattern = “hg19” or InternalDBpattern = “hg38”. The following steps are used to calculate the frequency of a query SV in external databases:

Variant-to-variant similarity (step 1.1a): Estimating the frequency of a query SV first requires determining whether it is the same variant as one found in the database of interest. For two SVs to be considered the “same”, nanotatoR, by default, checks that two independent variants of the same type (e.g. deletion) are on the same chromosome, have 50% or greater size similarity, and have breakpoint start and end positions within 10 kilobase pairs (kbp) of each other for insertions/deletions/duplications and within 50 kbp for inversions/translocations. For example, for a deletion on chromosome 1 with a breakpoint start at chr1:350,000 and end at chr1:550,000 on the reference, all deletion variants in chr1:340,000–560,000 with a size similarity of at least 50% would be extracted from the database. Similarly, if the variant were an inversion, nanotatoR would search for variants of the same type on the same chromosome with a breakpoint start between chr1:300,000 and chr1:400,000 and a breakpoint end between chr1:500,000 and chr1:600,000. Currently, the 50% size similarity cutoff is not applied by default to inversions and translocations, as sizes have only recently started to be provided in the SVcaller output; however, users have the option to enable the size similarity check, and future releases of nanotatoR will perform it by default.
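The matching rule can be illustrated with a minimal Python sketch. This is not nanotatoR's code (the package is written in R). The `SV` class, the `is_same_variant` function and the `check_size_inv_trans` flag are hypothetical names. The definition of size similarity as the smaller size divided by the larger one is an assumption, since the text does not define it. The indel window follows the worked example in the text, and translocations are simplified to a single chromosome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    chrom: str
    start: int
    end: int
    sv_type: str

    @property
    def size(self) -> int:
        return abs(self.end - self.start)


def size_similarity(a: SV, b: SV) -> float:
    # Assumed definition: smaller size over larger size, in [0, 1].
    larger = max(a.size, b.size)
    return 1.0 if larger == 0 else min(a.size, b.size) / larger


def is_same_variant(query: SV, db: SV, perc_similarity: float = 0.5,
                    win_indel: int = 10_000, win_inv_trans: int = 50_000,
                    check_size_inv_trans: bool = False) -> bool:
    # Same chromosome and same SV type are required in every case.
    if query.chrom != db.chrom or query.sv_type != db.sv_type:
        return False
    if query.sv_type in ("inversion", "translocation"):
        # Each breakpoint must fall within +/- win_inv_trans of the query's.
        near = (abs(db.start - query.start) <= win_inv_trans
                and abs(db.end - query.end) <= win_inv_trans)
        if not near:
            return False
        # Size similarity is optional for these types (off by default).
        return (not check_size_inv_trans
                or size_similarity(query, db) >= perc_similarity)
    # Insertions, deletions, duplications: the database variant must lie
    # inside the query interval widened by win_indel on both sides.
    inside = (db.start >= query.start - win_indel
              and db.end <= query.end + win_indel)
    return inside and size_similarity(query, db) >= perc_similarity


query = SV("chr1", 350_000, 550_000, "deletion")
print(is_same_variant(query, SV("chr1", 345_000, 555_000, "deletion")))
print(is_same_variant(query, SV("chr1", 400_000, 450_000, "deletion")))
```

The first call matches (the candidate lies within chr1:340,000–560,000 and is of similar size); the second does not, because a 50 kbp deletion is only 25% of the size of the 200 kbp query.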

The percentage similarity and breakpoint window parameters are modifiable by the user. Size similarity is set with perc_similarity (DECIPHER and BNDB functions) or perc_similarity_DGV (DGV function). The breakpoint start and end error for insertions, deletions and duplications is set with win_indel (DECIPHER and BNDB functions) or win_indel_DGV (DGV function), and for inversions and translocations with win_inv_trans (DECIPHER and BNDB functions) or win_inv_trans_DGV (DGV function).

Variant size and confidence score (step 1.1b): Two additional criteria are implemented to select for high-quality variants in BNDB. Bionano’s SVcaller calculates a confidence score for insertions, deletions, inversions, and translocations. To calculate allele frequency, nanotatoR considers only BNDB variants with a confidence score above a threshold of 0.5 for insertions and deletions (indelconf), 0.01 for inversions (invconf) and 0.1 for translocations (transconf). These thresholds can be modified by the user. In addition, nanotatoR filters out SVs below 1 kbp in size to decrease the likelihood of false positive calls [20].
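A minimal Python sketch of this quality filter follows, using the default thresholds quoted above. The function name and its arguments are hypothetical. Three points are assumptions, since the text does not specify them: “above” is read as strictly greater than the threshold, types with no stated threshold (such as duplications) are not filtered on confidence, and the size check is skipped when no size is available.

```python
from typing import Optional

# Default confidence thresholds quoted in the text
# (indelconf, invconf, transconf).
MIN_CONFIDENCE = {
    "insertion": 0.5,
    "deletion": 0.5,
    "inversion": 0.01,
    "translocation": 0.1,
}
MIN_SIZE_BP = 1_000  # SVs below 1 kbp are filtered out


def passes_quality_filter(sv_type: str, confidence: Optional[float],
                          size_bp: Optional[int]) -> bool:
    # Assumption: skip the size check when no size is reported.
    if size_bp is not None and size_bp < MIN_SIZE_BP:
        return False
    threshold = MIN_CONFIDENCE.get(sv_type)
    if threshold is None:
        # Assumption: no threshold stated for this type, so keep it.
        return True
    # Assumption: "above" means strictly greater than the threshold.
    return confidence is not None and confidence > threshold


print(passes_quality_filter("deletion", 0.9, 5_000))   # passes both checks
print(passes_quality_filter("deletion", 0.9, 800))     # too small
print(passes_quality_filter("inversion", 0.005, 20_000))  # low confidence
```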

Zygosity (step 1.1c): Variants in BNDB are reported as homozygous, heterozygous or “unknown” (DGV and DECIPHER do not report zygosity). This is used to refine the frequency calculation for BNDB SVs: nanotatoR attributes an allele count of 2 to homozygous SVs and 1 to heterozygous SVs. Currently, nanotatoR counts 2 alleles for matching BNDB SVs of unknown zygosity, which may overestimate the frequency of the query variant. If the query SV matches multiple variants from the same BNDB sample, nanotatoR counts these as a single variant per sample, with an allele count of 2 for homozygous/unknown and 1 for heterozygous matches.
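The allele-counting rule can be sketched in Python as follows. The function name and the `(sample_id, zygosity)` input shape are hypothetical. The text does not say what happens when one sample contributes matches of mixed zygosity; this sketch assumes the largest per-sample count is kept.

```python
from typing import Iterable, Tuple


def count_alleles(matches: Iterable[Tuple[str, str]]) -> int:
    """Sum allele counts over samples for one query SV.

    `matches` holds one (sample_id, zygosity) pair per matched database
    variant; zygosity is "homozygous", "heterozygous" or "unknown".
    """
    per_sample = {}
    for sample_id, zygosity in matches:
        # Heterozygous counts 1; homozygous and unknown count 2.
        alleles = 1 if zygosity == "heterozygous" else 2
        # Several matches from one sample collapse into one entry.
        # Assumption: keep the largest count seen for that sample.
        per_sample[sample_id] = max(per_sample.get(sample_id, 0), alleles)
    return sum(per_sample.values())


matches = [
    ("sample1", "homozygous"),
    ("sample2", "heterozygous"),
    ("sample2", "heterozygous"),  # second match from the same sample
    ("sample3", "unknown"),
]
print(count_alleles(matches))  # 2 + 1 + 2 = 5
```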

Frequency calculations (step 1.1d): For DECIPHER and DGV, SV frequency is calculated by dividing the number of query-matched database variants (step 1.1a) by the total number of alleles in the database, i.e. 2x the number of samples, which are diploid, and multiplying by 100 to get a percentage frequency (Formula 1).
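Formula 1 is not reproduced in this text. Based on the description above, it can plausibly be written as follows, where the symbols $N_{\text{matched}}$ (number of query-matched database variants) and $N_{\text{samples}}$ (number of diploid samples in the database) are labels chosen here for illustration:

```latex
\text{Frequency (\%)} = \frac{N_{\text{matched}}}{2 \times N_{\text{samples}}} \times 100
```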

For BNDB, two types of frequency calculation are performed: filtered and unfiltered. For the filtered frequency, criteria 1.1a, 1.1b and 1.1c must all be met. For the unfiltered frequency, only criteria 1.1a and 1.1c are enforced. The resulting allele count is divided by the number of alleles in BNDB (currently 468 for 234 diploid samples) and multiplied by 100 to get a percentage (Formula 2).
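As a worked illustration of the filtered and unfiltered BNDB frequencies, here is a minimal Python sketch. The function names and the `(sample_id, zygosity, passes_quality)` input shape are hypothetical; `passes_quality` stands for the outcome of the size and confidence criteria (1.1b). As before, keeping the largest allele count per sample is an assumption.

```python
BNDB_SAMPLES = 234  # diploid samples, i.e. 468 alleles (value from the text)


def percent_frequency(allele_count: int, n_samples: int = BNDB_SAMPLES) -> float:
    # Allele count divided by total alleles (2 per sample), times 100.
    return allele_count / (2 * n_samples) * 100


def _count_alleles(records) -> int:
    per_sample = {}
    for sample_id, zygosity, _passes in records:
        alleles = 1 if zygosity == "heterozygous" else 2
        # Assumption: one entry per sample, keeping the largest count.
        per_sample[sample_id] = max(per_sample.get(sample_id, 0), alleles)
    return sum(per_sample.values())


def bndb_frequencies(matches, n_samples: int = BNDB_SAMPLES):
    """Return (filtered %, unfiltered %) for one query SV.

    `matches` are database variants that already satisfy the similarity
    criterion (1.1a), as (sample_id, zygosity, passes_quality) tuples.
    """
    filtered = _count_alleles([m for m in matches if m[2]])
    unfiltered = _count_alleles(matches)
    return (percent_frequency(filtered, n_samples),
            percent_frequency(unfiltered, n_samples))


matches = [
    ("sample1", "homozygous", True),
    ("sample2", "heterozygous", False),  # fails size/confidence criteria
    ("sample3", "unknown", True),
]
filtered, unfiltered = bndb_frequencies(matches)
print(round(filtered, 3), round(unfiltered, 3))
```

Here the filtered count is 4 alleles and the unfiltered count is 5, each divided by 468 alleles and multiplied by 100.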

Output: The output is appended to the original input file in individual columns. For DECIPHER, this consists of a single column termed “DECIPHER_Freq_Perc”. As DGV provides information on the number of samples in addition to frequency, nanotatoR prints two columns: “DGV_Count” (the total number of unique DGV samples containing variants matching the query SV) and “DGV_Freq_Perc” (the percentage calculated using Formula 1). For the BNDB, in addition to “BNG_Freq_Perc_Filtered” and “BNG_Freq_Perc_UnFiltered”, a third column reports “BNG_Homozygotes” (the number of homozygous variants that pass the filtration criteria).

The internal cohort analysis is designed to calculate variant frequency based on aggregation of SVs for samples run within an institution or laboratory, and provides parental zygosity information for inherited variants in familial cases. The function consists of two distinct parts:

Building the internal cohort database: Individual (solo) SMAP files for each of the samples are concatenated to build an internal database (buildSVInternalDB = TRUE), which is stored as a text file. This step creates a unique sample identifier (nanoID) from a user-provided key; the nanoID ensures that each sample ID is unique and encodes family relatedness. The nanoID is written as NR<Family #>.<Relationship #>. For example, the proband in a family of three (trio) would be denoted as NR23.1, with NR23 denoting the family ID and 1 denoting the proband. For the parents of this proband, the nanoID would be NR23.2 for the mother and NR23.3 for the father. Currently, only trio analyses are supported; future updates will include larger family analyses. If multiple projects exist within the same institution and are coded with project-specific identifiers, nanotatoR will prepend the project-specific identifier to the nanoID (e.g. Project1_NR23.1 and Project2_NR42.1).
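The nanoID layout described above can be illustrated with a small Python parser. This is a sketch of the naming convention only, not nanotatoR code; the `NanoID` class, `parse_nano_id` function and the regular expression are hypothetical, and the pattern assumes the project prefix is separated from the nanoID by an underscore, as in the examples.

```python
import re
from typing import NamedTuple, Optional

# Relationship codes given in the text: 1 proband, 2 mother, 3 father.
ROLES = {1: "proband", 2: "mother", 3: "father"}

# Optional "<project>_" prefix, then NR<family number>.<relationship number>.
_NANO_ID = re.compile(r"^(?:(?P<project>.+)_)?NR(?P<family>\d+)\.(?P<rel>\d+)$")


class NanoID(NamedTuple):
    project: Optional[str]
    family: int
    relationship: int

    @property
    def role(self) -> str:
        return ROLES.get(self.relationship, "unknown")


def parse_nano_id(text: str) -> NanoID:
    match = _NANO_ID.match(text)
    if match is None:
        raise ValueError(f"not a nanoID: {text!r}")
    return NanoID(match.group("project"),
                  int(match.group("family")),
                  int(match.group("rel")))


print(parse_nano_id("NR23.1").role)           # proband
print(parse_nano_id("Project1_NR23.3").role)  # father
```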

Calculating internal frequency and determining parental zygosity: For singleton analyses, the function internalFrequency_Solo (for both DLE labeling and SVmerge) calculates the internal database frequency of queried SVs based on the same principles explained in section 1.1d (Formula 2) for BNDB frequency calculations. However, additional filtration criteria are implemented to increase the accuracy of frequency estimation. SVs overlapping gaps in hg19/hg38 are annotated in the output SMAPs as “nbase” calls (e.g. “deletion_nbase”) and are likely to be false positives; nanotatoR filters out “nbase”-containing SVs when estimating internal frequency. For duplications, inversions, and translocations, nanotatoR evaluates whether chimeric scores “pass” the thresholds set by the Bionano SVcaller during de novo genome assembly [34], so that SVs that “fail” this criterion are eliminated from internal frequency calculations (Fail_BSPQI_assembly_chimeric_score = “pass” or Fail_BSSSI_assembly_chimeric_score = “pass” for SVmerge datasets, or Fail_assembly_chimeric_score = “pass” for a single-enzyme dataset). Lastly, nanotatoR checks whether the SVs were confirmed with the Bionano Variant Annotation Pipeline, which examines individual molecules for support of the identified SV [34] (Found_in_self_BSPQI_molecules = “yes” or Found_in_self_BSSSI_molecule = “yes” for SVmerge datasets, or Found_in_self_molecules = “yes” for a single-enzyme dataset).
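The three additional filters can be sketched in Python on a single SMAP row represented as a dictionary. The function name, the `svmerge` flag and the `"Type"` key for the SV type column are assumptions made for this illustration. The chimeric-score and molecule column names are copied from the text as given. Detecting duplications, inversions and translocations by substring match on the type is also an assumption.

```python
CHIMERIC_TYPES = ("duplication", "inversion", "translocation")


def passes_internal_filter(row: dict, svmerge: bool = False,
                           type_key: str = "Type") -> bool:
    sv_type = row[type_key]
    # 1. Drop calls overlapping reference gaps (e.g. "deletion_nbase").
    if "nbase" in sv_type:
        return False
    if svmerge:
        chimeric_cols = ("Fail_BSPQI_assembly_chimeric_score",
                         "Fail_BSSSI_assembly_chimeric_score")
        molecule_cols = ("Found_in_self_BSPQI_molecules",
                         "Found_in_self_BSSSI_molecule")
    else:
        chimeric_cols = ("Fail_assembly_chimeric_score",)
        molecule_cols = ("Found_in_self_molecules",)
    # 2. Duplications, inversions and translocations need a chimeric
    #    score of "pass" for at least one enzyme.
    if any(t in sv_type for t in CHIMERIC_TYPES):
        if not any(row.get(col) == "pass" for col in chimeric_cols):
            return False
    # 3. The SV must be supported by the sample's own molecules.
    return any(row.get(col) == "yes" for col in molecule_cols)


row = {"Type": "deletion_nbase", "Found_in_self_molecules": "yes"}
print(passes_internal_filter(row))  # excluded because of the nbase tag
```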

For family analyses (duos and trios), the internalFrequencyTrio_Duo function is used to identify parental/control sample zygosity based on the nanoID coding, using the criteria described in sections 1.1a and 1.1c (note that, here, the default size similarity percentage is ≥90%, as inherited variants are expected to be virtually identical). Zygosity information for the identified variants is extracted and appended into two separate columns (fatherZygosity and motherZygosity). This functionality is available for both SVmerge (merged outputs from 2 enzymes) and single-enzyme labeling. SVs with the same family ID as the query are excluded from the overall internal frequency calculation, which is otherwise performed as described in the previous paragraph. Five columns (“MotherZygosity”, “FatherZygosity”, “Internal_Freq_Perc_Filtered”, “Internal_Freq_Perc_Unfiltered”, and “Internal_Homozygotes”) are appended to each of the annotated input files (nonrelevant fields contain dashes).
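The family-aware step can be sketched in Python as follows: matches from the query's own family supply the parental zygosity columns, and only matches from other families remain available for the internal frequency. The function name and the `(nano_id, zygosity)` input shape are hypothetical. The sketch assumes that the matches have already passed the similarity criteria and that relationship codes 2 and 3 denote mother and father, as in the nanoID description.

```python
from typing import Dict, List, Tuple


def split_family_matches(query_nano_id: str,
                         matches: List[Tuple[str, str]]
                         ) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Separate parental matches from unrelated ones.

    Returns the parental zygosity columns (dash when not found) and the
    matches from other families, which feed the frequency calculation.
    """
    # Everything before the final "." identifies the family
    # (including any project prefix).
    family = query_nano_id.rpartition(".")[0]
    parental = {"MotherZygosity": "-", "FatherZygosity": "-"}
    unrelated = []
    for nano_id, zygosity in matches:
        match_family, _, relationship = nano_id.rpartition(".")
        if match_family != family:
            unrelated.append((nano_id, zygosity))
        elif relationship == "2":
            parental["MotherZygosity"] = zygosity
        elif relationship == "3":
            parental["FatherZygosity"] = zygosity
    return parental, unrelated


parental, unrelated = split_family_matches(
    "NR23.1",
    [("NR23.2", "heterozygous"), ("NR42.1", "homozygous")],
)
print(parental)
print(unrelated)
```

In this example the mother carries the variant in the heterozygous state, the father column keeps its dash, and only the NR42.1 match would count toward the internal frequency.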
