Reference SV dataset for real data

Shunichi Kosugi
Yukihide Momozawa
Xiaoxi Liu
Chikashi Terao
Michiaki Kubo
Yoichiro Kamatani

A reference SV dataset for NA12878 was generated by combining the DGV variant data (the 2016-05-15 version for GRCh37), obtained from the Database of Genomic Variants (http://dgv.tcag.ca/dgv/app/home), with the PacBio SV data identified from the NA12878 assembly generated with long reads [20]. The DGV data contained 1127 DELs (28% of the total DELs) shorter than 1 kb and 3730 INSs (79% of the total INSs) that were shorter than 1 kb or of undefined length. We removed these short DELs and INSs from the DGV data because the long read/assembly-based data contains more DELs (6550) and INSs (13,131) in these size ranges and is likely to be more reliable than the DGV data. We further removed DELs, DUPs, and INVs with ≥ 95% reciprocal overlap (≥ 90% reciprocal overlap for variants > 1 kb) in the DGV and long read/assembly data, which removed 450 variants in total. The two datasets were merged by removing the shorter of any overlapping DELs with ≥ 70% reciprocal overlap, resulting in the inclusion of 1671 DELs, 979 INSs, 2611 DUPs, and 233 INVs specific to the DGV SV data. Although many overlaps remained within this SV data, they were not removed because we were unable to judge which sites were inaccurately defined SVs. All SVs < 50 bp, except for INSs, were removed. In addition, a high-confidence NA12878 SV set (2676 DELs and 68 INSs) from the svclassify study [80], which has been deposited in GIAB (ftp://ftp-trace.ncbi.nlm.nih.gov//giab/ftp/technical/svclassify_Manuscript/Supplementary_Information), was merged, resulting in the inclusion of 248 DELs (7%) and 4 INSs (6%) as nonoverlapping variants. Furthermore, a nonredundant dataset of 72 experimentally verified INVs from the long-read studies [20, 81] and the InvFEST database (http://invfestdb.uab.cat) was merged, resulting in the inclusion of 41 unique INVs.
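The reciprocal-overlap criterion used in the merging steps can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the (chrom, start, end) tuple layout, the half-open coordinates, and the greedy longest-first strategy are all assumptions made for illustration. Only the 70% threshold and the rule of dropping the shorter of two overlapping DELs come from the text.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two overlap fractions for two intervals.

    Assumes half-open coordinates [start, end) with end > start.
    """
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def drop_shorter_overlapping(variants, threshold=0.7):
    """Keep the longer variant of any pair whose reciprocal overlap
    reaches the threshold.

    `variants` is a list of (chrom, start, end) tuples (hypothetical layout).
    Variants are visited longest first, so a shorter variant is dropped when
    it overlaps an already kept one on the same chromosome.
    """
    kept = []
    for chrom, start, end in sorted(variants, key=lambda v: v[2] - v[1], reverse=True):
        clashes = any(
            k_chrom == chrom
            and reciprocal_overlap(start, end, k_start, k_end) >= threshold
            for k_chrom, k_start, k_end in kept
        )
        if not clashes:
            kept.append((chrom, start, end))
    return sorted(kept)


dels = [("1", 100, 200), ("1", 110, 200), ("1", 500, 600), ("2", 100, 200)]
merged = drop_shorter_overlapping(dels, threshold=0.7)
```

Here the DEL at 110-200 overlaps the DEL at 100-200 with a reciprocal overlap of 0.9, so the shorter one is dropped. The pairwise comparison is quadratic in the number of variants; for genome-scale data an interval index or a sorted sweep would be the more practical choice.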
For the HG00514 SV reference, HG00514 variants of at least 30 bp were extracted from nstd152.GRCh37.variant_call.vcf.gz, which was obtained from the NCBI dbVar site (ftp://ftp-trace.ncbi.nlm.nih.gov//pub/dbVar/data/Homo_sapiens/by_study/vcf) (Additional file 1: Table S4). Variants of type “BND” were removed, and variants of type “CNV” were reassigned to both DEL and DUP as the SV type. For the HG002 SV reference, variants of at least 30 bp were extracted from HG002_SVs_Tier1_v0.6.vcf, which was obtained from the GIAB download site (ftp://ftp-trace.ncbi.nlm.nih.gov//giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6) (Additional file 1: Table S4).
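The filtering rules described for the HG00514 reference (size cutoff, removal of BND, reassignment of CNV to both DEL and DUP) might be sketched as follows. This is an illustration under assumptions, not the authors' procedure. The function name, the dictionary record layout, and the keys `id`, `svtype`, and `svlen` are hypothetical. The text also does not say how variant length was derived from each VCF record, so the sketch assumes a signed length is already available.

```python
def normalize_sv_records(records, min_len=30):
    """Apply the size and type rules to pre-parsed SV records.

    Each record is a dict with hypothetical keys "id", "svtype", "svlen".
    """
    out = []
    for rec in records:
        svtype = rec["svtype"]
        if svtype == "BND":
            continue  # breakend records are removed
        if abs(rec["svlen"]) < min_len:
            continue  # shorter than the minimum size
        if svtype == "CNV":
            # A CNV is reported as both a DEL and a DUP
            out.append({**rec, "svtype": "DEL"})
            out.append({**rec, "svtype": "DUP"})
        else:
            out.append(dict(rec))
    return out


records = [
    {"id": "a", "svtype": "DEL", "svlen": -500},
    {"id": "b", "svtype": "BND", "svlen": 0},
    {"id": "c", "svtype": "CNV", "svlen": 2000},
    {"id": "d", "svtype": "INS", "svlen": 20},
]
filtered = normalize_sv_records(records)
kept_types = [(r["id"], r["svtype"]) for r in filtered]
```

In this example the BND record and the 20 bp INS are dropped, and the CNV appears twice, once as DEL and once as DUP. The text states the BND and CNV handling only for HG00514; applying the same function to HG002 would rely on the size cutoff alone if that file contains neither type.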
