Published: Sep 20, 2024 DOI: 10.21769/BioProtoc.5068 Views: 217
Abstract
Phenotypic variation in most biological traits is largely driven by genomic variants, of which the single nucleotide polymorphism (SNP) is the most common form. Multiple algorithms have been developed for discovering genomic variants, including SNPs, from next-generation sequencing (NGS) data. Here, we present a widely used variant discovery pipeline based on the Genome Analysis Toolkit (GATK). The pipeline takes whole-genome sequencing (WGS) data as input and consists of three steps: read mapping, variant calling, and variant filtering. It has been successfully applied to many genomic projects and offers a practical solution for variant calling using NGS data.
Keywords: Variant discovery
Background
As sequencing technologies now routinely produce data in a high-throughput manner, genome-wide variant discovery has become routine for many applications, including genotyping variants to connect them with phenotypes. Single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) are two common forms of genomic variants. The SNP, as the simplest molecular marker, has been widely used to connect genetic variation with phenotypic variation of biological traits, such as human diseases and plant architecture [1,2]. Reference-based SNP calling is the process of identifying DNA polymorphisms between sequencing reads and a reference genome based on their alignments [3]. Many algorithms and software packages have been developed for SNP calling. Simple methods count the reads supporting each allele at a polymorphic site and apply a read-count cutoff to identify SNPs [4]. However, the power of such methods depends on sequencing depth. More sophisticated approaches, such as SOAP2, SAMtools, and the Genome Analysis Toolkit (GATK), incorporate the uncertainty of SNP calling in a probabilistic framework [5–7]. Among them, GATK is the industry standard for discovering SNPs and small indels from genomic DNA and RNA-seq data [7]. Although GATK was originally developed for human sequencing data, it has evolved to handle genome data from other organisms with different levels of ploidy [8]. Here, we present a GATK-based variant calling pipeline for whole-genome sequencing (WGS) data. The pipeline includes read mapping, variant calling, and variant filtering.
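To make the count-based idea concrete, the following is a minimal sketch of a read-count cutoff applied at a single site. It is an illustration only and is not how GATK calls variants. The function name, the two thresholds, and the representation of the pileup as a plain string of base calls are all assumptions made for this example.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def call_snp_by_counts(ref_base, pileup_bases, min_alt_reads=3, min_alt_fraction=0.2):
    """Return the alternate allele at one site if it passes both cutoffs, else None.

    ref_base      -- reference base at the site (single character)
    pileup_bases  -- string of base calls from the reads covering the site
    The cutoffs are hypothetical defaults chosen only for illustration.
    """
    ref_base = ref_base.upper()
    # Count only unambiguous base calls; anything else (e.g. N) is ignored.
    counts = Counter(b for b in pileup_bases.upper() if b in VALID_BASES)
    depth = sum(counts.values())
    alt_counts = {b: n for b, n in counts.items() if b != ref_base}
    if depth == 0 or not alt_counts:
        return None
    # Pick the best-supported non-reference base (ties broken alphabetically).
    alt_base, alt_n = max(alt_counts.items(), key=lambda item: (item[1], item[0]))
    if alt_n >= min_alt_reads and alt_n / depth >= min_alt_fraction:
        return alt_base
    return None


print(call_snp_by_counts("A", "AAAAGGGG"))  # G: four supporting reads, 50% of depth
print(call_snp_by_counts("A", "AAAAAAAG"))  # None: a single G read is below the cutoff
print(call_snp_by_counts("A", "GG"))        # None: every read is G, but depth is too low
```

The last call shows the depth dependence mentioned above: a site where every read disagrees with the reference is still missed when coverage is low, which is one motivation for probabilistic callers.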
Software and datasets
Software installation and data availability: All software installation instructions and test data have been deposited on GitHub: https://github.com/Bio-protocol/GATK-SNP-Calling
Software
All software packages have been tested on a Linux (CentOS 7, x86_64) operating system.
Anaconda (version 2024.02-1) (https://www.anaconda.com/download)
Trimmomatic (version 0.39, Java 1.8.0+) (http://www.usadellab.org/cms/?page=trimmomatic)
BWA (version 0.7.17) (https://github.com/lh3/bwa)
SAMtools (version 1.9) (http://www.htslib.org/)
GATK4 (version 3.8, Java 1.8.0+) (https://gatk.broadinstitute.org/)
vcftools (version 0.1.16) (http://vcftools.sourceforge.net/)
Perl (version 5) (https://perl.org)
Python (version 3) (https://python.org)
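Before running the pipeline, it can help to confirm that the command-line tools are reachable. The sketch below checks the PATH for a list of executables. The executable names are assumptions (typical names for conda-installed packages) rather than names taken from the protocol. GATK is left out because how it is launched depends on how it was installed.

```python
import shutil

# Executable names are assumptions, not taken from the protocol;
# adjust them to match your own installation.
REQUIRED_TOOLS = ["java", "trimmomatic", "bwa", "samtools", "vcftools", "perl", "python3"]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient. This check confirms presence only, not the installed versions.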
Sample data for GATK SNP Calling workflow include:
B73/A188.R1/2.fq.gz: Paired-end whole-genome sequencing reads (R1 and R2) for the two test samples, B73 and A188.
B73Ref4.fa: The maize B73 version 4 reference genome.
TruSeq3-PE.fa: The adaptor sequences for read trimming.
Sample reads and adaptor sequences are included in the “input” folder. The B73 version 4 reference genome can be downloaded from MaizeGDB (https://download.maizegdb.org/Zm-B73-REFERENCE-GRAMENE-4.0/Zm-B73-REFERENCE-GRAMENE-4.0.fa.gz).
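As one possible way to obtain the reference, the sketch below downloads the gzipped FASTA from the MaizeGDB URL given above and decompresses it using only the Python standard library. The function names are hypothetical, and saving the result as B73Ref4.fa to match the file name in the sample data list is an assumption. The genome is large, so the download is wrapped in a function and does not run automatically.

```python
import gzip
import shutil
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

REF_URL = (
    "https://download.maizegdb.org/Zm-B73-REFERENCE-GRAMENE-4.0/"
    "Zm-B73-REFERENCE-GRAMENE-4.0.fa.gz"
)


def download_name(url):
    """Return the file name component of a URL."""
    return Path(urlparse(url).path).name


def gunzip(src, dst):
    """Decompress the gzip file `src` into `dst`, streaming to limit memory use."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return Path(dst)


def fetch_reference(url=REF_URL, out_fasta="B73Ref4.fa"):
    """Download the gzipped reference and write the decompressed FASTA.

    The output name B73Ref4.fa is an assumption made to match the sample data list.
    """
    gz_path, _headers = urlretrieve(url, download_name(url))
    return gunzip(gz_path, out_fasta)
```

Calling `fetch_reference()` from the working directory would leave both the compressed download and the decompressed FASTA on disk; the compressed copy can be deleted afterwards if space is limited.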
Procedure
Category
Bioinformatics and Computational Biology
Systems Biology > Genomics > Screening
Plant Science > Plant molecular biology