GATK Variant Discovery Pipeline

Cheng He; Sanzhen Liu

doi:10.21769/BioProtoc.5068

Peer-reviewed

Published: Sep 20, 2024

Abstract

Phenotypic variation in most biological traits is largely driven by genomic variants, of which the single nucleotide polymorphism (SNP) is the most common form. Multiple algorithms have been developed to discover genomic variants, including SNPs, from next-generation sequencing (NGS) data. Here, we present a widely used variant discovery pipeline based on the Genome Analysis Toolkit (GATK). The pipeline takes whole-genome sequencing (WGS) data as input and consists of read mapping, variant calling, and variant filtering. It has been successfully applied to many genomic projects and provides a practical solution for variant calling from NGS data.

Keywords: Variant discovery, Single nucleotide polymorphism, Next-generation sequencing, Whole-genome sequencing, GATK

Background

As sequencing technology now routinely produces data in a high-throughput manner, genome-wide variant discovery has become routine for many applications, including genotyping variants to associate them with phenotypes. Single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) are two common forms of genomic variants. The SNP, as the simplest molecular marker, has been widely used to connect genetic variation with phenotypic variation in biological traits, such as human diseases and plant architecture [1,2]. Reference-based SNP calling is the process of identifying DNA polymorphisms between sequencing reads and a reference genome based on the read alignments [3]. Many algorithms and software packages have been developed for SNP calling. Simple methods count the reads at each polymorphic site and apply a read-count cutoff to identify SNPs [4]. However, the power of such methods depends on sequencing depth. More sophisticated approaches, such as SOAP2, SAMtools, and the Genome Analysis Toolkit (GATK), incorporate the uncertainty of SNP calling in a probabilistic framework [5–7]. Among them, GATK is the industry standard for discovering SNPs and small indels from genomic DNA sequencing and RNA-seq data [7]. Although GATK was originally developed for human sequencing data, it has evolved to handle genome data from other organisms with different levels of ploidy [8]. Here, we present a GATK-based variant calling pipeline that uses whole-genome sequencing (WGS) data. The pipeline includes read mapping, variant calling, and variant filtering.
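
The difference between a read-count cutoff and a probabilistic framework can be illustrated with a toy example. The sketch below is not the algorithm used by GATK, SAMtools, or SOAP2. It assumes a diploid sample, a single biallelic site, independent reads, and a uniform per-base error rate; the function names, the cutoff of 3 reads, and the error rate of 0.01 are all hypothetical choices made for illustration.

```python
import math


def call_by_count(ref_count, alt_count, min_alt_reads=3):
    """Count-cutoff caller: report a SNP when enough reads carry the alternate allele.

    ref_count is unused on purpose: this kind of caller ignores total depth.
    """
    return alt_count >= min_alt_reads


def genotype_log_likelihoods(ref_count, alt_count, error_rate=0.01):
    """Log10 likelihood of each diploid genotype under a simple binomial read model.

    The binomial coefficient is the same for all genotypes, so it is omitted.
    """
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must be strictly between 0 and 1")
    # Probability that one read shows the alternate allele, given the genotype.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    return {
        genotype: ref_count * math.log10(1.0 - p) + alt_count * math.log10(p)
        for genotype, p in p_alt.items()
    }


def call_by_likelihood(ref_count, alt_count, error_rate=0.01):
    """Return the genotype with the highest likelihood (toy model, no priors)."""
    likelihoods = genotype_log_likelihoods(ref_count, alt_count, error_rate)
    return max(likelihoods, key=likelihoods.get)


if __name__ == "__main__":
    # Low depth: two reads, both alternate. The cutoff misses the site.
    print(call_by_count(0, 2), call_by_likelihood(0, 2))
    # Higher depth: 3 alternate reads out of 33 pass the cutoff,
    # but the toy model still favors the homozygous reference genotype.
    print(call_by_count(30, 3), call_by_likelihood(30, 3))
```

Under these assumptions, the cutoff-based caller misses a site covered by only two alternate reads, whereas the likelihood-based caller still ranks the homozygous alternate genotype highest. This is the depth dependence described above.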

Software and datasets

Software installation and data availability: All software installation instructions and test data have been deposited to GitHub: https://github.com/Bio-protocol/GATK-SNP-Calling

Software

All software packages have been tested on a Linux (CentOS 7, x86_64) operating system.

Anaconda (version 2024.02-1) (https://www.anaconda.com/download)
Trimmomatic (version 0.39, Java 1.8.0+) (http://www.usadellab.org/cms/?page=trimmomatic)
BWA (version 0.7.17) (https://github.com/lh3/bwa)
SAMtools (version 1.9) (http://www.htslib.org/)
GATK4 (version 3.8, Java 1.8.0+) (https://gatk.broadinstitute.org/)
vcftools (version 0.1.16) (http://vcftools.sourceforge.net/)
Perl (version 5) (https://perl.org)
Python (version 3) (https://python.org)
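
Before running the pipeline, it can be useful to confirm that the listed tools are reachable from the shell. The following is a minimal sketch, assuming that each tool is exposed on PATH under the executable name given in `EXPECTED_TOOLS`. These names are assumptions (for example, `trimmomatic` and `gatk` are wrapper scripts that exist only in some installations) and should be adjusted to match the local setup described in the GitHub repository.

```python
import shutil

# Executable names are assumptions; some installations run Trimmomatic or GATK
# as "java -jar <file>.jar" instead of through a wrapper script on PATH.
EXPECTED_TOOLS = ["trimmomatic", "bwa", "samtools", "gatk", "vcftools", "perl", "python3"]


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the executable names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    for name, path in find_tools(EXPECTED_TOOLS).items():
        print(f"{name:12s} {path or 'NOT FOUND'}")
```

This check only confirms that an executable is present; it does not verify the installed version.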

Data sets

Sample data for the GATK SNP calling workflow include:
1. B73/A188.R1/2.fq.gz: Paired-end whole-genome sequencing reads for two test samples (B73 and A188).
2. B73Ref4.fa: The maize B73 version 4 reference genome.
3. TruSeq3-PE.fa: The adaptor sequences used for read trimming.
Sample reads and adaptor sequences are included in the “input” folder. The B73 version 4 reference genome can be downloaded from MaizeGDB (https://download.maizegdb.org/Zm-B73-REFERENCE-GRAMENE-4.0/Zm-B73-REFERENCE-GRAMENE-4.0.fa.gz).
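
A quick check that all input files are in place can save a failed run. The sketch below makes two assumptions that should be verified against the repository: that the shorthand B73/A188.R1/2.fq.gz expands to files named `<sample>.R1.fq.gz` and `<sample>.R2.fq.gz`, and that the downloaded reference has been decompressed, saved as B73Ref4.fa, and placed in the same folder as the reads. The function names are hypothetical.

```python
from pathlib import Path


def expected_inputs(input_dir, samples=("B73", "A188"),
                    reference="B73Ref4.fa", adaptors="TruSeq3-PE.fa"):
    """List the input files the workflow is assumed to need.

    The "<sample>.R<mate>.fq.gz" pattern is an assumption about how the
    shorthand B73/A188.R1/2.fq.gz expands into individual file names.
    """
    input_dir = Path(input_dir)
    files = [input_dir / f"{sample}.R{mate}.fq.gz"
             for sample in samples for mate in (1, 2)]
    files += [input_dir / reference, input_dir / adaptors]
    return files


def missing_inputs(input_dir, **kwargs):
    """Return the expected input files that do not exist on disk."""
    return [path for path in expected_inputs(input_dir, **kwargs)
            if not path.is_file()]


if __name__ == "__main__":
    for path in missing_inputs("input"):
        print(f"missing: {path}")
```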

Procedure