Variant calling

Claudio Urra
Dayan Sanhueza
Catalina Pavez
Patricio Tapia
Gerardo Núñez-Lillo
Andrea Minio
Matthieu Miossec
Francisca Blanco-Herrera
Felipe Gainza
Alvaro Castro
Dario Cantu
Claudio Meneses

The raw sequences were quality-checked with FastQC v0.11.7 (Andrews 2010), and coverage was then standardized to 20×. To do this, seqtk v1.3-r106 (https://github.com/lh3/seqtk) was used to keep 137,372,000 reads from each clone genome in CH, 119,020,000 in SB, 124,600,000 in CS, and 103,685,230 in M clones. Reads were trimmed with Trim Galore v0.5.0 using a PHRED quality threshold of Q > 25 (Krueger 2012). Each clone genome was then mapped to the primary assembly of its cultivar's genome with bwa-mem v0.7.17-r1188 (Li et al. 2008). Before variant calling, the mapped reads were sorted with Samtools v1.9 (Li et al. 2009) and prepared with Picard-tools v2.16.1 using the AddOrReplaceReadGroups, MarkDuplicates, and CleanSam commands (https://broadinstitute.github.io/picard/).
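
The preprocessing steps above can be sketched as a list of command lines. This is only an illustration, not the authors' actual script: the protocol does not give the exact invocations, so paired-end input, the file names, the seqtk seed, the read-group values (including `RGPL=ILLUMINA`), and the `picard.jar` path are all assumptions. It is also not stated whether the reported read counts apply per file or per pair. The sketch only builds the commands; it does not run them.

```python
from typing import List, Optional, Tuple

# Reads kept per clone genome, as reported in the text (keyed by cultivar code).
READS_PER_CULTIVAR = {
    "CH": 137_372_000,
    "SB": 119_020_000,
    "CS": 124_600_000,
    "M": 103_685_230,
}

# A command is (argv, file that receives stdout, or None if the tool writes its own output).
Command = Tuple[List[str], Optional[str]]


def preprocessing_commands(
    sample: str,
    cultivar: str,
    reference: str,
    r1: str,
    r2: str,
    seed: int = 100,                 # assumption: any fixed seed; the same one keeps pairs in sync
    picard_jar: str = "picard.jar",  # assumption: location of the Picard jar
) -> List[Command]:
    """Build (but do not run) the subsample, trim, map, sort, and Picard commands."""
    n = str(READS_PER_CULTIVAR[cultivar])
    sub1, sub2 = f"{sample}_sub_R1.fq", f"{sample}_sub_R2.fq"
    # Trim Galore's default names for validated paired-end output.
    trim1, trim2 = f"{sample}_sub_R1_val_1.fq", f"{sample}_sub_R2_val_2.fq"
    sam, sorted_bam = f"{sample}.sam", f"{sample}.sorted.bam"
    rg_bam, dedup_bam, clean_bam = f"{sample}.rg.bam", f"{sample}.dedup.bam", f"{sample}.clean.bam"
    picard = ["java", "-jar", picard_jar]

    return [
        # Subsample both mates with the same seed so the pairs stay matched.
        (["seqtk", "sample", f"-s{seed}", r1, n], sub1),
        (["seqtk", "sample", f"-s{seed}", r2, n], sub2),
        # Quality trimming at the threshold of 25 given in the text.
        (["trim_galore", "--quality", "25", "--paired", sub1, sub2], None),
        # Map to the cultivar's primary assembly; bwa mem writes SAM to stdout.
        (["bwa", "mem", reference, trim1, trim2], sam),
        (["samtools", "sort", "-o", sorted_bam, sam], None),
        # Read-group values below are placeholders, not values from the protocol.
        (picard + ["AddOrReplaceReadGroups", f"I={sorted_bam}", f"O={rg_bam}",
                   f"RGID={sample}", "RGLB=lib1", "RGPL=ILLUMINA",
                   "RGPU=unit1", f"RGSM={sample}"], None),
        (picard + ["MarkDuplicates", f"I={rg_bam}", f"O={dedup_bam}",
                   f"M={sample}.dup_metrics.txt"], None),
        (picard + ["CleanSam", f"I={dedup_bam}", f"O={clean_bam}"], None),
    ]
```

Each entry can be passed to `subprocess.run(argv, stdout=open(path, "w"))` when a stdout file is given, or to `subprocess.run(argv)` otherwise. The reference must be indexed with `bwa index` beforehand.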

We used GATK HaplotypeCaller v4.0.9.0 (McKenna et al. 2010) to call variants in each clone genome against the primary assembly of its cultivar. For SB and CH clones, these were the primary assemblies of Zhou et al. (2019); for CS, the primary assembly described by Chin et al. (2016); and for M clones, the primary assembly described by Massonnet et al. (2020). Two variant calling protocols were used: the first on each sample individually, and the second with a joint genotyping step combining all samples, following the GATK best practices (available at https://gatk.broadinstitute.org). A variant quality filter of Q > 100 was applied in both protocols. The global distribution of variants detected in all clones was evaluated with a Circos plot (Krzywinski et al. 2009). Variant and gene densities were calculated in 100-kbp windows for plotting. Only variants consistently present in each clone's replicates were used for principal component analysis (PCA). To identify clone-specific variants, we extracted variants that were present in all replicates of a clone and absent from all other samples.
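
The filtering and selection logic described above reduces to a few set operations. The following is a minimal sketch, not the authors' implementation: it assumes each variant has already been parsed from a VCF into a `(chrom, pos, ref, alt, qual)` tuple, that the Q > 100 filter refers to the VCF QUAL field, and that all samples being compared were called against the same reference. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Variant = Tuple[str, int, str, str]             # (chrom, pos, ref, alt)
Record = Tuple[str, int, str, str, float]       # Variant plus its quality score


def passing_variants(records: Iterable[Record], min_qual: float = 100.0) -> Set[Variant]:
    """Keep variants whose quality is strictly above the threshold (Q > 100)."""
    return {(c, p, r, a) for c, p, r, a, q in records if q > min_qual}


def consistent_variants(replicates: List[Set[Variant]]) -> Set[Variant]:
    """Variants present in every replicate of one clone."""
    return set.intersection(*replicates) if replicates else set()


def clone_specific(clones: Dict[str, List[Set[Variant]]], clone: str) -> Set[Variant]:
    """Variants in all replicates of `clone` and absent from every other sample."""
    shared = consistent_variants(clones[clone])
    others: Set[Variant] = set()
    for name, replicates in clones.items():
        if name != clone:
            others.update(*replicates)          # union over all other samples
    return shared - others


def window_counts(variants: Iterable[Variant], window: int = 100_000) -> Counter:
    """Count variants per (chrom, window index); VCF positions are 1-based."""
    return Counter((chrom, (pos - 1) // window) for chrom, pos, _, _ in variants)
```

Here `clones` maps a clone name to one variant set per replicate. Taking the intersection first and then subtracting the union of all other samples mirrors the "present in all replicates, absent elsewhere" criterion in the text.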

PCA plots were generated in R v3.5.3 with the R packages factoextra v1.0.5 and FactoMineR v1.4.1. The functional effects of the variants were predicted with SnpEff v4.3t (Cingolani et al. 2012).
