Publicly available human WES raw data are retrieved from SRA by entering the following query in the SRA search bar: ‘((((illumina[Platform]) AND homo sapiens[Organism]) AND WXS[Strategy]) AND “Homo sapiens”[orgn:__txid9606] AND cluster_public[prop] AND “biomol dna”[Properties])’. The resulting table lists the SRA download accessions together with the corresponding BioProject and BioSample IDs. Raw sequencing data are downloaded from SRA with the ‘prefetch’ command of sratoolkit.2.11.0 and then extracted with the ‘fastq-dump’ command, adding the ‘--split-files’ option for paired-end data. Raw reads are cleaned with Trimmomatic-0.39 (12) and aligned to hg38 (UCSC version) with Picard and BWA-MEM (13), following the GATK 4.2.2.0 pipeline (14); VCF files are then generated with the ‘HaplotypeCaller’ function. Parallel hg19 VCFs are generated with Picard’s LiftOverVcf function and UCSC’s hg38tohg19 chain file. Variant effects are annotated with SnpEff 5.0e (15). The pipeline runs on our institutional high-performance computing infrastructure at Ben-Gurion University of the Negev. Output VCF files are uploaded in gzipped format to the AWS S3 bucket.
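The download and extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' actual script: the helper name `build_download_commands`, the output directory `fastq`, and the accession `SRR0000000` are hypothetical placeholders, and the sketch assumes the SRA Toolkit executables ‘prefetch’ and ‘fastq-dump’ are available on the PATH. Only the ‘--split-files’ behaviour for paired-end runs is taken from the text.

```python
import subprocess
from typing import List


def build_download_commands(accession: str, paired_end: bool,
                            out_dir: str = "fastq") -> List[List[str]]:
    """Return the argument lists for 'prefetch' and 'fastq-dump'.

    'accession' is an SRA run accession taken from the search result table.
    'out_dir' is a hypothetical output directory chosen for this sketch.
    """
    prefetch_cmd = ["prefetch", accession]
    dump_cmd = ["fastq-dump", "--outdir", out_dir]
    if paired_end:
        # Write forward and reverse reads to separate FASTQ files.
        dump_cmd.append("--split-files")
    dump_cmd.append(accession)
    return [prefetch_cmd, dump_cmd]


def run_commands(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if the tool exits non-zero.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "SRR0000000" is a placeholder, not a real accession from the study.
    run_commands(build_download_commands("SRR0000000", paired_end=True))
```

Building the commands separately from running them allows a dry run over the whole accession table before any data are transferred.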