Kraken TCGA microbial-detection pipeline

GP Gregory D Poore
EK Evguenia Kopylova
QZ Qiyun Zhu
CC Carolina Carpenter
SF Serena Fraraccio
SW Stephen Wandro
TK Tomasz Kosciolek
SJ Stefan Janssen
JM Jessica Metcalf
SS Se Jin Song
JK Jad Kanbar
SM Sandrine Miller-Montgomery
RH Robert Heaton
RM Rana Mckay
SP Sandip Pravin Patel
AS Austin D Swafford
RK Rob Knight

The Seven Bridges Cancer Genomics Cloud (CGC) interface enabled rapid development of the bioinformatic pipeline for this project while ensuring its future reproducibility [51]. Bioinformatic tools were either loaded directly from the CGC platform (e.g. samtools, BWA) or uploaded and run as separate Docker containers (e.g. QIIME, Kraken) to create customized ‘app’ workflows. These workflows took sample BAM files as inputs and labeled which DNA or RNA reads within each sample were “microbial”; they can be shared publicly for reproducibility purposes, as needed. The computational analyses were hosted on Amazon Web Services (AWS; https://aws.amazon.com/) through the CGC interface and most often used AWS’s “x1.16xlarge” EC2 compute instance, which has the following specifications: 64 vCPU, 174.5 ECU, 976 GB of memory, and 1920 GB of instance storage. The computational wall-time was approximately 6 months using these specifications.

Sequencing reads that did not align to known human reference genomes (based on mapping information in the raw BAM files) were mapped against all known bacterial, archaeal, and viral microbial genomes using the ultrafast Kraken algorithm [23]. A total of 71,782 microbial genomes were downloaded using RepoPhlan (https://bitbucket.org/nsegata/repophlan) on 14 June 2016, of which 5,503 were viral and 66,279 were bacterial or archaeal. Based on prior literature, bacterial and archaeal genomes were filtered for quality scores of 0.8 or better [58], which left 54,471 of them for subsequent analysis; together with the 5,503 viral genomes, this gave a total of 59,974 microbial genomes.
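The selection of non-human reads "based on mapping information in the raw BAM files" can be illustrated with a minimal sketch. It assumes the selection relies on the standard SAM "unmapped" flag bit (0x4), which the protocol does not state explicitly, and it works on text SAM lines rather than binary BAM so that it stays self-contained. The function names and the toy records are hypothetical. In practice a tool such as samtools (e.g. `samtools view -f 4`) would do this directly on BAM files.

```python
# SAM flag bit 0x4 marks a read as unmapped (SAM specification).
UNMAPPED_FLAG = 0x4


def is_unmapped(sam_line: str) -> bool:
    """Return True if a SAM alignment line has the 'unmapped' flag bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2 of a SAM record is the bitwise FLAG
    return bool(flag & UNMAPPED_FLAG)


def unmapped_reads(sam_lines):
    """Yield (read_name, sequence) for unmapped records, skipping '@' headers."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        if is_unmapped(line):
            fields = line.rstrip("\n").split("\t")
            yield fields[0], fields[9]  # QNAME and SEQ columns


# Hypothetical toy records (not real TCGA data): one mapped, one unmapped.
EXAMPLE_SAM = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGACCGA\tIIIIIIII",
]

if __name__ == "__main__":
    for name, seq in unmapped_reads(EXAMPLE_SAM):
        print(name, seq)
```

Only the reads passed on by a filter of this kind would then be handed to the microbial classifier; the mapped human reads are left out of the downstream steps.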

As previously described in detail [23], the Kraken algorithm breaks each sequencing read into k-mers (we used the default 31-mers) and exactly matches each k-mer against a database of microbial k-mers, which was built from the 59,974 microbial genomes described above before running the algorithm. The set of exact k-mer matches for a given read, in turn, yields a putative taxonomic assignment for that read at the lowest common ancestor of the matched taxa. These assignments are most accurate at the genus level, so we summarized our data at that level. The matching and classification operations are orders of magnitude faster than performing direct genome alignments.

As a safeguard against false positives and to properly benchmark our pipeline, we took four cancer types (COAD, CESC, OV, LUAD) and aligned the reads Kraken classified as “microbial” to the 59,974 microbial genomes using BWA [50], which is computationally more expensive but yields a result with higher specificity and taxonomic resolution (i.e. to species and strain level). The four cancer types that were directly aligned included CESC as a putative positive viral control (i.e. HPV), STAD as a putative positive bacterial control (i.e. H. pylori), and two others (LUAD, OV) based on microbial signatures in the literature and/or available mass-spectrometry proteomic information (to look for microbial proteins; data not shown) [5,24,59–61]. We found that 98.91% of reads classified to genus level or lower by Kraken (on which our main findings are based) also aligned with BWA to the microbial database (bacteria, archaea, viruses; see Table S3), corresponding to a false-positive rate of 1.09%. This suggests that the genus-level, Kraken-labeled, pan-cancer microbial reads were sufficiently usable for further analyses.
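The k-mer matching and lowest-common-ancestor logic described above can be sketched with a toy classifier. This is a simplified illustration, not Kraken itself. It assumes a tiny hand-written taxonomy, made-up genome sequences, and k = 5 instead of the default 31, and it ignores reverse complements and Kraken's compact database format. All names (`PARENT`, `RANK`, `GENOMES`, `build_db`, `classify`, `to_genus`) are hypothetical.

```python
from collections import Counter
from functools import reduce

K = 5  # toy value for readability; the protocol used Kraken's default of 31

# Hypothetical miniature taxonomy: child -> parent.
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Helicobacter": "Bacteria",
    "H. pylori": "Helicobacter",
    "H. hepaticus": "Helicobacter",
    "Escherichia": "Bacteria",
    "E. coli": "Escherichia",
}
RANK = {
    "root": "root", "Bacteria": "domain",
    "Helicobacter": "genus", "Escherichia": "genus",
    "H. pylori": "species", "H. hepaticus": "species", "E. coli": "species",
}
# Made-up sequences standing in for reference genomes.
GENOMES = {
    "H. pylori": "ACGTACGGTTCA",
    "H. hepaticus": "ACGTACGCCATG",
    "E. coli": "TTGGCCAATCGA",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, starting at the taxon."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    return next(node for node in lineage(b) if node in ancestors)


def kmers(seq, k=K):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_db(genomes, k=K):
    """Map each k-mer to the LCA of every genome that contains it."""
    db = {}
    for taxon, seq in genomes.items():
        for km in kmers(seq, k):
            db[km] = lca(db[km], taxon) if km in db else taxon
    return db


def classify(read, db, k=K):
    """Assign a read to the hit taxon with the best root-to-taxon hit count.

    Ties are resolved by taking the LCA of the tied taxa; reads with no
    matching k-mer stay unclassified (None).
    """
    hits = Counter(db[km] for km in kmers(read, k) if km in db)
    if not hits:
        return None
    scores = {t: sum(hits[node] for node in lineage(t)) for t in hits}
    best = max(scores.values())
    tied = [t for t, score in scores.items() if score == best]
    return reduce(lca, tied)


def to_genus(taxon):
    """Summarize an assignment at genus level; None if above genus."""
    if taxon is None:
        return None
    return next((n for n in lineage(taxon) if RANK[n] == "genus"), None)


if __name__ == "__main__":
    db = build_db(GENOMES)
    for read in ["GTACGGTTC", "ACGTACG", "CGGTTGCCAT", "GGGGGGGG"]:
        assignment = classify(read, db)
        print(read, assignment, to_genus(assignment))
```

In this toy setup, a read whose k-mers are shared by both Helicobacter genomes can only be placed at the genus, while a read carrying species-specific k-mers is placed at the species and then rolled up to its genus for summarization. Because each lookup is an exact dictionary match rather than an alignment, the per-read cost stays small, which is the intuition behind the speed advantage described above.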
