Splitting BAM reads into subclones and spiking-in mutations

Adriana Salcedo; Maxime Tarabichi; Shadrielle Melijah G. Espiritu; Amit G. Deshwar; Matei David; Nathan M. Wilson; Stefan Dentro; Jeff A. Wintersinger; Lydia Y. Liu; Minjeong Ko; Srinivasan Sivanandan; Hongjiu Zhang; Kaiyi Zhu; Tai-Hsien Ou Yang; John M. Chilton; Alex Buchanan; Christopher M. Lalansingh; Christine P'ng; Catalina V. Anghel; Imaad Umar; Bryan Lo; William Zou; Jared T. Simpson; Joshua M. Stuart; Dimitris Anastassiou; Yuanfang Guan; Adam D. Ewing; Kyle Ellrott; David C. Wedge; Quaid D. Morris; Peter Van Loo; Paul C. Boutros


Concise Method

This method is extracted from research article: Nat Biotechnol, Jan 2020

A community effort to create standards for evaluating tumor subclonal reconstruction

DOI: 10.1038/s41587-019-0364-z

Read splitting at nodes occurs in a pseudo-random manner using a windowed approach. For each node, let w be the window size in read pairs (set to 1,000) and p be the proportion of reads to extract. BAM files are sorted by coordinate using SAMtools sort. For every w read pairs, ordered by the coordinate of the first read of each pair, exactly floor(w × p) read pairs are chosen at random and retained. Compared with global resampling to the target coverage per node (i.e., setting the window size to the total number of reads aligning to the chromosome), this local sampling yields less variable coverage across the final chromosome. All extracted reads are merged using Picard tools, first by phase, then by chromosome, and finally into the tumor BAM. The merged BAM file is then sorted by coordinate, so that the sub-BAM from which a read originated can no longer be identified.
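The windowed sampling described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's own code: it assumes the read pairs are already held in a list ordered by coordinate, it omits all BAM input and output, and the function name `windowed_sample`, the `seed` argument and the proportional handling of a final partial window are assumptions made for this sketch.

```python
import math
import random


def windowed_sample(read_pairs, p, w=1000, seed=0):
    """Keep floor(w * p) read pairs from each consecutive window of w pairs.

    read_pairs: sequence of read pairs, already ordered by coordinate.
    p: proportion of read pairs to extract for this node.
    w: window size in read pairs (1,000 in the text).
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    per_window = math.floor(w * p)
    kept = []
    for start in range(0, len(read_pairs), w):
        window = read_pairs[start:start + w]
        # Assumption: a final, shorter window is sampled proportionally.
        if len(window) == w:
            n_keep = per_window
        else:
            n_keep = math.floor(len(window) * p)
        # Sample positions, then sort them so coordinate order is preserved.
        chosen = sorted(rng.sample(range(len(window)), n_keep))
        kept.extend(window[i] for i in chosen)
    return kept
```

Because every full window contributes the same number of pairs, the retained depth cannot drift far from the target within any stretch of w pairs, which is the property the text contrasts with global resampling.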

To complete the final tumor BAM, we further normalize each chromosome phase relative to all other phases, based on their individual total fractional copies. For each phase of each chromosome, let p_i be the cellular prevalence and c_i the number of copies at the i-th leaf node. Then C_chr,phase = sum_i (p_i × c_i) represents the total fractional copies. Take M to be the maximum of all CNAs, including tandem duplications, across chromosomes, and set this value as the 100% copy proportion. Each leaf node is down-sampled by taking C_chr,phase / M of the read pool assigned to it. Read pools are adjusted using a bottom-up approach: at each internal node, the cellular copies of its children are summed and the read pool proportions are adjusted (Figure 3).

designatePortions(config) {
    if leaf node:
        return p_i * c_i / C_chr,phase
    else:
        quantity = {}
        quantity_sum = 0
        for each child:
            quantity[child] = designatePortions(config->child)
            quantity_sum += quantity[child]
        for each child:
            config->child->read_proportion = quantity[child] / quantity_sum
        return quantity_sum
}
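For readers who prefer something executable, the bottom-up adjustment can be written as the following Python sketch. It is illustrative only: the `Node` class, its field names and the helper functions are hypothetical, and the value returned from an internal node is taken here to be the sum of its children's quantities so that the recursion can continue up the tree.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    prevalence: float = 0.0       # p_i, meaningful at leaf nodes only
    copies: float = 0.0           # c_i, meaningful at leaf nodes only
    children: List["Node"] = field(default_factory=list)
    read_proportion: float = 1.0  # share of the parent's read pool


def leaves(node):
    """Return all leaf nodes below (or equal to) node."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def total_fractional_copies(root):
    """C_chr,phase = sum_i (p_i * c_i) over the leaf nodes."""
    return sum(leaf.prevalence * leaf.copies for leaf in leaves(root))


def designate_portions(node, c_total):
    """Set read_proportion on every child, working from the leaves up."""
    if not node.children:
        return node.prevalence * node.copies / c_total
    quantities = [designate_portions(child, c_total) for child in node.children]
    quantity_sum = sum(quantities)
    for child, quantity in zip(node.children, quantities):
        child.read_proportion = quantity / quantity_sum
    # Assumption of this sketch: an internal node passes up its children's sum.
    return quantity_sum


def example_tree():
    """Hypothetical three-leaf tree for one phase of one chromosome."""
    b = Node("B", children=[Node("B1", 0.3, 2), Node("B2", 0.2, 1)])
    return Node("root", children=[Node("A", 0.5, 1), b])
```

In the hypothetical tree, C_chr,phase = 0.5 × 1 + 0.3 × 2 + 0.2 × 1 = 1.3, so calling `designate_portions(tree, 1.3)` splits node B's pool 0.75 to 0.25 between B1 and B2, and the root's pool roughly 0.385 to 0.615 between A and B. The proportions at each internal node sum to one by construction.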

If tandem duplications are present, reads that are not incorporated in a node (surplus reads) are down-sampled similarly to provide donor BAMs at the right depth. Surplus reads are down-sampled in proportion to their depth-adjusted copy number for a given node, starting with the highest copy number duplications, to yield the maximum-depth donor BAM for each node. If lower copy number duplications exist, these donor BAMs are subsequently down-sampled again, in proportion to copy number, to yield the lower copy number donor BAMs.
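One possible reading of this cascade is that each lower copy number donor BAM is drawn from the next higher one, keeping a fraction equal to the ratio of the two copy numbers. The sketch below only computes those fractions. The function name `successive_fractions` and the `top_fraction` argument (the fraction of surplus reads kept for the highest copy number duplication) are hypothetical, and the ratio rule is an assumption rather than a detail stated in the text.

```python
def successive_fractions(copy_numbers, top_fraction):
    """Return (copy_number, fraction_of_previous_pool) steps, highest first.

    copy_numbers: duplication copy numbers present at a node.
    top_fraction: assumed fraction of the surplus reads kept for the
        highest copy number duplication (depends on depth, not derived here).
    """
    ordered = sorted(set(copy_numbers), reverse=True)
    if not ordered:
        return []
    steps = [(ordered[0], top_fraction)]
    for previous, current in zip(ordered, ordered[1:]):
        # Assumption: down-sample the previous donor pool by the copy ratio.
        steps.append((current, current / previous))
    return steps
```

For duplications with copy numbers 4, 2 and 1 and a `top_fraction` of 0.8, this gives 0.8 of the surplus reads for the first donor BAM, then half of that pool for the second, then half again for the third.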

After calculating the per-phase, per-chromosome read pools, BAMSurgeon spikes a set number of SNVs, indels and SVs into the appropriate read pool before the pools are merged into the final BAM. In Supplementary Note 2, we describe how we spike in mutations compatible with replication timing, pre-defined trinucleotide context spectra and selection.

Altogether, using this approach, we achieved a median accuracy of 90.6%, a median false-positive rate of 4.5% and a median false-negative rate of 5.92% for the five tumors reported, after calling SNVs with MuTect prior to down-sampling.

Users may view, print, copy, and download text and data-mine the content in such documents, for the purposes of academic research, subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms
