Data

Nicole E. Mealey
Dylan E. O’Sullivan
Joy Pader
Yibing Ruan
Edwin Wang
May Lynn Quan
Darren R. Brenner

Both clinical and genomic data were obtained from The Cancer Genome Atlas (TCGA) Breast Invasive Carcinoma project (TCGA-BRCA; dbGaP study accession phs000178) [5]. Clinical data files and simple somatic mutation files in Variant Call Format (VCF) were obtained for all 1044 cases for which simple somatic mutation data were available. Whole-exome sequencing (WES) had a minimum of 70% coverage at 20x depth [22]. VCF files were based on WES and produced via the MuTect2 workflow [23]. These files were downloaded using the Genomic Data Commons Data Transfer Tool [24] on June 26, 2018. PAM50 tumour subtypes were retrieved from cBioPortal [22] on January 17, 2019 and from Supplementary Table 1 of Ciriello et al. [25].

Clinical files were sorted by patient age and then matched with the mutation files from the same patient. An age cut-off of 40 years was used to divide patients into young and older age groups, as used previously in the literature [2, 20]. For a series of analyses, we also compared patients 40 years of age and younger with patients over 60 years of age. Cases were included only if both somatic mutation data and clinical data, including the patient’s age, were available. For tumour type analyses, cases without PAM50 tumour type data were excluded. TCGA included some breast cancer patients with multiple VCF files, indicating that more than one tumour sample was submitted for the same patient, possibly from different parts of the tumour, which may be heterogeneous. For these patients, analyses included only the first VCF file listed in the downloaded directory. Somatic mutations in the VCF files were filtered to remove insertions and deletions, as well as any variants that did not pass all quality filters applied by the MuTect2 algorithm, leaving only high-quality single nucleotide variants (SNVs).
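The age grouping and variant filtering described above can be illustrated with a minimal Python sketch. It is not the code used in the study. The function names are hypothetical, and the sketch assumes that the young group includes patients aged exactly 40, that a variant passing all MuTect2 filters carries `PASS` in the VCF FILTER column, and that an SNV is a record whose REF and ALT alleles are each a single base.

```python
from typing import Iterable, Iterator, Optional

YOUNG_MAX_AGE = 40     # "40 years of age and younger"
OLDER_MIN_AGE = 60     # secondary comparison: strictly over 60
BASES = {"A", "C", "G", "T"}


def age_group(age: float) -> str:
    """Primary split (assumed boundary): 'young' is <= 40, 'older' is > 40."""
    return "young" if age <= YOUNG_MAX_AGE else "older"


def extreme_age_group(age: float) -> Optional[str]:
    """Secondary split: <= 40 versus > 60; ages in between return None."""
    if age <= YOUNG_MAX_AGE:
        return "young"
    if age > OLDER_MIN_AGE:
        return "over_60"
    return None


def is_pass_snv(vcf_line: str) -> bool:
    """True if a VCF data line is a PASS record with single-base REF and ALT."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:          # a VCF data line has at least 8 fixed columns
        return False
    ref, alt, filt = fields[3], fields[4], fields[6]
    if filt != "PASS":           # drop variants that failed any filter
        return False
    # Drop insertions and deletions: every allele must be exactly one base.
    return ref.upper() in BASES and all(a.upper() in BASES for a in alt.split(","))


def filter_vcf_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the data lines that are high-quality SNVs; skip headers."""
    for line in lines:
        if line.startswith("#"):
            continue
        if is_pass_snv(line):
            yield line
```

For a patient with several VCF files, the sketch would be applied only to the first file listed, in keeping with the selection rule stated above.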

A sufficiently high number of mutations is particularly important for identifying flat mutational signatures [26]. Samples with a low number of mutations tend to have a higher sum of squared errors of prediction between the original and calculated mutational spectra [26]. A low number of mutations may also lead to overfitting and the identification of spurious signatures. Therefore, cases were excluded from the mutation type and mutational signature analyses if they contained too few mutations, defined in this study as fewer than 40 mutations [27]. These cases were retained for the analyses of mutated genes and mutational load.
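As an illustration of the exclusion rule, the sketch below splits cases by their number of filtered SNVs. It is a hypothetical example, not the study's code: the function name and the case identifiers are invented, and the input is assumed to be a mapping from case ID to SNV count.

```python
from typing import Dict, List, Tuple

MIN_MUTATIONS = 40  # cases with fewer than 40 mutations are excluded


def split_by_mutation_count(
    counts: Dict[str, int], minimum: int = MIN_MUTATIONS
) -> Tuple[Dict[str, int], List[str]]:
    """Return (cases eligible for signature analyses, excluded case IDs).

    Excluded cases are only left out of the mutation type and mutational
    signature analyses; they remain available for other analyses.
    """
    eligible = {case: n for case, n in counts.items() if n >= minimum}
    excluded = sorted(case for case in counts if case not in eligible)
    return eligible, excluded


# Hypothetical case IDs and counts, for illustration only.
example_counts = {"case_A": 12, "case_B": 40, "case_C": 153}
eligible_cases, excluded_cases = split_by_mutation_count(example_counts)
```

A case with exactly 40 mutations is kept, because the threshold is defined as fewer than 40.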
