Bioinformatic processing

Ann Marie K. Weideman
Rujin Wang
Joseph G. Ibrahim
Yuchao Jiang

The SNV calling pipeline used to pre-process the breast cancer (Table S1) and glioblastoma (Table S2) data is illustrated in Figure S1. Master shell scripts that prompt for user input to run each step in the pipeline are available at our open-access Zenodo repository (see Data and code availability).

FASTQ files were aligned to the reference genome using BWA (Li and Durbin, 2009) for bulk WES and STAR (Dobin et al., 2012) for scRNA-seq. Picard was used to convert SAM files to BAM format and then to sort the alignments, add read groups, and remove duplicates, producing the assembled BAM files. SAMtools (Danecek et al., 2021) was used to filter alignment records in the scRNA-seq data based on BAM flags, mapping quality, or location. For subsequent estimation of bursting kinetics, featureCounts (Liao et al., 2013) was used to quantify gene expression. Summary statistics output by featureCounts and heatmaps of the variant allele frequencies (VAFs) were generated for each sample (Figures S3–S6).
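The record-level filtering described above (BAM flags, mapping quality, location) can be illustrated in plain Python. This is only a sketch of the logic, not the commands used in the pipeline: the `Alignment` class, the `keep_alignment` function, the exclusion mask, and the mapping-quality threshold of 20 are all assumptions made for illustration, since the protocol does not state the exact settings. The flag bit values themselves are those defined in the SAM format specification. The location test here checks only the alignment start position, which is a simplification of overlap-based region queries.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Flag bits as defined in the SAM format specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800

# Assumed exclusion mask; the protocol does not say which flags were filtered.
DEFAULT_EXCLUDE = (FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_QC_FAIL
                   | FLAG_DUPLICATE | FLAG_SUPPLEMENTARY)


@dataclass(frozen=True)
class Alignment:
    """Hypothetical minimal view of one alignment record."""
    read_name: str
    flag: int
    chrom: str
    pos: int   # 1-based leftmost mapped position
    mapq: int


def keep_alignment(rec: Alignment,
                   exclude_flags: int = DEFAULT_EXCLUDE,
                   min_mapq: int = 20,  # assumed threshold
                   region: Optional[Tuple[str, int, int]] = None) -> bool:
    """Return True if the record passes the flag, MAPQ, and location filters."""
    if rec.flag & exclude_flags:       # any excluded bit set
        return False
    if rec.mapq < min_mapq:            # low-confidence placement
        return False
    if region is not None:
        chrom, start, end = region
        # Simplification: test the start position only, not full overlap.
        if rec.chrom != chrom or not (start <= rec.pos <= end):
            return False
    return True


records = [
    Alignment("r1", 0, "chr1", 100, 60),      # primary, well mapped
    Alignment("r2", 4, "*", 0, 0),            # unmapped
    Alignment("r3", 1024, "chr1", 150, 60),   # marked duplicate
    Alignment("r4", 16, "chr1", 200, 3),      # low mapping quality
    Alignment("r5", 256, "chr1", 250, 60),    # secondary alignment
    Alignment("r6", 16, "chr2", 500, 255),    # primary, reverse strand
]

kept_all = [r.read_name for r in records if keep_alignment(r)]
kept_chr1 = [r.read_name for r in records
             if keep_alignment(r, region=("chr1", 1, 1000))]
print(kept_all)
print(kept_chr1)
```

Combining the excluded flags into a single bit mask mirrors how flag-based exclusion is usually expressed for alignment files, so the mask can be adjusted without changing the function.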

To perform joint variant calling for the bulk WES data, Mutect2 (Benjamin et al., 2019) was run on the BAM files to generate per-sample VCF files, and FilterMutectCalls was used to filter the raw Mutect2 output. To avoid false-positive SNV calls in scRNA-seq caused by RNA editing, we restricted somatic SNVs to those identified in gene coding regions (e.g., by bulk WES) and then applied stringent quality control (QC) procedures with functional annotations from ANNOVAR (Wang et al., 2010). Specifically, we kept SNVs that (i) showed PASS from FilterMutectCalls, (ii) retained homozygous 0/0 genotypes in the normal samples, (iii) had only one alternative allele, (iv) had at least 20 total reads in the normal samples, (v) had at least five alternative reads in the bulk cancer samples, (vi) were not reported by the 1000 Genomes Project (Broeckx et al., 2017), (vii) did not reside in segmental duplications, and (viii) had non-NA scores from the LJB database. Positions of the SNVs identified in the bulk DNA-seq data were used to force-call SNV coverage in scRNA-seq using SAMtools mpileup (Danecek et al., 2021). In the final QC, the reference and alternative read counts for both data types were extracted from the parsed output.
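The eight retention criteria listed above can be written as a single predicate. The sketch below is a hypothetical illustration, not the authors' implementation: the `CandidateSNV` record, its field names, and the `failed_criteria` function are invented for this example, and it assumes that each SNV has already been annotated (for instance from the filtered VCF and the ANNOVAR output) and that the LJB annotation can be summarized by one value, with `None` standing in for NA. Only the thresholds (20 total reads, five alternative reads) come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CandidateSNV:
    """Hypothetical pre-annotated record for one candidate SNV."""
    chrom: str
    pos: int
    ref: str
    alt_alleles: Tuple[str, ...]
    filter_status: str               # FILTER value after FilterMutectCalls
    normal_genotype: str             # genotype in the matched normal, e.g. "0/0"
    normal_depth: int                # total reads in the normal sample
    tumor_alt_reads: int             # alternative reads in the bulk cancer sample
    in_1000_genomes: bool
    in_segmental_duplication: bool
    ljb_score: Optional[float]       # None stands in for NA (assumption)


def failed_criteria(snv: CandidateSNV,
                    min_normal_depth: int = 20,
                    min_tumor_alt_reads: int = 5) -> List[str]:
    """Return the labels (i)-(viii) of every criterion the SNV fails."""
    checks = [
        ("i", snv.filter_status == "PASS"),
        ("ii", snv.normal_genotype == "0/0"),
        ("iii", len(snv.alt_alleles) == 1),
        ("iv", snv.normal_depth >= min_normal_depth),
        ("v", snv.tumor_alt_reads >= min_tumor_alt_reads),
        ("vi", not snv.in_1000_genomes),
        ("vii", not snv.in_segmental_duplication),
        ("viii", snv.ljb_score is not None),
    ]
    return [label for label, ok in checks if not ok]


# Made-up example records, for illustration only.
good = CandidateSNV("chr1", 12345, "C", ("T",), "PASS", "0/0",
                    35, 9, False, False, 0.87)
bad = CandidateSNV("chr2", 67890, "A", ("T", "G"), "PASS", "0/0",
                   12, 9, False, False, 0.42)

candidates = [good, bad]
retained = [s for s in candidates if not failed_criteria(s)]
print(len(retained))
print(failed_criteria(bad))
```

Returning the list of failed labels rather than a single boolean makes it easy to tabulate how many candidates each criterion removes, which can be useful when reviewing the QC step.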
