Bioinformatic Analysis

Suzanne Rohrback; Sarah Munchel; Fiona Kaper

Last updated date: Nov 24, 2020

An abbreviated version of this protocol was published in Science Translational Medicine in Jun, 2020

Circulating transcripts in maternal blood reflect a molecular signature of early-onset preeclampsia

Command Line Processing of Sequencing Reads

The following list describes the key steps for command line processing of C-RNA sequencing data. Table 1 specifies the exact software used for each processing step. Multiple software packages and versions exist for many of these functions, and they will typically produce comparable results. However, it is beyond the scope of this document to delineate all possible options or their corresponding caveats.

1. Demultiplex: generate fastq files for each sample from raw sequencing data.

a. If a given library is sequenced on multiple sequencing runs, the fastq files must be concatenated before proceeding.

2. Downsample: Randomly subsample a specified number of fastq reads.

a. A maximum of 50M reads per sample was used for this publication. For this protocol, we recommend processing at least 20M reads per sample. Higher sequencing depths can reduce technical noise, though there appear to be diminishing returns beyond 100M reads per sample.

3. Filter Abundant Sequences: Remove sequencing reads from common contaminants.

a. Important: Save the remaining (non-abundant) reads in fastq format with the “--un” option. These files are the input in step 4.

b. Use a reference index containing undesired sequences such as Illumina adapters, chrM, human ribosomal and 5S DNA, phage phiX174, polyA and polyC. The appropriate composition of this reference may depend on experimental details.

4. Map to Reference Genome: Identify where in the genome (and/or transcriptome) each sequencing read originates from.

a. Bowtie2 (1) must be installed; we used v2.2.3.

5. Sort Mapped Reads: Lexicographical sorting of sequencing reads by mapping coordinates.

6. Index Mapped Reads: Generate an index file used in step 7.

7. Count Reads per Transcript: Tally the mapped reads that overlap each annotated feature.

a. Ensure that the transcriptome reference file is in GTF format and is from the same genome assembly as the one used for mapping in step 4.
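
Steps 1a and 2 can be illustrated with a short Python sketch. This is not the software listed in Table 1 and not the authors' code: it only shows what concatenating per-run fastq files and drawing a uniform random subsample mean in practice. It assumes gzip-compressed fastq input; the function names and the default seed are hypothetical, and the subsample uses single-pass reservoir sampling, which holds the sampled reads in memory and so is practical only for small read counts.

```python
import gzip
import random
import shutil


def concatenate_fastq_gz(parts, combined):
    """Step 1a: join per-run .fastq.gz files for one library into one file.

    Gzip members can be appended byte-for-byte, so nothing is decompressed.
    """
    with open(combined, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def read_fastq(path):
    """Yield fastq records as 4-line tuples (header, sequence, '+', quality)."""
    with gzip.open(path, "rt") as handle:
        while True:
            record = tuple(handle.readline() for _ in range(4))
            if not record[0]:  # an empty header line means end of file
                return
            yield record


def subsample_fastq(path, n_reads, seed=11):
    """Step 2: keep a uniform random sample of at most n_reads records."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(read_fastq(path)):
        if i < n_reads:
            reservoir.append(record)
        else:
            # Replace an existing entry with probability n_reads / (i + 1).
            j = rng.randint(0, i)
            if j < n_reads:
                reservoir[j] = record
    return reservoir
```

For paired-end data, calling `subsample_fastq` on both mate files with the same seed and the same `n_reads` selects the same record positions in each file, which keeps mates paired as long as both files list reads in the same order.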

Table 1. Command line software used for data processing.

Processing Step	Software	Sub-Function*	Version	Recommended Command Line Options*
1	Bcl2fastq2	-	v2.20.0.422	--barcode-mismatches 0 --ignore-missing-bcls --ignore-missing-filter --ignore-missing-positions
2	Seqtk (2)	sample	v1.2-r102-dirty	-2
3	Bowtie (3)	-	v1.0.0	-k 1 -n 0 -l 25 --mapq 10
4	TopHat2 (4)	-	v2.0.13	--no-coverage-search --no-novel-indels --b2-fast
5	Picard (5)	ReorderSam	v1.93	-
6	Samtools (6)	index	v0.1.19	-
7	Subread (7)	featureCounts	v1.4.6	-t exon -g gene_id -Q 10 --primary -p

* "-" indicates the category is not applicable for the given processing step.
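
As an illustration of how the options in Table 1 fit into complete command lines, the following Python sketch assembles the argument lists for steps 2 and 7 only. The option strings are copied from Table 1; everything else (file names, the seed value, the helper names) is hypothetical. The positional-argument order reflects our understanding of the seqtk and featureCounts usage messages and should be checked against the installed versions before use.

```python
import shlex

# Options copied verbatim from Table 1.
SEQTK_SAMPLE_OPTS = ["-2"]
FEATURECOUNTS_OPTS = ["-t", "exon", "-g", "gene_id", "-Q", "10", "--primary", "-p"]


def seqtk_sample_cmd(fastq_in, n_reads, seed=11):
    """Step 2: argv for seqtk sample; seqtk writes the sampled reads to stdout."""
    return ["seqtk", "sample", *SEQTK_SAMPLE_OPTS, f"-s{seed}", fastq_in, str(n_reads)]


def featurecounts_cmd(gtf, counts_out, bam_files):
    """Step 7: argv for featureCounts with a GTF annotation and one or more BAMs."""
    return ["featureCounts", *FEATURECOUNTS_OPTS, "-a", gtf, "-o", counts_out, *bam_files]


if __name__ == "__main__":
    # Hypothetical file names, printed as shell-quoted strings for inspection.
    print(shlex.join(seqtk_sample_cmd("sample_R1.fastq", 20_000_000)))
    print(shlex.join(featurecounts_cmd("genes.gtf", "counts.txt", ["sample.bam"])))
```

Building each command as a list rather than a single string avoids shell quoting problems when the lists are later passed to `subprocess.run`. For paired-end libraries, the same seed should be used for both mate files so that read pairing is preserved.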

Dataset Quality Control (QC) Assessments

QC checks are crucial to have confidence in the quality of datasets – and therefore the conclusions drawn from them. However, which metrics are most informative and what thresholds to set are likely dependent on a number of experimental variables, including sample type and quantity, C-RNA extraction, enrichment, and sequencing library preparation protocols.

While recommendations for universal QC requirements are not yet possible, we strongly advise any users to examine the quality of all datasets prior to downstream analyses. Commonly useful tools and data checks are listed below.

· The percent of reads excluded in processing step 3 (filtering abundant sequences). If this value is large, the specific sequence(s) represented may help troubleshoot assay performance issues.

· Mapping rates (processing step 4). Samples with abnormally low values warrant further examination.

· Software packages

o FastQC (8): provides a variety of useful measurements about sequencing run quality.

o Preseq (9): estimates library complexity; abnormally high or low values warrant further investigation.

o RSeQC (10): provides a variety of useful measurements about RNAseq data quality. Gene body coverage and transcript abundance saturation are particularly informative.

o BLAST (11): running BLAST on a selection of unmapped reads can confirm or rule out unexpected contaminations.
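
The first two checks above (percent of reads filtered in step 3, mapping rate in step 4) reduce to simple ratios of read counts. The sketch below computes them and flags samples whose value is unusually low relative to the rest of the cohort. The flagging rule (more than three median absolute deviations below the cohort median) is our assumption for illustration, not a threshold recommended by this protocol, and the function names are hypothetical.

```python
from statistics import median


def percent_filtered(total_reads, reads_after_filter):
    """Percent of reads removed by the abundant-sequence filter (step 3)."""
    return 100.0 * (total_reads - reads_after_filter) / total_reads


def mapping_rate(mapped_reads, input_reads):
    """Fraction of reads entering step 4 that were mapped."""
    return mapped_reads / input_reads


def flag_low_outliers(values_by_sample, n_mads=3.0):
    """Return samples whose value is more than n_mads MADs below the median.

    The cutoff is an illustrative assumption; suitable thresholds depend on
    the experiment.
    """
    values = list(values_by_sample.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread to judge against
    cutoff = center - n_mads * mad
    return sorted(s for s, v in values_by_sample.items() if v < cutoff)
```

For example, with mapping rates of 0.90, 0.88, 0.91, 0.89 and 0.40 for five samples, only the sample at 0.40 falls below the cutoff and would be singled out for closer inspection.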

Group Comparison Analyses

Much more characterization is needed across a wide range of applications, populations, and preparations before universal recommendations can be made for how best to generate biological interpretations from C-RNA sequencing data. The approaches used in the provided code may or may not be optimal for a different dataset.

We developed custom analyses to address a specific challenge: high biological variability. Even with relatively large sample sizes, we observed inconsistent results when using standard tools. We theorize that the heterogeneity of preeclampsia and the diverse and numerous sources of C-RNA manifest as abnormally prevalent outlier signals, as well as smaller fold-change differences and lower signal-to-noise ratios than in standard, single-tissue RNA-Seq.

The attached CRNA-DEX-ANALYSIS-sharing.R.txt R script contains the code used to run differential expression analysis (with and without jackknifing) on transcript read counts obtained from bioinformatic processing. The attached CRNA-ADABOOST-ANALYSIS-sharing.py.txt Python script contains the code used to optimize hyperparameters and fit AdaBoost models to the iPEC C-RNA data. Because the AdaBoost implementation is not deterministic, users should expect the models generated to be similar, but not necessarily identical, across repeated runs. Input file formatting requirements are described in each script.
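
To make the idea of jackknifing concrete, the sketch below re-evaluates one gene while leaving out each sample in turn and reports the fraction of leave-one-out subsets in which the gene still passes. It is not the attached R script and does not reproduce its statistics: a simple log2 fold-change threshold on group means stands in for the actual differential expression test, and the function names, pseudocount, and threshold are all assumptions for illustration.

```python
from math import log2
from statistics import mean


def log2_fold_change(case, control, pseudocount=1.0):
    """Log2 ratio of group means; the pseudocount avoids division by zero."""
    return log2((mean(case) + pseudocount) / (mean(control) + pseudocount))


def leave_one_out(values):
    """Yield copies of the list with one element removed at a time."""
    for i in range(len(values)):
        yield values[:i] + values[i + 1:]


def jackknife_support(case, control, min_abs_lfc=1.0):
    """Fraction of leave-one-out subsets in which |log2FC| >= min_abs_lfc.

    Each sample from either group is dropped once; both groups need at
    least two samples.
    """
    subsets = [(c, control) for c in leave_one_out(case)]
    subsets += [(case, c) for c in leave_one_out(control)]
    passed = sum(abs(log2_fold_change(a, b)) >= min_abs_lfc for a, b in subsets)
    return passed / len(subsets)
```

A gene whose signal comes from a single outlier sample, such as counts of 10, 10, 10, 1000 in cases versus 10, 10, 10, 10 in controls, loses its signal when the outlier is left out and so scores below 1.0, whereas a gene that differs consistently across samples scores 1.0. Requiring high support is one way to reduce the influence of the outlier signals described above.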

References

1. B. Langmead, S. L. Salzberg, Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357-359 (2012).

2. H. Li, Seqtk. https://github.com/lh3/seqtk, (2016).

3. B. Langmead, C. Trapnell, M. Pop, S. L. Salzberg, Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10, R25 (2009).

4. D. Kim et al., TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14, R36 (2013).

5. Broad Institute, Picard Toolkit. GitHub Repository. http://broadinstitute.github.io/picard/, (2019).

6. H. Li et al., The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009).

7. Y. Liao, G. K. Smyth, W. Shi, featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923-930 (2014).

8. S. Andrews, FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/, (2010).

9. T. Daley, A. D. Smith, Predicting the molecular complexity of sequencing libraries. Nat Methods 10, 325-327 (2013).

10. L. Wang, S. Wang, W. Li, RSeQC: quality control of RNA-seq experiments. Bioinformatics 28, 2184-2185 (2012).

11. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, Basic local alignment search tool. J Mol Biol 215, 403-410 (1990).

Related files

CRNA-ADABOOST-ANALYSIS-sharing.py.txt

CRNA-DEX-ANALYSIS-sharing.R.txt

How to cite:

Readers should cite both the Bio-protocol preprint and the original research article where this protocol was used:

Rohrback, S., Munchel, S. and Kaper, F. (2020). Bioinformatic Analysis. Bio-protocol Preprint. bio-protocol.org/prep653.
Munchel, S., Rohrback, S., Randise-Hinchliff, C., Kinnings, S., Deshmukh, S., Alla, N., Tan, C., Kia, A., Greene, G., Leety, L., Rhoa, M., Yeats, S., Saul, M., Chou, J., Bianco, K., O’Shea, K., Bujold, E., Norwitz, E., Wapner, R., Saade, G. and Kaper, F. (2020). Circulating transcripts in maternal blood reflect a molecular signature of early-onset preeclampsia. Science Translational Medicine 12(550). DOI: 10.1126/scitranslmed.aaz0131