Last updated date: Nov 24, 2020
Command Line Processing of Sequencing Reads
The following list describes the key steps for command line processing of C-RNA sequencing data. Table 1 specifies the exact software used for each processing step. Multiple software packages and versions exist for many of these functions, and these will typically produce comparable results. However, it is beyond the scope of this document to delineate all possible options or their corresponding caveats.
1. Demultiplex: generate fastq files for each sample from raw sequencing data.
a. If a given library is sequenced on multiple sequencing runs, the fastq files must be concatenated before proceeding.
2. Downsample: randomly subsample a specified number of fastq reads.
a. A maximum of 50M reads was used for this publication. For this protocol, we recommend processing at least 20M reads per sample. Higher sequencing depths can reduce technical noise, though there appear to be diminishing returns with >100M reads per sample.
3. Filter Abundant Sequences: Remove sequencing reads from common contaminants.
a. Important: Save the remaining (non-abundant) reads in fastq format with the “--un” option. These files are the input for step 4.
b. Use a reference index containing undesired sequences such as Illumina adapters, chrM, human ribosomal and 5S DNA, phage phiX174, polyA and polyC. The appropriate composition of this reference may depend on experimental details.
4. Map to Reference Genome: Identify where in the genome (and/or transcriptome) each sequencing read originates from.
a. Bowtie2 (1) must be installed; we used v2.2.3.
5. Sort Mapped Reads: Lexicographical sorting of sequencing reads by mapping coordinates.
6. Index Mapped Reads: Generate an index file used in step 7.
7. Count Reads per Transcript:
a. Ensure that the transcriptome reference file is in GTF format, and from the same genome assembly as used for mapping in step 4.
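Steps 1a and 2 can be illustrated with a short Python sketch. This is not the software used in the protocol (Seqtk performs the downsampling, see Table 1); it only shows the idea of merging per-run files and drawing a uniform random subsample with reservoir sampling. The function names and file layout are hypothetical, and the sample is held in memory, so Seqtk remains the practical choice at 20M to 50M reads.

```python
import gzip
import random
import shutil


def concatenate_fastq(parts, merged):
    """Step 1a: append each run's FASTQ file to `merged`, in the order given.

    Byte-level concatenation is valid for plain FASTQ and for gzip files,
    provided all parts share the same format.
    """
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def open_text(path):
    """Open a plain or gzip-compressed FASTQ file for reading as text."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def read_records(handle):
    """Yield FASTQ records as 4-line tuples (header, sequence, '+', qualities)."""
    while True:
        record = tuple(handle.readline() for _ in range(4))
        if not record[0]:  # empty string means end of file
            return
        yield record


def downsample_fastq(source, target, n_reads, seed=0):
    """Step 2: write a uniform random sample of at most `n_reads` records.

    Uses reservoir sampling (Algorithm R). Returns the number of records
    written, which is smaller than `n_reads` if the input is shallower.
    """
    rng = random.Random(seed)
    reservoir = []
    with open_text(source) as handle:
        for i, record in enumerate(read_records(handle)):
            if i < n_reads:
                reservoir.append(record)
            else:
                j = rng.randint(0, i)  # inclusive on both ends
                if j < n_reads:
                    reservoir[j] = record
    with open(target, "wt") as out:
        for record in reservoir:
            out.writelines(record)
    return len(reservoir)
```

In this sketch the choice of records depends only on the seed and on each record's position in the file, so sampling both mate files of a paired-end library with the same seed keeps the pairs matched.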
Table 1. Command line software used for data processing.
Processing Step | Software | Sub-Function* | Version | Recommended Command Line Options* |
1 | Bcl2fastq2 | - | v2.20.0.422 | --barcode-mismatches 0 --ignore-missing-bcls --ignore-missing-filter --ignore-missing-positions |
2 | Seqtk (2) | sample | v1.2-r102-dirty | -2 |
3 | Bowtie (3) | - | v1.0.0 | -k 1 -n 0 -l 25 --mapq 10 |
4 | TopHat2 (4) | - | v2.0.13 | --no-coverage-search --no-novel-indels --b2-fast |
5 | Picard (5) | ReorderSam | v1.93 | - |
6 | Samtools (6) | index | v0.1.19 | - |
7 | Subread (7) | featureCounts | v1.4.6 | -t exon -g gene_id -Q 10 --primary -p |
* "-" indicates the category is not applicable for the given processing step.
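As an illustration of how the options in Table 1 fit into full commands, the following Python sketch assembles argument lists for steps 2, 3, 4, 6 and 7. The option strings are copied from Table 1; everything else (function names, file names, the seed value, output locations, and the positional layout of each command) is an assumption based on each tool's usual usage and should be checked against the help output of the installed version. Steps 1 and 5 are left out because their invocation depends heavily on the local installation.

```python
import shlex


def seqtk_sample_cmd(fastq, n_reads, seed=11):
    """Step 2: subsample n_reads records; seqtk writes the result to stdout."""
    return ["seqtk", "sample", "-2", "-s", str(seed), fastq, str(n_reads)]


def bowtie_filter_cmd(index, fastq, unaligned_fastq, hits):
    """Step 3: align to the abundant-sequence index; --un keeps the rest."""
    return ["bowtie", "-k", "1", "-n", "0", "-l", "25", "--mapq", "10",
            "--un", unaligned_fastq, index, fastq, hits]


def tophat_cmd(index, fastq, out_dir):
    """Step 4: map the filtered reads to the reference genome."""
    return ["tophat2", "--no-coverage-search", "--no-novel-indels",
            "--b2-fast", "-o", out_dir, index, fastq]


def samtools_index_cmd(bam):
    """Step 6: index the sorted BAM file."""
    return ["samtools", "index", bam]


def featurecounts_cmd(gtf, bam, counts):
    """Step 7: count reads per feature using a GTF annotation."""
    return ["featureCounts", "-t", "exon", "-g", "gene_id", "-Q", "10",
            "--primary", "-p", "-a", gtf, "-o", counts, bam]


def render(cmd):
    """Format an argument list as a shell-safe string for logging."""
    return " ".join(shlex.quote(arg) for arg in cmd)


if __name__ == "__main__":
    # Hypothetical file names, for display only; nothing is executed here.
    print(render(seqtk_sample_cmd("sample.fastq", 50000000)))
    print(render(bowtie_filter_cmd("abundant", "sample.sub.fastq",
                                   "sample.clean.fastq", "sample.abundant.hits")))
    print(render(tophat_cmd("genome", "sample.clean.fastq", "tophat_out")))
    print(render(samtools_index_cmd("sample.sorted.bam")))
    print(render(featurecounts_cmd("genes.gtf", "sample.sorted.bam", "counts.txt")))
```

Keeping each command as a list makes it straightforward to pass to `subprocess.run` later and to record the exact options alongside the results.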
Dataset Quality Control (QC) Assessments
QC checks are crucial to have confidence in the quality of datasets – and therefore the conclusions drawn from them. However, which metrics are most informative and what thresholds to set are likely dependent on a number of experimental variables, including sample type and quantity, C-RNA extraction, enrichment, and sequencing library preparation protocols.
While recommendations for universal QC requirements are not yet possible, we strongly advise any users to examine the quality of all datasets prior to downstream analyses. Commonly useful tools and data checks are listed below.
· The percent of reads excluded in processing step 3 (filtering abundant sequences). If this value is large, the specific sequence(s) represented may help troubleshoot assay performance issues.
· Mapping rates (processing step 4). Samples with abnormally low values warrant further examination.
· Software packages
o FastQC (8): provides a variety of useful measurements about sequencing run quality.
o Preseq (9): estimates library complexity; abnormally high or low values warrant further investigation.
o RSeQC (10): provides a variety of useful measurements about RNAseq data quality. Gene body coverage and transcript abundance saturation are particularly informative.
o BLAST (11): running BLAST on a selection of unmapped reads can confirm or rule out unexpected contaminations.
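The first two checks in the list above reduce to simple ratios of read counts. The sketch below is one way to compute them; the function names are hypothetical, and the choice of denominators (filtered percentage relative to reads entering step 3, mapping rate relative to reads entering step 4) is an assumption rather than a definition given in this protocol. The number of mapped reads must be supplied by the user, for example from the aligner's own summary.

```python
def count_fastq_reads(path):
    """Count records in an uncompressed FASTQ file (4 lines per record)."""
    with open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def percent(part, total):
    """Return part/total as a percentage, or 0.0 when total is zero."""
    return 100.0 * part / total if total else 0.0


def filter_and_mapping_rates(n_input, n_after_filter, n_mapped):
    """Summarize step 3 and step 4 outcomes for one sample.

    n_input        reads entering step 3 (after downsampling)
    n_after_filter reads saved with --un, i.e. the input to step 4
    n_mapped       reads mapped in step 4
    """
    return {
        "percent_filtered": percent(n_input - n_after_filter, n_input),
        "percent_mapped": percent(n_mapped, n_after_filter),
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(filter_and_mapping_rates(1000, 900, 810))
```

Computing these values for every sample in a batch and comparing them side by side makes abnormally low or high samples easier to spot than inspecting each log individually.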
Group Comparison Analyses
Much more characterization is needed across a wide range of applications, populations, and preparations before universal recommendations can be made for how best to derive biological interpretations from C-RNA sequencing data. The approaches used in the provided code may or may not be optimal for a different dataset.
We developed custom analyses to address a specific challenge: high biological variability. Even with relatively large sample sizes, we observed inconsistent results when using standard tools. We theorize that the heterogeneity of preeclampsia and the diverse and numerous sources of C-RNA manifest as abnormally prevalent outlier signals, as well as smaller fold-change differences and lower signal-to-noise ratios than in standard, single-tissue RNA-Seq.
The attached CRNA-DEX-ANALYSIS-sharing.R.txt R script contains the code used to run differential expression analysis (with and without jackknifing) on transcript read counts obtained from bioinformatic processing. The attached CRNA-ADABOOST-ANALYSIS-sharing.py.txt Python script contains the code used to optimize hyperparameters and fit AdaBoost models to the iPEC C-RNA data. As the AdaBoost implementation is not deterministic, users should expect the models generated to be similar, but not necessarily identical, if run multiple times. Input file formatting requirements are described in each script.
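To make the idea of jackknifing concrete, here is a minimal, generic sketch of a leave-one-out check on a fold change. It is not the authors' implementation, which is in the attached R script; the function names, the pseudocount, the threshold, and the rule "every leave-one-out estimate must pass" are all assumptions chosen for illustration. It assumes normalized counts and at least two samples per group.

```python
import math


def log2_fold_change(case, control, pseudocount=1.0):
    """log2 ratio of group means, with a pseudocount to avoid log(0)."""
    mean_case = sum(case) / len(case)
    mean_control = sum(control) / len(control)
    return math.log2((mean_case + pseudocount) / (mean_control + pseudocount))


def jackknife_fold_changes(case, control):
    """Recompute the fold change with each sample left out once."""
    estimates = []
    for i in range(len(case)):
        estimates.append(log2_fold_change(case[:i] + case[i + 1:], control))
    for i in range(len(control)):
        estimates.append(log2_fold_change(case, control[:i] + control[i + 1:]))
    return estimates


def is_robust(case, control, min_abs_lfc=1.0):
    """True if all leave-one-out estimates share a sign and pass the threshold."""
    estimates = jackknife_fold_changes(case, control)
    same_sign = all(e > 0 for e in estimates) or all(e < 0 for e in estimates)
    return same_sign and all(abs(e) >= min_abs_lfc for e in estimates)


if __name__ == "__main__":
    # Made-up counts: a consistent shift, then a shift driven by one outlier.
    print(is_robust([40, 44, 38, 42], [10, 9, 11, 10]))
    print(is_robust([5, 6, 400, 5], [5, 6, 5, 6]))
```

The second example shows why such a check can help with outlier-prone data: the apparent difference disappears once the single extreme sample is left out.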
References
1. B. Langmead, S. L. Salzberg, Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357-359 (2012).
2. H. Li, Seqtk. https://github.com/lh3/seqtk, (2016).
3. B. Langmead, C. Trapnell, M. Pop, S. L. Salzberg, Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10, R25 (2009).
4. D. Kim et al., TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14, R36 (2013).
5. Broad Institute, Picard Toolkit. GitHub Repository, http://broadinstitute.github.io/picard/, (2019).
6. H. Li et al., The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009).
7. Y. Liao, G. K. Smyth, W. Shi, featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923-930 (2014).
8. S. Andrews, FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/, (2010).
9. T. Daley, A. D. Smith, Predicting the molecular complexity of sequencing libraries. Nat Methods 10, 325-327 (2013).
10. L. Wang, S. Wang, W. Li, RSeQC: quality control of RNA-seq experiments. Bioinformatics 28, 2184-2185 (2012).
11. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, Basic local alignment search tool. J Mol Biol 215, 403-410 (1990).