Published: February 5, 2021, Volume 11, Issue 3 DOI: 10.21769/BioProtoc.3912
Reviewed by: Antony Chettoor, Xinlei Li, Christian Sailer
Abstract
RNA sequencing (RNA-seq) has opened up the possibility of studying virtually any organism at the whole-transcriptome level. Nevertheless, the absence of a sequenced and accurately annotated reference genome may be an obstacle to applying this technique to non-model organisms, especially those with a complex genome. While de novo transcriptome assembly can circumvent this problem, it is often computationally demanding. Furthermore, transcriptome annotation and Gene Ontology enrichment analysis are often laborious without an automated system. Here we describe, step by step, the pipeline that was used to perform the transcriptome assembly, annotation, and Gene Ontology analysis of Scots pine (Pinus sylvestris), a gymnosperm species with a complex genome. Using only free software available to the scientific community and running on a standard personal computer, the pipeline is intended to facilitate transcriptomic studies of non-model species, while remaining flexible enough to be used with any organism.
Keywords: RNA-seq
Background
Non-model organisms are valuable for environmental and evolutionary studies. However, the absence of a closely related model species to serve as a reference for transcriptomic studies was a limitation until the development of de novo assembly methods. De novo assemblers brought non-model organisms into the omics era, but the required analyses may be computationally demanding and are often time consuming. This is especially the case when testing, evaluating, and establishing the pipeline for the required analyses. The cost of acquiring a computing server and/or software for automated data processing, which could speed up the process, may be prohibitive for some research groups. We detail here the procedures that were employed for the transcriptome assembly, annotation, and Gene Ontology (GO) analysis of Scots pine (Pinus sylvestris), an organism with a complex genome (Duarte et al., 2019). The pipeline was developed based on works and benchmark evaluations that describe the best practices for transcriptome studies (Conesa et al., 2016; Honaas et al., 2016; Geniza and Jaiswal, 2017). The pine transcriptome analyses were performed on an 8-core machine with 32 GB of RAM. The outputs of three different assemblers, namely BinPacker (Liu et al., 2016), SOAPdenovo-Trans (Xie et al., 2014), and Trinity (Grabherr et al., 2011), were evaluated individually and in combination. For the pine transcriptome assembly, the best result was obtained by combining the outputs of the different assemblers, followed by filtering for redundancies with EvidentialGene (Gilbert, 2020). The selection and combination of assemblers should nevertheless be adjusted according to the input data, and the decision can be made based on the quality assessment that is described step by step here.
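To illustrate the idea of combining assembler outputs before redundancy filtering, the following minimal Python sketch concatenates FASTA records from several assemblies and prefixes each sequence ID with an assembler label so that IDs remain unique in the merged file. The function names and labels are hypothetical, the prefixing scheme is an assumption, and the sketch is not part of the published pipeline; it does not invoke EvidentialGene.

```python
from typing import Dict, Iterable, Iterator, List


def prefix_fasta_headers(lines: Iterable[str], prefix: str) -> Iterator[str]:
    """Yield FASTA lines with `prefix_` prepended to every sequence ID."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: keep the original ID and description after the prefix.
            yield f">{prefix}_{line[1:]}"
        else:
            # Sequence line: passed through unchanged.
            yield line


def merge_assemblies(assemblies: Dict[str, Iterable[str]]) -> List[str]:
    """Concatenate several assemblies (label -> FASTA lines) into one list of lines."""
    merged: List[str] = []
    for label, lines in assemblies.items():
        merged.extend(prefix_fasta_headers(lines, label))
    return merged


if __name__ == "__main__":
    # Toy input; the labels are illustrative only.
    demo = {
        "binpacker": [">contig1", "ACGT"],
        "trinity": [">contig1 len=4", "TTGA"],
    }
    print("\n".join(merge_assemblies(demo)))
```

In practice, each value would be an open file handle for one assembler's FASTA output, and the merged lines would be written to a single file that serves as input for redundancy filtering.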
InterProScan (Jones et al., 2014) performs a comprehensive signature annotation of the predicted proteins, while BLAST+ (Camacho et al., 2009) allows intra- and interspecific mapping for functional comparisons. We therefore combined InterProScan's protein signatures with several BLAST+ searches to annotate the transcriptome of P. sylvestris (Duarte et al., 2019), and integrated the results with Trinotate (Bryant et al., 2017). The unique GO identifiers were retrieved with a Python script that we make available, and were used for GO enrichment analysis with BiNGO (Maere et al., 2005). The pipeline is based on tools that are freely available to the scientific community. While it is especially useful for studies of non-model organisms, which lack closely related and well-annotated species to guide the assembly and annotation, the pipeline is flexible enough to be applied to virtually any organism.
Software
Software – Linux OS:
Several programs that will be used in the protocol, such as BLAST+, Bowtie2, FASTA Splitter, and BUSCO, can easily be managed using the Bioconda channel (Grüning et al., 2018; https://bioconda.github.io/). Bioconda is a repository of packages specialized in bioinformatics, and it requires the conda package manager to be installed. The manager automatically installs all dependencies for a given program and makes the program available in your PATH. Software support via Bioconda is continuously updated, and we suggest always checking whether the required software is available through this repository. After installing conda, add the channels in the recommended order (please refer to https://bioconda.github.io/user/install.html#install-conda). Other software will need to be installed individually and may require adjustments depending on your system. Next, we list the software required for each step, as summarized in the flowchart (Figure 1).
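As a hedged illustration of the channel setup, the Python sketch below builds the `conda config` commands and, by default, only prints them. The channel order shown (bioconda added first, then conda-forge, with strict channel priority) is an assumption based on the Bioconda installation instructions and may change, so the linked page should be checked before running anything. The function names are hypothetical.

```python
import subprocess
from typing import List

# Assumed order: each `--add` prepends a channel, so the last one added
# (conda-forge) ends up with the highest priority.
CHANNELS_IN_ADD_ORDER = ["bioconda", "conda-forge"]


def channel_commands() -> List[List[str]]:
    """Return the conda commands that register the channels."""
    cmds = [["conda", "config", "--add", "channels", ch] for ch in CHANNELS_IN_ADD_ORDER]
    cmds.append(["conda", "config", "--set", "channel_priority", "strict"])
    return cmds


def configure_channels(dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in channel_commands():
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: nothing is changed in the conda configuration.
    configure_channels(dry_run=True)
```

Once the channels are configured, a tool available via Bioconda can be installed with `conda install`, for example `conda install bowtie2`.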
Figure 1. Flowchart illustrating the five steps for transcriptome assembly, annotation, and Gene Ontology analysis. The input for starting each step is represented in the boxes, and the main procedures of each step are described in black. The software required for each step is listed in blue. For the transcriptome assembly (B), the user may use the software of choice. The three assemblers described in this pipeline were used for the Scots pine study (Duarte et al., 2019) and are marked as optional [opt].
Data pre-processing
FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) – quality control of sequencing files
FastQ Screen (https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen) – screening of sequencing files
Trimmomatic (Bolger et al., 2014; available via Bioconda) – trimming of sequencing files
Trinity package (Grabherr et al., 2011) – in silico normalization
Transcriptome assembly
BinPacker (Liu et al., 2016; https://sourceforge.net/projects/transcriptomeassembly/) – assembler
EvidentialGene (Gilbert, 2020; http://arthropods.eugenes.org/EvidentialGene/) – assembly filtering
Galaxy platform (Afgan et al., 2018; usegalaxy.org) – open source online platform for data research
SOAPdenovo-Trans (Xie et al., 2014; https://github.com/aquaskyline/SOAPdenovo-Trans) – assembler
Trinity package (Grabherr et al., 2011; available in Galaxy platform) – assembler
Quality assessment
Bowtie2 (Langmead and Salzberg, 2012; available via Bioconda) – short read aligner
BLAST+ (Camacho et al., 2009; available via Bioconda) – tool for comparing sequences and searching databases
BUSCO (Simão et al., 2015; available via Bioconda) – assembly completeness assessment
Samtools (Li et al., 2009; https://github.com/samtools/samtools) – data manipulation
DETONATE (Li et al., 2014; available via Bioconda) – evaluation of de novo transcriptome assemblies
Trinity package (Grabherr et al., 2011) – estimation of full-length transcripts in the assembly
Transcriptome annotation
BLAST+ (Camacho et al., 2009; available via Bioconda) – tool for comparing sequences and searching databases
Bowtie2 (Langmead and Salzberg, 2012; available via Bioconda) – short read aligner
Corset (Davidson and Oshlack, 2014; available via Bioconda) – clustering of read counts
FASTA Splitter (http://kirill-kryukov.com/study/tools/fasta-splitter/) – creation of subsets of a FASTA file
getorf (Rice et al., 2000; http://emboss.sourceforge.net/) – prediction of open reading frames
HMMER (http://hmmer.org/) – prediction of domains for annotation
InterProScan (Jones et al., 2014; https://www.ebi.ac.uk/interpro/) – protein domains identifier
RNAmmer (Lagesen et al., 2007; http://www.cbs.dtu.dk/services/RNAmmer) – ribosomal RNA annotation
SignalP (Petersen et al., 2011; https://services.healthtech.dtu.dk/) – prediction of signal peptides
TMHMM (Krogh et al., 2001; https://services.healthtech.dtu.dk/) – protein transmembrane helix prediction
TransDecoder (Haas et al., 2013; http://transdecoder.github.io) – identification of coding regions
Trinotate (Bryant et al., 2017; https://trinotate.github.io/) – suite for transcriptome annotation
Gene Ontology analysis
BiNGO (Maere et al., 2005; download available via Cytoscape platform) – Gene Ontology analysis
Cytoscape (Shannon et al., 2000; https://cytoscape.org/) – platform for performing the Gene Ontology analysis
GO retriever – provided script for retrieving transcript ID - Gene Ontology ID relationships
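The kind of task handled by the GO retriever can be sketched as follows. This is not the authors' script: it is a minimal illustration that assumes a tab-delimited annotation report (such as a Trinotate-style report) with a header line starting with `#` and the transcript ID in the second column, and it collects every `GO:` identifier found on each row. The function name and the `id_column` parameter are hypothetical.

```python
import re
from typing import Dict, Iterable, List

# GO identifiers consist of "GO:" followed by seven digits.
GO_ID = re.compile(r"GO:\d{7}")


def transcript_to_go(rows: Iterable[str], id_column: int = 1) -> Dict[str, List[str]]:
    """Map each transcript ID to its unique GO IDs, in order of first appearance.

    Assumption: rows are tab-delimited and `id_column` (0-based) holds the
    transcript ID. Header and comment lines starting with '#' are skipped.
    """
    mapping: Dict[str, List[str]] = {}
    for row in rows:
        row = row.rstrip("\n")
        if not row or row.startswith("#"):
            continue
        fields = row.split("\t")
        if len(fields) <= id_column:
            continue  # malformed row
        found = GO_ID.findall(row)
        if not found:
            continue  # transcript without GO annotation
        seen = mapping.setdefault(fields[id_column], [])
        for go in found:
            if go not in seen:
                seen.append(go)
    return mapping


if __name__ == "__main__":
    # Toy report with made-up transcript IDs.
    report = [
        "#gene_id\ttranscript_id\tgene_ontology",
        "g1\tt1\tGO:0005737^cellular_component^cytoplasm`GO:0003677^molecular_function^DNA binding",
        "g2\tt2\t.",
    ]
    for transcript, go_ids in transcript_to_go(report).items():
        for go in go_ids:
            print(f"{transcript}\t{go}")
```

The resulting transcript-to-GO pairs would still need to be converted to whatever annotation format the enrichment tool expects.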
Procedure
Article Information
Copyright
© 2021 The Authors; exclusive licensee Bio-protocol LLC.
How to cite
Duarte, G. T., Volkova, P. Y. and Geras’kin, S. A. (2021). A Pipeline for Non-model Organisms for de novo Transcriptome Assembly, Annotation, and Gene Ontology Analysis Using Open Tools: Case Study with Scots Pine. Bio-protocol 11(3): e3912. DOI: 10.21769/BioProtoc.3912.
Category
Plant Science