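A Pipeline for Non-model Organisms for de novo Transcriptome Assembly, Annotation, and Gene Ontology Analysis Using Open Tools: Case Study with Scots Pine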

Gustavo T. Duarte; Polina Yu. Volkova; Stanislav A. Geras’kin

doi:10.21769/BioProtoc.3912

Peer-reviewed

Published: Feb 05, 2021, Vol 11, Iss 3. DOI: 10.21769/BioProtoc.3912

Reviewed by: Antony Chettoor, Xinlei Li, Christian Sailer

The authors used this protocol in: Environmental Pollution, Jul 2019

Abstract

RNA sequencing (RNA-seq) has opened up the possibility of studying virtually any organism at the whole-transcriptome level. Nevertheless, the absence of a sequenced and accurately annotated reference genome may be an obstacle to applying this technique to non-model organisms, especially those with a complex genome. While de novo transcriptome assembly can circumvent this problem, it is often computationally demanding. Furthermore, transcriptome annotation and Gene Ontology enrichment analysis are often laborious without an automated system. Here we describe, step by step, the pipeline that was used to perform the transcriptome assembly, annotation, and Gene Ontology analysis of Scots pine (Pinus sylvestris), a gymnosperm species with a complex genome. Using only free software available to the scientific community and running on a standard personal computer, the pipeline is intended to facilitate transcriptomic studies of non-model species while remaining flexible enough to be used with any organism.

Keywords: RNA-seq, De novo assembly, Non-model organism, Pinus sylvestris, Gymnosperm, Transcriptome tutorial

Background

Non-model organisms are valuable for environmental and evolutionary studies. However, the absence of a closely related model species to serve as a reference for transcriptomic studies was a limitation until the development of de novo assembly methods. De novo assemblers brought non-model organisms into the omics era, but the required analyses may be computationally demanding and are often time-consuming. This is especially the case for testing, evaluating, and establishing the pipeline that performs those analyses. The cost of acquiring a computing server and/or software for automated data processing, which could speed up the process, may be prohibitive for some research groups. We detail here the procedures that were employed for the transcriptome assembly, annotation, and gene ontology (GO) analysis of Scots pine (Pinus sylvestris), an organism with a complex genome (Duarte et al., 2019). The pipeline was developed based on works and benchmark evaluations that describe the best practices for transcriptome studies (Conesa et al., 2016; Honaas et al., 2016; Geniza and Jaiswal, 2017). The pine transcriptome analyses were performed on a machine with 32 GB of RAM and 8 cores. The outputs of three different assemblers, namely BinPacker (Liu et al., 2016), SOAPdenovo-Trans (Xie et al., 2014), and Trinity (Grabherr et al., 2011), were evaluated individually and in combination. For the pine transcriptome assembly, the best result was obtained by combining the outputs of the different assemblers and then filtering for redundancies with EvidentialGene (Gilbert, 2020). The selection and combination of assemblers should nevertheless be adjusted according to the input data, and the decision can be made based on the quality assessment that is described step by step here.
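
The idea of combining the outputs of several assemblers before redundancy filtering can be sketched in Python as below. This is a minimal illustration, not the procedure used in the study: the function names and the choice of prefixing each sequence ID with the assembler name are assumptions, made only so that contigs that happen to share an ID across assemblies remain distinguishable in the merged set.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_assemblies(
    assemblies: Dict[str, Iterable[str]]
) -> Iterator[Tuple[str, str]]:
    """Merge several assemblies, prefixing each ID with its assembler name.

    `assemblies` maps a label (hypothetical, e.g. "trinity") to the lines
    of that assembler's FASTA output.
    """
    for label, lines in assemblies.items():
        for header, sequence in read_fasta(lines):
            seq_id = header.split()[0]  # drop any description after the ID
            yield f"{label}_{seq_id}", sequence


def write_fasta(records: Iterable[Tuple[str, str]], handle) -> None:
    """Write (id, sequence) pairs to an open text handle, one line each."""
    for seq_id, sequence in records:
        handle.write(f">{seq_id}\n{sequence}\n")
```

In practice each value in `assemblies` would be an open file handle for one assembler's output, and the merged file written by `write_fasta` would then be passed to the redundancy-filtering step.
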
InterProScan (Jones et al., 2014) performs a comprehensive signature annotation of the predicted proteins, while BLAST+ (Camacho et al., 2009) allows intra- and interspecific mapping for functional comparisons. For this reason, we annotated the transcriptome of P. sylvestris (Duarte et al., 2019) by combining InterProScan’s protein signatures with several BLAST+ searches, and integrated the results with Trinotate (Bryant et al., 2017). The unique GO identifiers were retrieved with a Python script that we make available, and were used for GO enrichment analysis with BiNGO (Maere et al., 2005). The pipeline is based on tools that are open to the scientific community. While it is especially useful for studies with non-model organisms, which lack closely related and well-annotated species to guide the assembly and annotation, the pipeline is flexible enough to be applied to virtually any organism.
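
To illustrate what retrieving unique GO identifiers per transcript involves, here is a minimal Python sketch. It is not the authors' GO retriever script. It assumes a tab-delimited annotation report with a `transcript_id` column and one or more columns whose names start with `gene_ontology`, with GO terms written as `GO:` followed by seven digits; column names differ between Trinotate versions, so check them against your own report.

```python
import csv
import re
from collections import defaultdict
from typing import Dict, List, TextIO

# A GO identifier is "GO:" followed by seven digits.
GO_ID = re.compile(r"GO:\d{7}")


def collect_go_terms(
    handle: TextIO, id_column: str = "transcript_id"
) -> Dict[str, List[str]]:
    """Map each transcript ID to its sorted, de-duplicated GO identifiers.

    Assumption: GO annotations live in columns whose names start with
    "gene_ontology" (case-insensitive); unannotated cells hold no GO ID.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    go_columns = [
        name
        for name in (reader.fieldnames or [])
        if name.lower().startswith("gene_ontology")
    ]
    terms = defaultdict(set)
    for row in reader:
        for column in go_columns:
            terms[row[id_column]].update(GO_ID.findall(row[column] or ""))
    # Keep only transcripts that have at least one GO identifier.
    return {tid: sorted(ids) for tid, ids in terms.items() if ids}
```

Called with an open report file, the function returns a dictionary such as `{"t1": ["GO:0003677", "GO:0005634"]}`, which can then be written out in whatever format the enrichment tool expects.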

Software

Software – Linux OS:

Several programs used in the protocol, such as BLAST+, Bowtie2, FASTA Splitter, and BUSCO, can easily be managed through the Bioconda channel (Grüning et al., 2018; https://bioconda.github.io/). Bioconda is a repository of packages specialized in bioinformatics, and it requires the conda package manager to be installed. The manager automatically installs all dependencies for a given program and makes it available in your PATH. Software support via Bioconda is continuously updated, and we suggest always checking whether the required software is available through this repository. After installing conda, add the channels in the recommended order (please refer to https://bioconda.github.io/user/install.html#install-conda). Other software will need to be installed individually and may require adjustments depending on your system. Next, we list the software required for each step, which is summarized in the flowchart (Figure 1).
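
Channel order matters because `conda config --add channels` places each new channel at the top of the list, so the channel added last gets the highest priority. The Python sketch below builds the corresponding commands from a priority list. The list used in the usage note (conda-forge above bioconda) is an assumption; take the current order from the Bioconda installation page linked above. The function names are hypothetical.

```python
import subprocess
from typing import List, Sequence


def channel_commands(priority_order: Sequence[str]) -> List[List[str]]:
    """Build conda commands for channels listed from highest to lowest priority.

    Each `--add` puts a channel at the top of the list, so channels are
    added in reverse: the highest-priority channel goes in last.
    """
    commands = [
        ["conda", "config", "--add", "channels", channel]
        for channel in reversed(priority_order)
    ]
    commands.append(["conda", "config", "--set", "channel_priority", "strict"])
    return commands


def configure_channels(priority_order: Sequence[str], dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them with conda."""
    for command in channel_commands(priority_order):
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
```

For example, `configure_channels(["conda-forge", "bioconda"])` prints the commands without changing anything; pass `dry_run=False` to apply them once the order has been checked against the Bioconda documentation.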

Figure 1. Flowchart illustrating the five steps for transcriptome assembly, annotation, and Gene Ontology analysis. The input for starting each step is shown in the boxes, and the main procedures of each step are described in black. The software required for each step is listed in blue. For the transcriptome assembly (B), users may choose the software they prefer. The three assemblers described in this pipeline were used for the Scots pine study (Duarte et al., 2019), and are marked as optional [opt].

Data pre-processing
1. FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) – quality control of sequencing files
2. FastQ Screen (https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen) – screening of sequencing files
3. Trimmomatic (Bolger et al., 2014; available via Bioconda) – trimming of sequencing files
4. Trinity package (Grabherr et al., 2011) – in silico normalization

Transcriptome assembly
1. BinPacker (Liu et al., 2016; https://sourceforge.net/projects/transcriptomeassembly/) – assembler
2. EvidentialGene (Gilbert, 2020; http://arthropods.eugenes.org/EvidentialGene/) – assembly filtering
3. Galaxy platform (Afgan et al., 2018; usegalaxy.org) – open source online platform for data research
4. SOAPdenovo-Trans (Xie et al., 2014; https://github.com/aquaskyline/SOAPdenovo-Trans) – assembler
5. Trinity package (Grabherr et al., 2011; available in Galaxy platform) – assembler

Quality assessment
1. Bowtie2 (Langmead and Salzberg, 2012; available via Bioconda) – short read aligner
2. BLAST+ (Camacho et al., 2009; available via Bioconda) – tool for comparing sequences and searching databases
3. BUSCO (Simão et al., 2015; available via Bioconda) – assembly completeness assessment
4. Samtools (Li et al., 2009; https://github.com/samtools/samtools) – data manipulation
5. DETONATE (Li et al., 2014; available via Bioconda) – evaluation of de novo transcriptome assemblies
6. Trinity package (Grabherr et al., 2011) – estimation of full-length transcripts in the assembly

Transcriptome annotation
1. BLAST+ (Camacho et al., 2009; available via Bioconda) – tool for comparing sequences and searching databases
2. Bowtie2 (Langmead and Salzberg, 2012; available via Bioconda) – short read aligner
3. Corset (Davidson and Oshlack, 2014; available via Bioconda) – clustering of read counts
4. FASTA splitter (http://kirill-kryukov.com/study/tools/fasta-splitter/) – creation of subsets of a FASTA file
5. getorf (Rice et al., 2000; http://emboss.sourceforge.net/) – prediction of open reading frames
6. HMMER (http://hmmer.org/) – prediction of domains for annotation
7. InterProScan (Jones et al., 2014; https://www.ebi.ac.uk/interpro/) – protein domains identifier
8. RNAmmer (Lagesen et al., 2007; http://www.cbs.dtu.dk/services/RNAmmer) – ribosomal RNA annotation
9. SignalP (Petersen et al., 2011; https://services.healthtech.dtu.dk/) – prediction of signal peptides
10. TMHMM (Krogh et al., 2001; https://services.healthtech.dtu.dk/) – protein transmembrane helix prediction
11. TransDecoder (Haas et al., 2013; http://transdecoder.github.io) – identification of coding regions
12. Trinotate (Bryant et al., 2017; https://trinotate.github.io/) – suite for transcriptome annotation
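
Creating subsets of a FASTA file, as FASTA splitter does, lets long-running annotation searches be run on smaller chunks. The Python sketch below only illustrates the idea and is not a replacement for that tool: the function names are hypothetical, and records are dealt out round-robin so that the parts end up with similar numbers of sequences.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_records(records: Iterable[Record], n_parts: int) -> List[List[Record]]:
    """Distribute records round-robin into n_parts lists."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    parts: List[List[Record]] = [[] for _ in range(n_parts)]
    for index, record in enumerate(records):
        parts[index % n_parts].append(record)
    return parts
```

Each returned part can be written to its own file and searched independently, with the per-part results concatenated afterwards.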

Gene Ontology analysis
1. BiNGO (Maere et al., 2005; download available via Cytoscape platform) – Gene Ontology analysis
2. Cytoscape (Shannon et al., 2000; https://cytoscape.org/) – platform for performing the Gene Ontology analysis
3. GO retriever – provided script for retrieving transcript ID - Gene Ontology ID relationships
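
Before starting, it can help to confirm that the command-line tools for each step are actually reachable in your PATH. The Python sketch below does this with the standard library. The executable names in `STEP_TOOLS` are assumptions about how the programs are typically exposed after installation and cover only some of the software listed above; adjust them to match your system.

```python
import shutil
from typing import Callable, Dict, List, Optional

# Assumed executable names for a subset of the listed software; edit as needed.
STEP_TOOLS: Dict[str, List[str]] = {
    "Data pre-processing": ["fastqc", "fastq_screen", "trimmomatic"],
    "Quality assessment": ["bowtie2", "blastn", "busco", "samtools"],
    "Transcriptome annotation": ["blastp", "corset", "getorf", "hmmscan"],
}


def missing_tools(
    tools_by_step: Dict[str, List[str]],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, List[str]]:
    """Return, per step, the executables that cannot be found in PATH."""
    missing: Dict[str, List[str]] = {}
    for step, tools in tools_by_step.items():
        absent = [tool for tool in tools if which(tool) is None]
        if absent:
            missing[step] = absent
    return missing


if __name__ == "__main__":
    report = missing_tools(STEP_TOOLS)
    if not report:
        print("All listed tools were found in PATH.")
    for step, tools in report.items():
        print(f"{step}: missing {', '.join(tools)}")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.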

Procedure

How to cite

Duarte, G. T., Volkova, P. Y. and Geras’kin, S. A. (2021). A Pipeline for Non-model Organisms for de novo Transcriptome Assembly, Annotation, and Gene Ontology Analysis Using Open Tools: Case Study with Scots Pine. Bio-protocol 11(3): e3912. DOI: 10.21769/BioProtoc.3912.