Reference genome assembly and annotation

Alex Rajewski; Derreck Carter-House; Jason Stajich; Amy Litt

Improve Research Reproducibility A Bio-protocol resource

Concise Method

This method is extracted from research article: BMC Genomics, Mar 2021

Datura genome reveals duplications of psychoactive alkaloid biosynthetic genes and high mutation rate following tissue culture

DOI: 10.1186/s12864-021-07489-2

All scripts used to assemble and annotate this reference genome are available in a public GitHub repository (https://github.com/rajewski/Datura-Genome).

We first created several short-read-only assemblies using ABySS (v2.0.2) with odd kmer sizes from 33 to 121 bp. We selected k = 101 as the optimal kmer size based on each assembly’s BUSCO score using the embryophyta version 9 lineage dataset [98, 99].
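
The kmer sweep described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function names are hypothetical, the selection rule (highest BUSCO completeness wins) is an assumption about how "optimal" was judged, and the scores in the example are toy values, not results from the study.

```python
def odd_kmer_sizes(start=33, stop=121):
    """Return every odd kmer size in the closed range [start, stop]."""
    first = start if start % 2 == 1 else start + 1
    return list(range(first, stop + 1, 2))


def pick_best_k(busco_complete_by_k):
    """Return the kmer size whose assembly has the highest BUSCO completeness.

    busco_complete_by_k maps kmer size -> percent complete BUSCOs.
    Ties resolve to the first entry encountered (an arbitrary choice).
    """
    return max(busco_complete_by_k, key=busco_complete_by_k.get)


kmer_sizes = odd_kmer_sizes()

# Toy scores for three hypothetical assemblies; NOT data from the study.
toy_scores = {33: 10.0, 35: 30.0, 37: 20.0}
best_k = pick_best_k(toy_scores)
```

In practice each kmer size would correspond to one ABySS run followed by one BUSCO run, and the dictionary would be filled from the BUSCO summaries.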

Following base calling by Guppy, we error-corrected the Nanopore reads using LoRDEC (v0.9) [100]. We then used the optimal ABySS assembly for several iterations of scaffolding, gap-filling, and polishing using LINKS (v1.8.4), RAILS (v1.5.1), and ntEdit (v1.3.0), respectively [101–103]. For LINKS scaffolding, we selected a relatively high kmer size of 19 bp because we were using error-corrected Nanopore reads. We scaffolded with insert sizes of 750 bp, 1 kb, 5 kb, 10 kb, 15 kb, 20 kb, 30 kb, 40 kb, 60 kb, 70 kb, 80 kb, 90 kb, and 100 kb. Gap filling with RAILS also used the LoRDEC-corrected Nanopore reads. After each scaffolding or gap-filling step, we ran ntEdit polishing repeatedly until the number of edits stabilized. The kmer size for ntEdit was 50 bp.
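
The "polish until the number of edits stabilizes" loop can be sketched as below. This is a hedged illustration only: `polish_once` is a hypothetical callable standing in for one polishing run, the stopping rule (two consecutive rounds with the same edit count) is an assumed reading of "stabilized", and `max_rounds` is an arbitrary safety cap. The toy polisher exists only so the loop can be run without any external tool.

```python
def polish_until_stable(assembly, polish_once, max_rounds=10):
    """Repeat polishing until two consecutive rounds report the same edit count.

    polish_once(assembly) must return (new_assembly, number_of_edits).
    Returns the final assembly and the per-round edit counts.
    """
    history = []
    for _ in range(max_rounds):
        assembly, n_edits = polish_once(assembly)
        history.append(n_edits)
        if len(history) >= 2 and history[-1] == history[-2]:
            break
    return assembly, history


def toy_polisher(seq):
    """Toy stand-in for a polisher: fixes about half of the remaining 'N' errors."""
    remaining = seq.count("N")
    to_fix = (remaining + 1) // 2
    return seq.replace("N", "A", to_fix), to_fix


polished, edits_per_round = polish_until_stable("ANNTNNGNC", toy_polisher)
```

A real run would replace `toy_polisher` with a wrapper that launches the polisher and parses its reported edit count; a tolerance-based rule (for example, a change below some percentage) would be an equally reasonable stopping criterion.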

Prior to gene annotation, we used RepeatModeler (v1.0.11) and RepeatMasker (v4-0-7) to generate and soft-mask a preliminary set of repetitive elements in the assembled genome [49, 50]. This set of repetitive elements was excluded from the subsequent gene annotation.
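
Soft masking marks repeats by converting their bases to lowercase while leaving the sequence itself intact, so downstream tools can choose to skip or down-weight those regions. The sketch below shows the idea on a toy sequence; the function name and the 0-based, half-open interval convention are assumptions made for illustration and do not reflect RepeatMasker's internals.

```python
def soft_mask(sequence, repeat_intervals):
    """Lowercase the bases inside each (start, end) repeat interval.

    Intervals are 0-based and half-open; bases outside them are uppercased.
    """
    bases = list(sequence.upper())
    for start, end in repeat_intervals:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower()
    return "".join(bases)


def masked_fraction(masked_sequence):
    """Return the fraction of bases that are lowercase (soft-masked)."""
    if not masked_sequence:
        return 0.0
    return sum(base.islower() for base in masked_sequence) / len(masked_sequence)


masked = soft_mask("ACGTACGTACGT", [(2, 5), (8, 10)])
fraction = masked_fraction(masked)
```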

We applied the funannotate pipeline (v1.6.0) to annotate the assembled genome for protein coding genes and tRNAs [104]. Funannotate is a wrapper for several evidence-based and ab initio gene prediction tools and also includes convenience scripts to simplify submission of genome annotations to data repositories such as NCBI. To train the gene predictors, we provided publicly available RNA sequencing data from NCBI SRA accession SRR9888534, along with the D. stramonium reads from medplantrnaseq.org, and mRNA-seq reads generated for the differential gene expression analyses described in the source article. Following the training step, funannotate ran AUGUSTUS (v3.3), GeneMark-ETS (v4.38), SNAP, and GlimmerHMM (v3.0.4) [105–108]. Funannotate combined these gene prediction outputs with protein evidence and with alignments of transcripts generated by Trinity (v2.8.4) and PASA (v2.3.3), then passed them to EVidenceModeler (v1.1.1), which produced a well-supported annotation of protein coding genes [109–111]. Separately, tRNAscan-SE (v2.0.3) searched for and annotated tRNA loci in the assembled genome [112].

Once the annotation of protein coding genes and tRNA loci was completed, we used the Extensive de novo TE Annotator (EDTA) pipeline to create a more thorough annotation of TIR, LTR, and helitron transposable elements [48]. This analysis made use of the gene annotation information to remove potentially protein coding loci from the transposable element inventory.
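
One way to picture the gene-aware filtering step is as an interval-overlap test: any transposable element candidate that overlaps an annotated gene is dropped. The sketch below is a coordinate-based illustration under that assumption only; it is not EDTA's implementation, and the `Interval` type, function names, and toy coordinates are hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    seqid: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlaps(a, b):
    """Return True if two intervals share at least one base on the same sequence."""
    return a.seqid == b.seqid and a.start <= b.end and b.start <= a.end


def drop_gene_overlaps(te_candidates, genes):
    """Keep only TE candidates that overlap no annotated gene."""
    return [te for te in te_candidates
            if not any(overlaps(te, gene) for gene in genes)]


# Toy coordinates for illustration; not taken from the assembly.
genes = [Interval("scaf1", 100, 500)]
te_candidates = [
    Interval("scaf1", 50, 120),   # overlaps the gene, removed
    Interval("scaf1", 600, 900),  # same scaffold, no overlap, kept
    Interval("scaf2", 100, 500),  # different scaffold, kept
]
kept = drop_gene_overlaps(te_candidates, genes)
```

The nested loop is quadratic, which is fine for a toy example; a genome-scale version would sort the intervals or use an interval tree.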

We used GetOrganelle (v1.7.1) to assemble both organellar genomes [113]. For the plastid genome, we used the previously published D. stramonium plastid assembly (GenBank accession NC_018117) as an alignment seed [61]. To annotate genes as well as the large and small single copy regions and inverted repeat regions, we used GeSeq [63]. For the mitochondrial genome, we used the S. lycopersicum mitochondrial genome (GenBank accession NC_035963) as the seed. To determine the similarity to the reference plastid genome, we aligned the full-length plastid genomes with MAFFT [114].
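
Once two genomes are aligned, one common similarity summary is percent identity over the aligned columns. The text does not state which metric was used, so the sketch below is an assumption: it takes two rows of an existing alignment (for example, from an aligned FASTA file), skips any column containing a gap, and reports the share of matching bases. The function name and the gap handling are illustrative choices.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical bases over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == gap or base_b == gap:
            continue  # gapped columns are ignored in this definition
        compared += 1
        if base_a == base_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment rows; not real plastid sequence.
identity = percent_identity("ACGT-ACGT", "ACGTTACCT")
```

Counting gapped columns as mismatches instead would give a lower, stricter figure; which definition is appropriate depends on the comparison being made.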

We deposited the raw sequencing reads used to assemble this genome in the SRA under NCBI BioProject PRJNA612504. This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession JACEIK000000000.

We summarized gene features and transposable elements with custom R scripts that are available in the public GitHub repository.
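
As an illustration of what such a summary involves, the sketch below counts features by type and computes mean gene length from GFF3-formatted lines. It is written in Python for self-containment and is not a translation of the authors' R scripts; the function name and the toy records are hypothetical. It relies only on the standard GFF3 layout of nine tab-separated columns, with feature type in column 3 and 1-based inclusive start and end in columns 4 and 5.

```python
from collections import Counter


def summarize_gff3(lines):
    """Return (feature counts by type, mean gene length in bp) from GFF3 lines."""
    counts = Counter()
    gene_lengths = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed records
        feature_type, start, end = cols[2], int(cols[3]), int(cols[4])
        counts[feature_type] += 1
        if feature_type == "gene":
            gene_lengths.append(end - start + 1)
    mean_length = sum(gene_lengths) / len(gene_lengths) if gene_lengths else 0.0
    return counts, mean_length


# Toy records for illustration; not taken from the annotation.
toy_gff3 = [
    "##gff-version 3",
    "scaf1\ttoy\tgene\t100\t599\t.\t+\t.\tID=g1",
    "scaf1\ttoy\tmRNA\t100\t599\t.\t+\t.\tID=g1.t1;Parent=g1",
    "scaf1\ttoy\tgene\t1000\t1299\t.\t-\t.\tID=g2",
    "scaf2\ttoy\ttRNA\t10\t82\t.\t+\t.\tID=t1",
]
feature_counts, mean_gene_length = summarize_gff3(toy_gff3)
```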

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
