De novo assembly of the LA2093 genome

Xin Wang; Lei Gao; Chen Jiao; Stefanos Stravoravdis; Prashant S. Hosmani; Surya Saha; Jing Zhang; Samantha Mainiero; Susan R. Strickler; Carmen Catala; Gregory B. Martin; Lukas A. Mueller; Julia Vrebalov; James J. Giovannoni; Shan Wu; Zhangjun Fei

This method is extracted from research article: Nat Commun, Nov 2020

Genome of Solanum pimpinellifolium provides insights into structural variants during tomato breeding

DOI: 10.1038/s41467-020-19682-0

Raw PacBio reads were error-corrected and assembled into contigs using CANU^⁴⁷ (v1.7.1) with default parameters, except that ‘OvlMerThreshold’ and ‘corOutCoverage’ were set to 500 and 200, respectively. PacBio reads were then aligned to the contigs and, based on the alignments, errors in the assembled contigs were corrected using the Arrow program implemented in SMRT-link-5.1 (PacBio). The Illumina paired-end reads were processed with Trimmomatic^⁴⁸ (v0.36) to remove adaptor and low-quality sequences. The cleaned Illumina reads were aligned to the contigs using BWA-MEM^⁴⁹ (v0.7.17) with default parameters and, based on the alignments, two rounds of iterative error correction were performed using Pilon^⁵⁰ (v1.22) with parameters ‘--fix bases --diploid’. The final error-corrected contigs were then compared against the NCBI non-redundant nucleotide database, and those with more than 95% of their length similar to sequences of organelles (mitochondrion or chloroplast) or microorganisms (bacteria, fungi or viruses) were considered contaminants and discarded. The redundans pipeline^⁵¹ (v0.14a) was then used to remove redundancies in the assembled contigs with parameters ‘--identity 0.99 --overlap 0.97’.
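The contaminant screen described above (discarding contigs with more than 95% of their length similar to organelle or microbial sequences) can be illustrated with a minimal Python sketch. The text does not state how the matched fraction was computed, so the interval-merging approach, the input layout (contig name plus 1-based inclusive match coordinates) and the function names `merged_length` and `flag_contaminants` are all assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict


def merged_length(intervals):
    """Count bases covered by 1-based inclusive intervals, counting overlaps once."""
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # A gap: close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered


def flag_contaminants(hits, contig_lengths, threshold=0.95):
    """Return contigs whose matched fraction exceeds the threshold.

    hits: iterable of (contig, start, end) for matches to organelle or
          microbial sequences (hypothetical layout, e.g. parsed from a
          tabular alignment report).
    contig_lengths: dict mapping contig name to its length in bases.
    """
    by_contig = defaultdict(list)
    for contig, start, end in hits:
        # Normalise coordinates so that start <= end for reverse-strand hits.
        by_contig[contig].append((min(start, end), max(start, end)))

    flagged = set()
    for contig, intervals in by_contig.items():
        if merged_length(intervals) / contig_lengths[contig] > threshold:
            flagged.add(contig)
    return flagged


if __name__ == "__main__":
    example_hits = [("ctg1", 1, 60), ("ctg1", 50, 98), ("ctg2", 1, 50)]
    example_lengths = {"ctg1": 100, "ctg2": 100}
    print(sorted(flag_contaminants(example_hits, example_lengths)))
```

Merging overlapping matches before dividing by contig length avoids counting the same bases twice when several database sequences hit the same region of a contig.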

To scaffold the assembled contigs, Illumina reads from the Hi-C library were processed with Trimmomatic^⁴⁸ (v0.36) to remove adaptor and low-quality sequences. The cleaned Hi-C reads were aligned to the assembled contigs and the alignments were filtered using the Arima-HiC mapping pipeline (https://github.com/ArimaGenomics/mapping_pipeline). Based on the alignments, the contigs were clustered into pseudomolecules using SALSA^⁵² (v2.2) with parameters ‘-e GATC -i 3’. In parallel, the LA2093 contigs were also assembled into pseudomolecules by comparing them with the Heinz1706 reference genome^²⁰ (version 4.0) using RaGOO^⁶ (v1.1). Inconsistencies between the pseudomolecules constructed using the Hi-C data and those constructed using synteny with the Heinz1706 genome were identified. The mis-joined scaffolds were manually corrected based on the Hi-C contact information, genome synteny information, and a genetic map constructed from a recombinant inbred line (RIL) population with LA2093 as one of the parents^¹⁶, resulting in a consensus set of LA2093 pseudomolecules. Finally, the genetic map was also used to validate the final consensus set of LA2093 pseudomolecules using ALLMAPS^⁵³ (v0.8.12). Inconsistencies between the LA2093 pseudomolecules and the genetic map were also manually checked, and the accuracy of the LA2093 pseudomolecules was further validated using PacBio read alignment information.
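One way to picture the comparison between Hi-C-based and synteny-based pseudomolecules is sketched below in Python. It flags Hi-C scaffolds whose member contigs are assigned to more than one reference chromosome by the synteny-based placement. The two dictionaries, their contents and the function name `find_conflicting_scaffolds` are hypothetical; the text does not describe how inconsistencies were detected, and the flagged scaffolds were corrected manually in the original work, so this only produces candidates for inspection.

```python
from collections import Counter, defaultdict


def find_conflicting_scaffolds(hic_scaffold_of, ref_chrom_of):
    """Return Hi-C scaffolds whose contigs map to several reference chromosomes.

    hic_scaffold_of: dict mapping contig name to its Hi-C scaffold
                     (hypothetical summary of the Hi-C scaffolding result).
    ref_chrom_of: dict mapping contig name to the reference chromosome it
                  was placed on by synteny (hypothetical summary).
    The result maps each conflicting scaffold to per-chromosome contig counts.
    """
    chroms_per_scaffold = defaultdict(Counter)
    for contig, scaffold in hic_scaffold_of.items():
        chrom = ref_chrom_of.get(contig)
        if chrom is not None:
            # Contigs left unplaced by synteny carry no evidence either way.
            chroms_per_scaffold[scaffold][chrom] += 1

    return {
        scaffold: dict(counts)
        for scaffold, counts in chroms_per_scaffold.items()
        if len(counts) > 1
    }


if __name__ == "__main__":
    hic = {"c1": "s1", "c2": "s1", "c3": "s1", "c4": "s2", "c5": "s2"}
    ref = {"c1": "chr1", "c2": "chr1", "c3": "chr5", "c4": "chr2"}
    print(find_conflicting_scaffolds(hic, ref))
```

A flagged scaffold may reflect either a Hi-C mis-join or a genuine structural difference from the reference, which is why independent evidence such as Hi-C contact information and a genetic map is needed to decide between the two.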

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
