Comparative analysis among cp genomes assemblies

Carolina Osuna-Mascaró; Rafael Rubio de Casas; Francisco Perfectti

This method is extracted from research article: Sci Rep, Nov 2018

Comparative assessment shows the reliability of chloroplast genome assembly using RNA-seq

DOI: 10.1038/s41598-018-35654-3

To compare the cp genomes assembled from DNA or RNA libraries, we used the mVISTA software, part of the VISTA suite of tools for comparative genomics (http://genome.lbl.gov/vista/mvista/submit.html). This software compares DNA sequences from different species by pairwise alignment and allows the visualization of these alignments with annotation information. The output allows the identification of homologies between sequences, determining the percentage of identity between them using a sliding window of predefined length. We selected default parameters, a RankVISTA probability threshold of 0.5, and the Shuffle-LAGAN mode, which is a global alignment algorithm for finding rearrangements (inversions, transpositions, and some duplications). We used the A. thaliana cpDNA as a reference (NC_000932.1)^⁴⁵. The sequence conservation profiles were visualized in mVISTA plots^⁴⁷.
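The sliding-window identity described above can be illustrated with a minimal Python sketch. It is not mVISTA's implementation: the function name `sliding_window_identity`, the window and step values, and the choice to count gap columns as mismatches are all assumptions made for illustration.

```python
def sliding_window_identity(aligned_a, aligned_b, window=100, step=1):
    """Percent identity of two aligned sequences in each sliding window.

    Hypothetical helper, not part of mVISTA. Both inputs must come from
    the same pairwise alignment, so they have equal length and use '-'
    for gaps. Assumption: a column containing a gap counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    a, b = aligned_a.upper(), aligned_b.upper()
    # 1 where the column holds two identical non-gap bases, 0 otherwise
    matches = [1 if x == y and x != "-" else 0 for x, y in zip(a, b)]
    profile = []
    for start in range(0, len(matches) - window + 1, step):
        hits = sum(matches[start:start + window])
        profile.append((start, 100.0 * hits / window))
    return profile


# Toy alignment with a single mismatch in the first window
profile = sliding_window_identity("ACGTACGTAC", "ACGTTCGTAC", window=5, step=5)
print(profile)
```

Each tuple in the returned list holds the window start (0-based alignment column) and the percent identity in that window, which is the quantity plotted along the y-axis of a conservation profile.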

We investigated the degree of within-genome variation of the assembled cp genomes. In particular, we performed a reference-guided assembly in which we remapped the quality-trimmed reads (as for the RNA-Seq assemblies, see above) to each assembled genome using the Geneious R11 mapper^⁴² (http://www.geneious.com) with medium-low sensitivity and default parameters. We then estimated the percentage of pairwise identity of each assembly. This statistic gives the average identity (as %), computed by scoring a hit for every pair of bases in the same alignment column that are identical and dividing the number of hits by the total number of pairs.
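As a rough illustration of the pairwise identity statistic, the following Python sketch counts identical pairs column by column. It is not Geneious code: the function name `pairwise_identity` is hypothetical, and the treatment of gaps (gap-versus-gap pairs ignored, gap-versus-base pairs counted as mismatches) is an assumption that may differ from what Geneious does.

```python
from itertools import combinations


def pairwise_identity(aligned_seqs):
    """Average pairwise identity (%) over all columns of an alignment.

    Hypothetical helper. Assumptions: gap-gap pairs are skipped and
    gap-base pairs are scored as mismatches.
    """
    if len(aligned_seqs) < 2:
        raise ValueError("need at least two aligned sequences")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    hits = 0
    pairs = 0
    for column in zip(*(s.upper() for s in aligned_seqs)):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # assumption: a pair of gaps carries no information
            pairs += 1
            if x == y:
                hits += 1
    return 100.0 * hits / pairs if pairs else 0.0


# Three toy sequences: one differs from the others at the last column
print(pairwise_identity(["ACGT", "ACGT", "ACGA"]))
```

In the toy example, 10 of the 12 base pairs are identical, so the statistic is 10/12 expressed as a percentage.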

For each species, we explored the degree of overall sequence variation found within the three replicas of RNA-Seq assembled genomes and then compared the results to those of a similar analysis that also included the genome assembled from genomic libraries. For this purpose, we estimated the nucleotide diversity (π) among the three replicas of cp genomes assembled from RNA-Seq, and then computed it again including the corresponding genome assembled from the genomic library. Genomes were first aligned using MAFFT with the following parameters: the FFT-NS-2 algorithm (fast progressive method), a scoring matrix of 200PAM/k = 2, a gap open penalty of 1.53, and an offset value of 0.123. Then, we estimated the cpDNA nucleotide diversity using VariScan v.2.0.3^⁴⁸.
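For readers unfamiliar with π, it is the average number of nucleotide differences per site between all pairs of sequences in the alignment. The Python sketch below computes it directly from aligned sequences. It is not VariScan's implementation: the function name `nucleotide_diversity` is hypothetical, and excluding every column that contains a gap or an ambiguous base (complete deletion) is an assumption.

```python
from itertools import combinations


def nucleotide_diversity(aligned_seqs):
    """Nucleotide diversity (pi): mean pairwise differences per site.

    Hypothetical helper. Assumption: columns with any gap or ambiguous
    base are dropped before counting (complete deletion).
    """
    seqs = [s.upper() for s in aligned_seqs]
    if len(seqs) < 2:
        raise ValueError("need at least two aligned sequences")
    if any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where every sequence has an unambiguous base
    columns = [c for c in zip(*seqs) if all(base in "ACGT" for base in c)]
    if not columns:
        raise ValueError("no usable alignment columns")
    n_pairs = len(seqs) * (len(seqs) - 1) // 2
    total_diffs = sum(
        1 for col in columns for x, y in combinations(col, 2) if x != y
    )
    return total_diffs / (n_pairs * len(columns))


# Three toy "replicas": one differs from the others at a single site
print(nucleotide_diversity(["ACGT", "ACGA", "ACGT"]))
```

Running the function once on the three RNA-Seq replicas and once more with the genomic assembly added to the list mirrors the two-step comparison described above.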

We studied the degree of sequence variation of some relevant chloroplast genes within the three replicas of RNA-Seq, and then repeated the analysis including the genes assembled from genomic libraries. We first extracted and assembled all the chloroplast genes using the HybPiper pipeline v.1.2^⁴⁹. This pipeline uses BWA^⁴³ to align reads to target sequences, and SPAdes^⁵⁰ to assemble these reads into contigs. Once cpDNA genes were obtained, we selected 12 genes out of the total: rbcL, psaA, psbA, ndhK, atpA, atpH (with an important function in the photosynthesis process^⁵¹), rpoA, rps3, rrn16S, trnH (as self-replication genes^⁵²), ycf2 (the largest plastid gene in angiosperms^⁵³), and matK (the only maturase of higher plants and widely used in angiosperm systematics^⁵⁴). Then, we aligned these genes using MAFFT, as explained above. Lastly, we calculated the percentage of pairwise identity between the genes obtained from the three RNA-Seq replicas, and again including those from genomic libraries.

The size and location of repeat sequences, including palindromic, reverse and direct repeats, within these cp genomes were identified using REPuter software^⁵⁵. Following Asaf et al.^⁸ and Ni et al.^⁵⁶, REPuter was run with the following settings: Hamming distance of 3; 90% or greater sequence identity; and minimum repeat size of 30 bp.
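Note that at the minimum repeat size of 30 bp, a Hamming distance of 3 corresponds to 27 of 30 matching positions, i.e. 90% identity. The Python sketch below shows what these thresholds mean using a naive window comparison. It is not REPuter's algorithm, which is far more efficient and reports maximal repeats; the function names are hypothetical, the windows are required not to overlap by assumption, and the quadratic scan is only suitable for short test sequences.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_repeats(seq, min_len=30, max_mismatch=3):
    """Naive scan for window pairs that qualify as repeats.

    Hypothetical helper, not REPuter. Returns (i, j, kind) tuples, where
    i and j are 0-based window starts and kind is 'direct', 'reverse' or
    'palindromic'. Assumption: the two windows must not overlap.
    """
    seq = seq.upper()
    last_start = len(seq) - min_len
    hits = []
    for i in range(last_start + 1):
        left = seq[i:i + min_len]
        for j in range(i + min_len, last_start + 1):
            right = seq[j:j + min_len]
            if hamming(left, right) <= max_mismatch:
                hits.append((i, j, "direct"))
            if hamming(left, right[::-1]) <= max_mismatch:
                hits.append((i, j, "reverse"))
            if hamming(left, reverse_complement(right)) <= max_mismatch:
                hits.append((i, j, "palindromic"))
    return hits


# Toy sequence: a 30 bp unit, a spacer, then its reverse complement
unit = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
toy = unit + "TTTTTTTTTT" + reverse_complement(unit)
print([h for h in find_repeats(toy) if h[2] == "palindromic"])
```

Because the scan reports every qualifying window pair rather than merged maximal repeats, a long repeat would appear as a run of consecutive hits.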

Simple sequence repeat (SSR) elements were detected using the Perl script MISA^⁵⁷ by setting the minimum number of repeats to 10, 5, 4, 3, 3, and 3 for mono-, di-, tri-, tetra-, penta- and hexanucleotides, respectively.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
