To compare the cp genomes assembled from DNA or RNA libraries, we used the mVISTA software, part of the VISTA suite of tools for comparative genomics (http://genome.lbl.gov/vista/mvista/submit.html). This software compares DNA sequences from different species by pairwise alignment and allows the visualization of these alignments with annotation information. The output allows the identification of homologies between sequences, determining the percentage of identity between them using a sliding window of predefined length. We selected default parameters, a RankVISTA probability threshold of 0.5, and the Shuffle-LAGAN mode, which is a global alignment algorithm for finding rearrangements (inversions, transpositions, and some duplications). We used the A. thaliana cpDNA as a reference (NC_000932.1)45. The sequence conservation profiles were visualized in mVISTA plots47.
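As an illustration of the sliding-window identity idea described above, the following minimal sketch computes percent identity along a gapped pairwise alignment. It is not the mVISTA implementation; the function name `window_identity`, the default window of 100 columns, and the choice to count gap columns as non-matching are assumptions made for the example.

```python
def window_identity(ref, qry, window=100, step=1):
    """Percent identity per window of a gapped pairwise alignment.

    `ref` and `qry` are two aligned strings of equal length.
    Hypothetical helper for illustration; mVISTA's own calculation may differ.
    """
    if len(ref) != len(qry):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for start in range(0, len(ref) - window + 1, step):
        ref_win = ref[start:start + window]
        qry_win = qry[start:start + window]
        # Assumption: a column counts as a match only if both bases are
        # identical and neither is a gap.
        matches = sum(1 for a, b in zip(ref_win, qry_win) if a == b and a != "-")
        profile.append((start, 100.0 * matches / window))
    return profile
```

For example, `window_identity("ACGTACGT", "ACGTACGA", window=4, step=4)` yields one value per non-overlapping window of four columns.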
We investigated the degree of within-genome variation of the assembled cp genomes. In particular, we performed a reference-guided assembly in which we remapped the quality-trimmed reads (as for the RNA-Seq assemblies, see above) to each assembled genome using the Geneious R. 11 mapper with medium-low sensitivity and default parameters (http://www.geneious.com)42. We then estimated the percentage of pairwise identity of each assembly. This statistic gives the average identity (as %), computed by comparing all pairs of bases in the same alignment column, scoring a hit when the two bases are identical, and dividing the number of hits by the total number of pairs.
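The pairwise identity statistic can be sketched as follows. This is a minimal illustration of the definition given above, not the Geneious code; the function name `pairwise_identity` and the decision to skip pairs in which both positions are gaps are assumptions.

```python
from itertools import combinations


def pairwise_identity(aligned):
    """Average percent identity over all pairs of bases in each column.

    `aligned` is a list of equal-length aligned sequences (or mapped reads
    padded with '-'). Hypothetical helper for illustration only.
    """
    hits = 0
    total = 0
    for column in zip(*aligned):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # assumption: gap-gap pairs carry no information
            total += 1
            if a == b:
                hits += 1
    return 100.0 * hits / total if total else 0.0
```

With three sequences that differ at a single position in one sequence, two of the three pairs in that column are mismatches, so the statistic drops accordingly.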
For each species, we explored the degree of overall sequence variation found within the three replicas of RNA-Seq assembled genomes and then compared the results to those of a similar analysis that also included the genome assembled from genomic libraries. For this purpose, we estimated the nucleotide diversity (π) among the three replicas of cp genomes assembled from RNA-Seq, and then computed it again including the corresponding genome assembled from the genomic library. Genomes were first aligned using MAFFT with the following parameters: FFT-NS-2 fast progressive algorithm, a scoring matrix of 200PAM/k = 2, a gap open penalty of 1.53, and an offset value of 0.123. Then, we estimated the cpDNA nucleotide diversity using VariScan v.2.0.348.
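For readers unfamiliar with π, it is the average number of nucleotide differences per site between all pairs of sequences in the alignment. The sketch below computes it directly from an alignment; it is not VariScan, and the function name `nucleotide_diversity` and the exclusion of columns containing gaps or ambiguous bases are assumptions made for the example.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Average pairwise differences per site (pi) for aligned sequences.

    Hypothetical helper for illustration; VariScan's handling of gaps and
    missing data may differ from the assumption made here.
    """
    seqs = [s.upper() for s in aligned]
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    # Assumption: only columns made entirely of A, C, G or T are analysed.
    columns = [c for c in zip(*seqs) if all(base in "ACGT" for base in c)]
    if not columns:
        raise ValueError("no analysable sites")
    pairs = list(combinations(range(len(seqs)), 2))
    differences = sum(
        sum(1 for c in columns if c[i] != c[j]) for i, j in pairs
    )
    return differences / (len(pairs) * len(columns))
```

Calling the function once on the three RNA-Seq replicas and once on the same three plus the genomic assembly mirrors the two comparisons described above.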
We studied the degree of sequence variation of some relevant chloroplast genes within the three replicas of RNA-Seq, and then repeated the analysis including the genes assembled from genomic libraries. We first extracted and assembled all the chloroplast genes using the HybPiper pipeline v.1.249. This pipeline uses BWA43 to align reads to target sequences, and SPAdes50 to assemble these reads into contigs. Once the cpDNA genes were obtained, we selected 12 of them: rbcL, psaA, psbA, ndhK, atpA, atpH (with an important function in photosynthesis51), rpoA, rps3, rrn16S, trnH (self-replication genes52), ycf2 (the largest plastid gene in angiosperms53), and matK (the only maturase of higher plants and widely used in angiosperm systematics54). Then, we aligned these genes using MAFFT, as explained above. Lastly, we calculated the percentage of pairwise identity between the genes obtained from the three RNA-Seq replicas, and again including those from genomic libraries.
The size and location of repeat sequences, including palindromic, reverse and direct repeats, within these cp genomes were identified using the REPuter software55. Following Asaf et al.8 and Ni et al.56, REPuter was run with the following settings: Hamming distance of 3; 90% or greater sequence identity; and minimum repeat size of 30 bp.
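To make these settings concrete, note that a Hamming distance of 3 over a 30 bp repeat corresponds to 90% identity. The brute-force sketch below reports pairs of non-overlapping fixed-length windows that satisfy this criterion. It is not REPuter, which uses a far more efficient algorithm and reports maximal repeats; the function names and the fixed-length simplification are assumptions, and the quadratic running time makes it suitable only for short test sequences.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_repeats(seq, size=30, max_dist=3):
    """Naive search for direct, reverse and palindromic repeat pairs.

    Returns (kind, start_i, start_j) for non-overlapping windows of `size`
    bases. Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    windows = [seq[i:i + size] for i in range(len(seq) - size + 1)]
    hits = []
    for i in range(len(windows)):
        for j in range(i + size, len(windows)):  # skip overlapping windows
            other = windows[j]
            if hamming(windows[i], other) <= max_dist:
                hits.append(("direct", i, j))
            if hamming(windows[i], other[::-1]) <= max_dist:
                hits.append(("reverse", i, j))
            # palindromic here means a match to the reverse complement
            if hamming(windows[i], other.translate(COMPLEMENT)[::-1]) <= max_dist:
                hits.append(("palindromic", i, j))
    return hits
```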
Simple sequence repeat (SSR) elements were detected using the Perl script MISA57 by setting the minimum number of repeats to 10, 5, 4, 3, 3, and 3 for mono-, di-, tri-, tetra-, penta- and hexanucleotides, respectively.
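The repeat thresholds above can be illustrated with a short regular-expression search. This sketch is not MISA and does not reproduce its output format or its handling of compound SSRs; the names `MIN_REPEATS` and `find_ssrs` are hypothetical.

```python
import re

# Minimum number of repeats per motif length, as given in the text.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def find_ssrs(seq):
    """Return (motif, repeat_count, start, end) for perfect SSRs in `seq`.

    Hypothetical helper for illustration; coordinates are 0-based, end-exclusive.
    """
    seq = seq.upper()
    found = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A motif followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip motifs that are themselves repeats of a shorter motif
            # (e.g. "AA" or "ATAT"), which belong to a smaller class.
            if any(
                unit == unit[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            found.append(
                (unit, len(match.group(0)) // unit_len, match.start(), match.end())
            )
    return sorted(found, key=lambda record: record[2])
```

Applied to a sequence containing ten consecutive A bases and five consecutive AT units, the function reports one mononucleotide and one dinucleotide SSR.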