Iso-Seq is an application of Pacific Biosciences (PacBio) long-read sequencing technology that uses cDNA templates to generate high quality reads for full-length transcripts. The high error rate generally associated with PacBio sequencing is drastically reduced using circular consensus sequencing (CCS), which uses hairpin adapters on each end of a double-stranded molecule to create a circular, single-stranded topology (Au et al. 2012; Rhoads and Au 2015; Hestand et al. 2016; Wenger et al. 2019). This topology allows the polymerase to read the same full-length molecule multiple times over, generating an accurate consensus sequence (Ono et al. 2013; Wang et al. 2019). PacBio Iso-Seq has been used to study the transcriptomes of many organisms, often in the context of identifying splice variants, or alternative transcripts (Gordon et al. 2015; Rhoads and Au 2015; Xu et al. 2015; Abdel-Ghany et al. 2016; Guo et al. 2016; Wang et al. 2016; Weirather et al. 2017). Alternative transcripts can be identified using CCS because this technology obtains consensus sequences for full-length single transcripts (Zhao et al. 2019). In the same way, CCS can also be used to distinguish paralogs or gene duplicates.
To create an Iso-Seq library for S. noctiflora, the four RNA extractions (1.5 µg each) were pooled into a single sample and sent to the Arizona Genomics Institute for PacBio Iso-Seq library construction and sequencing. The library was constructed on the pooled RNA sample using Poly(A) selection, following the standard PacBio Iso-Seq protocol (“Procedure & Checklist—Iso-Seq Template Preparation for Sequel Systems,” Pacific Biosciences, PN-101-070-200 Version 06, September 2018), and then was sequenced with a PacBio Sequel (first generation) platform on two SMRT Cells.
Raw movie files of long-read, single-molecule sequences (one per SMRT Cell) were processed using the PacBio Iso-Seq v3.1 pipeline (Anvar et al. 2018; https://www.pacb.com/products-and-services/analytical-software/rna-sequencing/). Circular consensus sequence calling was performed on each movie file separately using the command ccs with the recommended parameters –noPolish and –minPasses 1. Next, primer removal was performed on each dataset by running the command lima with parameters –isoseq and –no-pbi. Poly(A) tails were trimmed and concatemers were removed using the refine command with the parameter –require-polya. Data from the two cells were merged at this point using the commands dataset create –type TranscriptSet and dataset create –type SubreadSet. Finally, the merged data were run through the cluster and polish commands. We also ran the cluster and polish commands on each dataset individually after skipping the merge step.
Trinotate v3.2.0 (Bryant et al. 2017) was used to annotate the final polished sequences produced by the Iso-Seq pipeline after merging the datasets. To complete this process, we used Transdecoder v5.5.0 (https://github.com/TransDecoder/TransDecoder/wiki), SQLite v3 (Kreibich 2010), NCBI BLAST + v2.2.29 (Camacho et al. 2009), HMMER v3.2.1 (including RNAMMER) (Lagesen et al. 2007; Potter et al. 2018), signalP v4 (Petersen et al. 2011), and tmhmm v2 (Krogh et al. 2001). The Pfam (Bateman et al. 2004) and UniProt (UniProt Consortium 2015) databases were included in the Trinotate installation. The transcripts and Transdecoder-predicted peptides were searched against the respective databases, following the standard Trinotate pipeline. All of these results were loaded into a Trinotate SQLite database.
Cogent v4.0.0 (https://github.com/Magdoll/Cogent/wiki) and minimap2 v2.17 (Li 2018) were used to conduct family findings on the final sequences by the Iso-Seq pipeline by partitioning sequences into groups based on similarity. While the Iso-Seq pipeline collapses reads into individual transcripts, it does not collapse alternative transcripts originating from the same gene. Cogent further collapses alternative transcripts into groups, where each group is meant to represent a single gene. Next, coding genome reconstruction was performed on each group from the above step; thus, the Cogent output included both a file containing groups of alternative transcripts (final.partition.txt at https://github.com/alissawilliams/Silene_noctiflora_IsoSeq) and a transcript-based genome. Finally, this transcript-based genome was used to determine total gene and isoform (alternative transcript) counts via cDNA_Cupcake scripts (https://github.com/Magdoll/cDNA_Cupcake/wiki; Jeffries et al. 2020; Wang et al. 2020). A modified form of the script make_file_for_sampling_from_collapsed.py was run with the parameter –include_single_exons in order to include all transcripts in the analysis. Gene and isoform counts were calculated using custom Python and R scripts on the resultant file. These Cogent, minimap2, and cDNA_Cupcake steps were performed on the merged dataset as well as individually on the datasets from each SMRT Cell.
We used genes from the plastid caseinolytic protease (Clp) as a case study to assess the ability of Iso-Seq dataset to distinguish paralogs (gene duplicates) of various levels of divergence. To identify nuclear-encoded plastid Clp core genes in our dataset, we used blastn in conjunction with the Cogent output. There are eight nuclear-encoded plastid Clp core genes in Arabidopsis thaliana: CLPP3-6 and CLPR1-4 (Nishimura and van Wijk 2015). In addition, the genus Silene shares a duplication of CLPP5, denoted CLPP5A and CLPP5B (Rockenbach et al. 2016). We obtained the sequences of all nine of these genes from a previous study (Rockenbach et al. 2016) and used them as queries in blastn searches against the S. noctiflora Iso-Seq transcriptome. We then identified which groups of collapsed alternative transcripts (from the Cogent output) contained these BLAST hits. BLAST hits for eight of the nine nuclear-encoded Clp core subunits in Silene (including CLPP5A and CLPP5B) were found in a single Cogent group. The sequences within each group were confirmed to represent a single gene via alignment and manual inspection; thus, these eight core subunits are single copy in S. noctiflora. However, in the case of CLPR2, two different Cogent groups contained relevant transcripts, indicating a possible case of gene duplication. Sequence alignment and manual inspection of the transcripts within these two Cogent groups revealed that one group contained two unique sequences. These data, along with sequencing results from a separate project in which we cloned two versions of S. noctiflora CLPR2 using primers designed for S. latifolia CLPR2, suggested that there are actually three distinct CLPR2 sequences in S. noctiflora. In the subsequent phylogenetic analysis of CLPR2, we used the longest sequences from each of the three identified groups.
A phylogenetic tree was constructed using sequences from the three different S. noctiflora CLPR2 genes. In addition to the three S. noctiflora sequences, we also included Agrostemma githago, S. conica, S. latifolia, Silene paradoxa, and Silene vulgaris CLPR2 sequences from a previous study (Rockenbach et al. 2016), as well as three S. undulata CLPR2 sequences identified using blastn against the S. undulata TSA database (accession GEYX00000000). All 11 sequences were aligned using the einsi option in MAFFT v7.222 (Katoh and Standley 2013), and trimmed at the 5′ end based on the trimming conducted in Rockenbach et al. (2016). The resultant sequence file was run through jModelTest v2.1.10 (Darriba et al. 2012) to choose a model of sequence evolution. We chose the top model based on the Bayesian Information Criterion (K80+I) and ran PhyML v3.3 (Guindon et al. 2010) with 1000 bootstrap replicates and 100 random starts.
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.