PacBio data were processed and evaluated with several tools in SMRT Analysis (v2.2.0 and v2.3.0) and in-house pipelines (supplementary fig. S16, Supplementary Material online). Briefly, the raw reads in fastq format were extracted from h5 format by bash5tool in the pbh5tools packages. The suggested parameters (MinReadScore = 0.75, MinSRL = 50, MinRL = 50) were adopted to filter and trim the raw reads. The Circular Consensus module in ConsensusTools was then applied to extract ROI reads with the minimal predicted accuracy (minPredictedAccuracy) set at 75. Particularly, the parameter “minFullPasses = 0” was used to increase the sensitivity in ROI reads identification.
Primer sequences flanking ROI reads were trimmed by the hmmer_wrapper tool with the parameters “–primer_search_window 150 –min-score 8 –must-see-polyA –left-nosee-ok,” according to the sequences of primers used during library preparation. In this procedure, primer-trimmed reads were further processed to scan the PA tail (reads with a PA tail were referred to as PA-containing reads, otherwise as PA-free reads). PA-containing reads were trimmed to obtain PA-trimmed reads. To recover the possible PA-containing reads missed by hmmer_wrapper possibly due to the stringent parameter sets, PA-free reads defined by hmmer_wrapper were further processed using in-house scripts. Briefly, a 50-bp window was set at the tail of the read and slid toward the 5′ end. The read was trimmed off from the 3′ end of the sliding window until the A base fraction in the window was ≤0.65. 15,389 reads could be trimmed were recovered as PA-containing reads and pooled with the previous PA-trimmed reads identified by hmmer_wrapper to obtain the final PA-trimmed reads available for subsequent analysis. Reports of pre-mapping evaluation were generated with tools in SMRT Analysis, such as filter_stats, filter_subread_summary, and reads_of_insert_report (supplementary figs. S1 and S3, Supplementary Material online). GC content, base quality, sequencing error, DNA, or pre-mRNA contamination were also evaluated with in-house scripts (supplementary figs. S1 and S3, Supplementary Material online).
PA-trimmed reads were then mapped to the reference genome (hg19 and rheMac2) by GMAP (Wu and Watanabe 2005) with the parameters “–min-intronlength 70 –intronlength 1,100,000 –totallength 2,500,000 –trimendexons 9” for hg19, and “–min-intronlength 70 –intronlength 800,000 –totallength 2,000,000 –trimendexons 15” for rheMac2. The alignments generated were then filtered by in-house scripts, using the parameters (Coverage ≥67% and Identity ≥75%) consulted a previous study (Tilgner etal. 2013).
Next, several additional procedures were conducted to remove ambiguous reads. First, nonuniquely mapped reads were discarded using an alignment score defined as “Coverage × Identity.” For reads with multiple hits, those with an alignment score of the secondary hit <0.8 times of the score of the best hit were retained and defined as uniquely mapped reads. The threshold of 0.8 is actually more stringent than the 0.98 used in (Tilgner etal. 2013). Second, conflicting reads were discarded on the basis of the strand information. We first adjusted the strand information according to the sequence features of the Iso-seq reads. Briefly, due to the circular mode of Iso-seq, an Iso-seq read may be presented as the reverse complementary sequence of the transcript, and a stretch of T bases may be found at the front of the read. If a certain number of T bases could be found at the front of a read (with parameters similar to define the PA tail), the T bases were considered as PA tail and the read sequence was converted into its reverse complementary sequence accordingly. In addition, for spliced read, if the “Splicing Site Score” of its opposite strand was higher than that of the current strand by 2 scores, the strand was modified to the opposite strand. Here, “Splicing Site Score” was measured by summing the number of canonical splice site motifs (GT, GC, and AT for donor; AG and AC for acceptor). After these adjustments, the strand to which the read was mapped should be the same with the annotated transcript. Accordingly, if the strand was opposite to the annotation due to possible mis-mapping, this conflicting read was discarded. For unspliced read, more stringent criteria were used to control for false-positives, in that only reads with high degree of overlap with RefSeq transcripts (>80%) were kept.
Due to the intrinsically high error rate in PacBio sequencing, severe errors near the splice junction could introduce three types of alignment error: 1) missing a mini-exon, 2) adding an extra exon, and 3) misjudging an exon–intron boundary. For conflicting PacBio alignments, junction information from RNA-seq, as well as RefSeq Genes (release 78) for human and revised Ensembl annotations for rhesus macaque (Zhang etal. 2014) were further combined with the PacBio alignments to correct these errors. Briefly, for exons annotated by RefSeq but absent from PacBio alignments, we tried to find the missed exonic region in the two adjacent exons upstream and downstream of the candidate missed exon. We summed the lengths of extra exonic regions in the two adjacent exons in PacBio alignment relative to the RefSeq annotation. If the length of the candidate missing exon was comparable with the summed length (ratio between 50% and 200%), it was included in the PacBio alignment on the basis of RefSeq annotations. Similarly, a candidate extra exon was removed on the basis of RefSeq annotations. For conflicting exon–intron boundaries defined by PacBio, the junctions defined by PacBio alignments, RNA-seq, and RefSeq annotations were compared, and the ends of each junction in PacBio alignment were revised to the site with the most supporting evidence among these three types of junctions. The PacBio alignments were further filtered by discarding those with any exon overlapping with or adjacent to a genomic gap.
To eliminate false positives in PA identification due to DNA contamination or internal priming events, procedures similar to PA tail-trimming were conducted on the regions downstream of the 3′ end of the alignment to identify an A-rich stretch on the genome. Alignments were kept only when the PA tail was 20-bp longer than that of the A-rich stretch on the genome. In some cases, reads were aligned by soft clipping due to sequencing errors at the end of the reads, which may have some influence on the identification of PA sites. Alignments with a 3′-end clipping length >30 bp were discarded. Overall, PacBio alignments passing all the filters were defined as processed PacBio alignments and used in the subsequent analysis.
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.