2.1. CryptoGenotyper Development

Christine A. Yanta
Kyrylo Bessonov
Guy Robinson
Karin Troell
Rebecca A. Guy

We developed the CryptoGenotyper, a Python (v3.6) program, to perform fast, accurate and reproducible analysis of raw Sanger sequencing data. Because heterozygous peaks are common in the SSU rRNA gene sequence, we incorporated a heterogeneity detection algorithm into the portion of the tool that targets the SSU rRNA region. This algorithm was inspired by the Mixed Sequence Reader developed by Chang et al. (2012), which uses heterozygous base-calling to distinguish mixed sequences within human papillomavirus (HPV) samples by comparing them against reference sequences to identify indels (insertions-deletions), single nucleotide polymorphisms and sequence repeats. The program's workflow is shown in Fig. 1.

CryptoGenotyper schematic workflow.

The program begins with the user inputting the gene target, sample chromatograms and database (optional). (a) If the chromatograms correspond to the gp60 gene target, the sequence is retrieved by analyzing the fluorescent channel intensities. A homology search is performed against the reference database and the repeat region is calculated. (b) If the chromatograms correspond to the SSU rRNA gene, the sequence is computed based on the log ratio of intensity and converted to IUPAC nucleotide codes where double peaks appear. Afterwards, the sequence is decomposed with Indelligent (Dmitriev and Rakitov, 2008) and a homology search is performed using BLAST against the reference database. If mixed sequences are detected, they are classified by following the protocol outlined in Chang et al. (2012) for determining the most possible variances (MPVs) and optimal combinations. For both markers, the sequence and the species and/or subtype information are output.

The CryptoGenotyper is available through two interfaces: web-based (Galaxy) and command-line (GitHub and Bioconda). The tool's Galaxy interface can be accessed on the Galaxy Europe server (https://usegalaxy.eu/) or installed through the admin interface of a user's own Galaxy instance. Workflows (available from https://github.com/phac-nml/CryptoGenotyper/tree/main/CryptoGenotyper/GalaxyWorkflows) can then be imported to easily analyze multiple samples at once. Once set up, the user uploads their sequencing data to Galaxy using the Get Data feature. To analyze more than one sample, the user must create a dataset list (forward or reverse mode) or a list of dataset pairs (contig mode). From the CryptoGenotyper graphical user interface, the user can select the corresponding marker (SSU rRNA or gp60), reference database (custom or default), and samples (single, collection, paired collection) before executing the tool (Fig. 2A–E). The corresponding results then appear in the Galaxy History panel (Fig. 2F). The command-line version has the same functionality, except that the user must specify either the directory containing all samples to be analyzed or the path to a single sample file. To perform a contig mode analysis on a directory with multiple files, the user must also provide the forward and reverse primer names that appear in the chromatogram filenames so that the tool can identify the forward and reverse sequence for each sample. Alternatively, if a directory contains sample files from the same sequencing primer pair (e.g. SSUF and SSUR) or forward or reverse analysis mode is set, the primer names can be omitted.
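The primer-based pairing used in contig mode can be illustrated with a minimal sketch. It is not the tool's own code: the function name `pair_reads` and the assumption that the sample name is the filename stem with the primer name removed are hypothetical and chosen only for illustration.

```python
from pathlib import Path

def pair_reads(filenames, forward_primer, reverse_primer):
    """Group .ab1 filenames into (forward, reverse) pairs keyed by sample name.

    Assumption (illustrative only): the sample name is the filename stem
    with the primer name removed and any leftover '_' or '-' stripped.
    """
    pairs = {}
    for name in filenames:
        stem = Path(name).stem
        if forward_primer in stem:
            primer, slot = forward_primer, 0
        elif reverse_primer in stem:
            primer, slot = reverse_primer, 1
        else:
            continue  # file does not carry either primer name
        key = stem.replace(primer, "").strip("_-")
        pairs.setdefault(key, [None, None])[slot] = name
    # Keep only samples for which both the forward and reverse read were found
    return {k: tuple(v) for k, v in pairs.items() if None not in v}

files = ["S1_SSUF.ab1", "S1_SSUR.ab1", "S2_SSUF.ab1"]
paired = pair_reads(files, "SSUF", "SSUR")
```

In this example only sample S1 has both reads, so S2 would be left out of a contig analysis.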

Graphical user interface of Galaxy tool implementation.

The user must upload Sanger sequence reads (.ab1) using the Get Data feature under Galaxy Tools. The sequence names will appear in the history on the right. To build a contig, forward and reverse reads must be inputted as a dataset pair. For multiple samples to be processed at once, the reads must be inputted as a list. Then the following must be selected: (A) the gene marker (SSU rRNA or gp60), (B) the reference database (default or custom) and (C) the type of sequences (forward only, reverse only, or forward and reverse). Afterwards, the appropriate sequencing files (D) must be selected. Once all information has been entered, the execute button (E) launches the analysis. Final typing results appear in the History (F) as two entries: the extracted FASTA sequence(s) and a tab-delimited text report file for easy reporting. Workflows that concatenate results from multiple samples are available at https://github.com/phac-nml/CryptoGenotyper. The Galaxy tool implementation can be accessed at https://usegalaxy.eu/.

The CryptoGenotyper accepts raw Sanger sequencing chromatogram data file(s) in .ab1 file format as input. These data can correspond to three different sequencing read formats: forward, reverse, or contig. Both forward and reverse mode accept either forward (5′–3′) or reverse (3′–5′) reads, whereas the contig mode requires two files, one containing the forward read and one containing the reverse read. The gene target, either SSU rRNA or gp60, must then be selected.

The CryptoGenotyper tool contains two validated reference databases: an SSU rRNA database and a gp60 database. We manually curated these databases to ensure they contain representative Cryptosporidium reference sequences for species (Table S1) or subtypes (Table S2) that have been described to date on NCBI. The current version of the SSU rRNA database (v1.0) contains unique reference genotypes that were selected based on the criteria set out by Ruecker et al. (2012). To elaborate, each representative sequence contains the polymorphic region within the 613–810 base pair region of C. parvum (AF164102.1), measures at least 400 base pairs in length, contains no more than two ambiguities, has a definitive source, and does not contain any cloned PCR products. The gp60 database contains a representative sequence of all accepted subtypes to date, collected through personal communication with Prof. Lihua Xiao in addition to those described by Xiao and Feng (2017). These subtypes were chosen to be included in the database as they are complete sequences (the trinucleotide repeat region, repetitive sequence (if applicable), and conserved region are present), contain no ambiguities, and were previously used as a reference sequence in other studies.

If a user wishes to use their own database, a corresponding FASTA (.fa) file must be provided as input. The user can either download the existing CryptoGenotyper database available from https://github.com/phac-nml/CryptoGenotyper and append any new sequences or create their own database entirely. The tool will build a BLAST database from the sequences in the imported FASTA file and perform all homology searches against that database. If no database file is provided, the tool will default to the most current version of the validated database.

Once all parameters are inputted into the program, the relevant sequencing information is extracted from each .ab1 file. Both the fluorescent channel intensities and the Phred quality score at each base location are processed and recorded using Biopython library functions (Cock et al., 2009). To remove the low-quality bases that inherently exist at the ends of Sanger sequences, both the 5′ and 3′ regions are trimmed by scanning the read with a 5-base sliding window and cutting where the average Phred quality drops below 20 (99% base call accuracy).
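The end-trimming step can be sketched as follows. This is one possible reading of the description above rather than the tool's actual implementation: the function name `trim_ends` is hypothetical, and the retained region is taken to run from the first to the last 5-base window whose mean Phred score is at least 20. The input is a plain list of per-base Phred scores, such as the list Biopython's "abi" parser exposes under `letter_annotations["phred_quality"]`.

```python
def trim_ends(quals, window=5, min_q=20):
    """Return (start, end) so that quals[start:end] is the retained region.

    Assumed interpretation: keep everything between the first and the last
    sliding window whose average Phred quality is >= min_q.
    """
    n = len(quals)
    passing = [
        i for i in range(n - window + 1)
        if sum(quals[i:i + window]) / window >= min_q
    ]
    if not passing:
        return (0, 0)  # no window passes; nothing is retained
    return (passing[0], passing[-1] + window)

# Toy quality profile: poor ends, good middle (values are illustrative)
quals = [5, 8, 10, 12, 30, 35, 40, 40, 38, 36, 34, 10, 8, 6, 4]
start, end = trim_ends(quals)
trimmed = quals[start:end]
```

Because the decision is made per window, a few lower-quality bases can remain at the edges of the retained region; a stricter variant could additionally trim individual bases below the threshold.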

To reflect the heterozygous peaks in the sequence from the chromatogram, the log ratio of intensity (LRi) is calculated for each base location by taking the log base 2 of the quotient of the major fluorescence intensity to the minor fluorescence intensity as described by Chang et al. (2012). According to Chang et al. (2012), an LRi value close to zero reflects two heterozygous peaks with similar intensities. We use an LRi cutoff value of 2.0 in the CryptoGenotyper as the threshold for calling a ‘mixed’ peak based on fluorescent ratios (Chang et al., 2012). Peaks with LRi values less than or equal to 2.0 are then converted to IUPAC nucleotide code to represent the multiple nucleotides present at that location.
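As a minimal sketch of the LRi calculation described above, the snippet below computes log2(major / minor) from the two strongest channels at one position and emits a two-base IUPAC code when the value is at or below the 2.0 cutoff. The function name `call_base` and the dictionary input format are assumptions made for illustration, and handling of three or more overlapping peaks is not covered.

```python
import math

# Standard IUPAC codes for two-base ambiguities
IUPAC_PAIRS = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def call_base(intensities, cutoff=2.0):
    """Return (base_or_IUPAC_code, LRi) for one chromatogram position.

    intensities: dict mapping 'A', 'C', 'G', 'T' to the peak intensity.
    """
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (major, i_major), (minor, i_minor) = ranked[0], ranked[1]
    if i_minor <= 0:
        return major, math.inf  # no secondary signal at all
    lri = math.log2(i_major / i_minor)
    if lri <= cutoff:
        return IUPAC_PAIRS[frozenset((major, minor))], lri
    return major, lri

mixed = call_base({"A": 1000, "G": 800, "C": 20, "T": 10})   # similar peaks
clean = call_base({"A": 1000, "G": 100, "C": 20, "T": 10})   # dominant peak
edge = call_base({"C": 400, "T": 100, "A": 0, "G": 0})       # LRi exactly 2.0
```

Here the first position is called as R (A/G), the second as A, and the third, whose major peak is exactly four times the minor peak, falls on the cutoff and is called as Y (C/T).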

Once the entire sequence is parsed, it is decomposed using the Indelligent algorithm (Dmitriev and Rakitov, 2008). This algorithm converts the IUPAC code to two corresponding bases, outputting a major and minor sequence (Dmitriev and Rakitov, 2008). For those bases that were ambiguous to the algorithm, major and minor bases are determined based on the intensities of the fluorescent signals.

The major and minor sequences are then queried using BLAST against the SSU rRNA reference database using the following parameters: word_size = 11, match = 1, mismatch = −2, gap_open = 5, gap_extend = 2. This determines the most possible variances (MPVs), which are then categorized as mixed sequences or indels and the optimal combinations are calculated as described by Chang et al. (2012).
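For readers who want to reproduce a comparable search manually, the parameters above map onto BLAST+ `blastn` options as sketched below. This assumes the command-line BLAST+ suite is used; the text does not state how the tool invokes BLAST internally. The function name, file paths and the tabular output format (`-outfmt 6`) are illustrative choices.

```python
def blastn_command(query_fasta, db_path, out_path):
    """Build a blastn argument list using the scoring parameters from the text."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", db_path,
        "-word_size", "11",   # word_size = 11
        "-reward", "1",       # match = 1
        "-penalty", "-2",     # mismatch = -2
        "-gapopen", "5",      # gap_open = 5
        "-gapextend", "2",    # gap_extend = 2
        "-outfmt", "6",       # tabular output (an illustrative choice)
        "-out", out_path,
    ]

# Hypothetical file names; the command is only constructed here, not executed
cmd = blastn_command("major_minor.fa", "ssu_reference_db", "hits.tsv")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where BLAST+ is installed and the database has been built.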

To determine the gp60 sequence, the algorithm records the channel with the highest fluorescence at each base location. Afterwards, the gp60 sequence is divided into two regions: the microsatellite region and the varying region.

The program searches the microsatellite region linearly, using a sliding window of three nucleotides, to determine which repeating trinucleotide sequence is encountered based on an array of previously determined trinucleotide repeat sequences. Once the repeat sequence is determined, the program searches linearly and counts the number of times each repeat appears. Finally, the repeat region is reported in standard notation as described in detail in Xiao and Feng (2017) and illustrated in Chalmers et al. (2009).
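The repeat-counting step can be sketched as below. It assumes the usual gp60 convention in which TCA, TCG and TCT repeats are reported as A, G and T, respectively; the text does not list the tool's actual array of repeat sequences, so this mapping, the function names, and the simplification of starting at the first occurrence of any known repeat are assumptions. Additional repeat elements (such as the R designation) are not handled.

```python
# Assumed mapping from trinucleotide repeat to its letter in the subtype name
REPEAT_CODES = {"TCA": "A", "TCG": "G", "TCT": "T"}

def count_repeats(seq):
    """Count consecutive in-frame trinucleotide repeats in a sequence."""
    seq = seq.upper()
    counts = {"A": 0, "G": 0, "T": 0}
    # Slide a 3-nt window until the first known repeat is found
    start = next(
        (i for i in range(len(seq) - 2) if seq[i:i + 3] in REPEAT_CODES),
        None,
    )
    if start is None:
        return counts
    i = start
    # Step codon by codon while the window still holds a known repeat
    while seq[i:i + 3] in REPEAT_CODES:
        counts[REPEAT_CODES[seq[i:i + 3]]] += 1
        i += 3
    return counts

def repeat_notation(counts):
    """Format counts as, for example, 'A3G2' (zero counts are omitted)."""
    return "".join(f"{code}{n}" for code, n in counts.items() if n > 0)

toy_seq = "GG" + "TCA" * 3 + "TCG" * 2 + "AC"  # synthetic example sequence
notation = repeat_notation(count_repeats(toy_seq))
```

For the synthetic sequence above, three TCA and two TCG repeats give the notation A3G2, which would follow the subtype family designation in a full subtype name.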

The varying region of the sequence undergoes a homology search with the following BLAST parameters: match = 1, mismatch = −2, gap_open = 5, gap_extend = 2. The subtype family is determined based on the result with the highest bit score and percent identity.
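Selecting the reported hit can be illustrated with a short sketch. How the tool combines bit score and percent identity is not specified in the text; here bit score is assumed to be the primary criterion and percent identity the tie-breaker. The dictionary keys and the placeholder hits are hypothetical.

```python
def best_hit(hits):
    """Return the hit with the highest bit score; ties go to higher percent identity."""
    if not hits:
        return None
    return max(hits, key=lambda h: (h["bitscore"], h["pident"]))

# Placeholder hits; subject names and values are illustrative only
hits = [
    {"subject": "ref_A", "bitscore": 850.0, "pident": 99.1},
    {"subject": "ref_B", "bitscore": 850.0, "pident": 99.8},
    {"subject": "ref_C", "bitscore": 790.0, "pident": 100.0},
]
top = best_hit(hits)
```

In this toy example ref_A and ref_B share the top bit score, so the higher percent identity selects ref_B even though ref_C has a perfect identity over a lower-scoring alignment.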

Once the species and/or subtype is determined, the program outputs the results in FASTA and tab-delimited file formats. The FASTA file (Fig. 3) contains a header indicating the tool's parameters (gene target, mode, version of the reference database, primer names) that were used to produce the results. Following the FASTA format guidelines, the sample name and the species (SSU rRNA) or species and subtype (gp60) are written to the output FASTA file, followed by the sequence. This output file is written so that it can be submitted directly to BLAST for further investigation.

The CryptoGenotyper FASTA output file.

One of the result files the CryptoGenotyper generates is a FASTA (.fa) file. For both gene target analyses, a header is outputted at the beginning of the file indicating the run parameters (reference file, program mode, forward and reverse primer names). (A) For the SSU rRNA gene target analysis, the sample name and species identified along with its corresponding sequence are outputted. (B) For the gp60 gene target analysis, the output consists of the sample name, species, and subtype, followed by the sequence. This file is designed to allow the user to input it directly into BLAST for further analysis, if desired.

The second output file contains the results in a tab-delimited manner (Fig. 4). For each sample analyzed, the following information is recorded and outputted: sample name, sequence type, species, subtype (if applicable), sequence, comments (Table 1), average Phred quality (gp60 only), bit score, query length (bp), query coverage (%), e-value, percent identity and accession number.
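A tab-delimited report with the fields listed above could be produced along the following lines. The column identifiers, their order and the function name are assumptions made for illustration; the tool's actual header text may differ.

```python
import csv
import io

# Hypothetical column identifiers based on the fields listed in the text
COLUMNS = [
    "sample_name", "sequence_type", "species", "subtype", "sequence",
    "comments", "avg_phred_quality", "bit_score", "query_length",
    "query_coverage", "e_value", "percent_identity", "accession",
]

def write_report(rows, handle):
    """Write one header line plus one tab-delimited line per sample."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t",
        lineterminator="\n", restval="",  # fields not supplied stay empty
    )
    writer.writeheader()
    writer.writerows(rows)

# Placeholder record with only two fields filled in
buffer = io.StringIO()
write_report([{"sample_name": "sample_1", "sequence_type": "forward"}], buffer)
report_text = buffer.getvalue()
```

Leaving unsupplied fields empty keeps every row at the same column count, which suits fields such as subtype or average Phred quality that apply to only one gene target.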

The CryptoGenotyper tab-delimited (.txt) output file.

The CryptoGenotyper also generates a tab-delimited text file (.txt) with each analysis. (A) For the SSU rRNA gene target analysis, the sample name, analysis mode (forward, reverse, contig), whether the chromatogram had mixed sequences detected, species, sequence, comments, and the BLAST statistics (bit score, query length, query coverage, e-value, percent identity, and accession number of the nearest BLAST hit) are recorded. (B) For the gp60 gene target analysis, the sample name, analysis mode (forward, reverse, contig), species, subtype, sequence, comments, average Phred quality of the chromatograms and the BLAST statistics (as described for SSU rRNA) are outputted.

CryptoGenotyper warning messages and their explanations.

If the CryptoGenotyper is unable to decipher the raw data from the Sanger sequence chromatograms, whether due to poor quality or the presence of unusual artifacts, the tool will output a warning message indicating that the sample in question should be interpreted manually (Table 1).
