ESAP Plus was constructed to identify SSR markers and design primers from a large EST collection. It comprises four main processes: EST pre-processing, EST clustering and assembly, SSR identification, and SSR primer design (Fig. 1). For these processes, we downloaded and installed standalone versions of the software tools and wrote shell scripts to handle some tasks, as recommended in [9]. To manage the EST-SSR primer design, we wrote nine in-house shell scripts (Additional file 1). These shell scripts are controlled by four main core scripts that automate the whole process.
ESAP Plus workflow. The system is based on a three-tier architecture comprising a user interface tier, a processing logic tier, and a database tier. Users interact with the pipeline via the web interface to submit input data and run the pipeline. The processing logic tier consists of multiple tasks that facilitate the analysis of input ESTs. All data and results from the pipeline are stored in the database tier. Users can view and download the stored data through the ESAP Plus web interface.
EST pre-processing is the first process in the proposed pipeline and screens for high-quality ESTs. It has four sub-processes: (1) EST formatting, (2) length and %N detection with EST removal, (3) vector detection and removal, and (4) low-complexity masking. The EST formatting module converts multiple raw data formats and merges them into a single text file with multiple FASTA entries. A Perl script, called xtract.pl, parses the raw EST input sequences and converts them into a combined FASTA text file with a “.txt” extension. The second module screens for high-quality ESTs (sequences of ≥100 bp with <5% unknown nucleotides) [9]. A Perl script, called length_N.pl, checks the length and the number of unknown nucleotides in each EST sequence, and low-quality sequences are removed. The third module, vector detection and removal, uses the SeqClean software [14] and the NCBI UniVec database [31]. SeqClean searches the 3′ and 5′ ends of the input EST sequences and removes regions that are highly similar (>92% identity) to vector, adaptor, primer or linker sequences listed in the UniVec database. The low-complexity masking module identifies repeat sequences and masks them for removal. To do this, we installed RepeatMasker [16] and included it in our pipeline to check the EST sequences from the previous module. RepeatMasker uses RMBLAST (version 2.2.28) to search the RepBase database [30] for interspersed repeats, repetitive elements and low-complexity DNA sequences. Users can adjust the RepeatMasker filtering options, such as the RepBase DNA source and the masking and repeat options, or keep the default parameters. By default, the DNA source is human, repetitive sequences are replaced with lowercase letters, and both interspersed and simple repeats are masked. ESTs containing low-complexity regions are then removed automatically by an in-house PHP script of the pipeline.
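The length and %N screening rule described above can be illustrated with a minimal Python sketch. This is not the authors' length_N.pl script, which is written in Perl; the function names (parse_fasta, is_high_quality, filter_ests) are hypothetical, and only the two thresholds (≥100 bp, <5% unknown nucleotides) come from the text.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_high_quality(seq, min_length=100, max_n_fraction=0.05):
    """True if seq has at least min_length bases and fewer than
    max_n_fraction of them are unknown (N)."""
    if not seq or len(seq) < min_length:
        return False
    n_count = seq.upper().count("N")
    return n_count / len(seq) < max_n_fraction


def filter_ests(fasta_text):
    """Keep only the FASTA records that pass the length and %N screen."""
    return [(h, s) for h, s in parse_fasta(fasta_text) if is_high_quality(s)]
```

Note that the comparison is strict, so a 100 bp sequence with exactly five N bases (5%) is rejected, matching the "<5%" wording.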
High-quality EST data from the pre-processing stage are passed to the EST clustering and assembly stage. This stage offers two alternative workflows based on different algorithms: CD-HIT-EST [18] and TGICL [17]. CD-HIT-EST clusters the ESTs and then selects the NR cluster containing non-redundant EST candidates. TGICL produces non-redundant assembled sequences (AS), which are the consensus sequences from both contigs and singletons. The EST clustering cutoff parameters of both CD-HIT-EST and TGICL are adjustable (the default is 95% identity). The resulting non-redundant EST candidates from either CD-HIT-EST or TGICL are the input of the subsequent SSR mining step.
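As an illustration of the CD-HIT-EST branch, the following Python sketch assembles a command line with the 95% identity default. It assumes the standard cd-hit-est options -i (input FASTA), -o (output prefix) and -c (sequence identity threshold); the function name and file names are hypothetical, and the pipeline's own shell scripts may pass further options that are not described here.

```python
def build_cd_hit_est_command(input_fasta, output_fasta, identity=0.95):
    """Return the argument list for a cd-hit-est run at the given identity cutoff."""
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be a fraction in (0, 1]")
    return [
        "cd-hit-est",
        "-i", input_fasta,        # high-quality ESTs from pre-processing
        "-o", output_fasta,       # non-redundant representative sequences
        "-c", f"{identity:.2f}",  # clustering cutoff, 0.95 by default
    ]


# Hypothetical file names, for illustration only.
command = build_cd_hit_est_command("clean_ests.fasta", "nr_ests.fasta")
```

If cd-hit-est is installed, the list can be passed to subprocess.run(command, check=True).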
We offer two algorithms for the SSR mining step: MISA [20] and RepeatMasker [16]. MISA can identify both perfect SSRs and compound SSRs (SSRs interrupted by a certain number of bases) [20]. Users can set the MISA parameters for SSR identification or use the defaults, under which a candidate SSR must have at least six di-nucleotide repeats or at least five tri-, tetra-, penta- or hexa-nucleotide repeats. We also identify candidate SSRs using the RepeatMasker software. The results from both algorithms are used to design EST-SSR primer pairs.
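To make the default repeat thresholds concrete, the sketch below finds perfect SSRs with regular expressions. It is not MISA and does not reproduce its algorithm: it ignores compound SSRs, and the function name and output tuple layout are assumptions. Only the minimum repeat counts (six for di-nucleotide motifs, five for tri- to hexa-nucleotide motifs) are taken from the text.

```python
import re

# Motif length -> minimum number of repeats (defaults quoted in the text).
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_perfect_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (motif, repeat_count, start, end) for each perfect SSR in seq.

    Coordinates are 0-based and end-exclusive.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT").
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            count = len(match.group(0)) // unit_len
            hits.append((motif, count, match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])
```

For example, a sequence containing (AT)6 and (AGC)5 yields two hits, whereas (AT)5 alone falls below the di-nucleotide threshold and yields none.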
EST-SSR sequences obtained from SSR mining with both MISA and RepeatMasker are sent to BatchPrimer3 [28], which uses the Primer3 core [26] to design primers. To reduce false-positive primers, BatchPrimer3 incorporates SSR filtering that uses the SSRIT algorithm [21] to select high-quality templates for primer design. Users can set the BatchPrimer3 parameters for SSR screening and primer design or use the defaults. The default SSR screening cutoff requires at least six di-nucleotide repeats or at least five tri-, tetra-, penta- or hexa-nucleotide repeats. The default BatchPrimer3 parameters for primer design are as follows: product size of 150–300 bp, primer size of 18–27 bp, primer melting temperature between 57 °C and 63 °C, and primer GC content between 50% and 80%. The primer design results, along with other intermediate results produced by the proposed pipeline, are stored in the ESAP Plus MySQL database.
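The default primer design limits can be expressed as a small check, sketched below in Python. This is not BatchPrimer3 or Primer3 code: the names DEFAULT_LIMITS, gc_percent and check_primer are hypothetical, the GC bounds are read as percentages, and the melting temperature is assumed to be supplied by the caller (for example from the primer design output) rather than computed here.

```python
# Default limits quoted in the text (inclusive bounds).
DEFAULT_LIMITS = {
    "product_size": (150, 300),   # bp
    "primer_size": (18, 27),      # bp
    "primer_tm": (57.0, 63.0),    # degrees Celsius
    "primer_gc": (50.0, 80.0),    # percent
}


def gc_percent(seq):
    """Percentage of G and C bases in seq."""
    if not seq:
        raise ValueError("empty primer sequence")
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def check_primer(primer, tm_celsius, product_size, limits=DEFAULT_LIMITS):
    """Report, per criterion, whether a primer falls within the limits."""
    def within(value, bounds):
        low, high = bounds
        return low <= value <= high

    return {
        "product_size": within(product_size, limits["product_size"]),
        "primer_size": within(len(primer), limits["primer_size"]),
        "primer_tm": within(tm_celsius, limits["primer_tm"]),
        "primer_gc": within(gc_percent(primer), limits["primer_gc"]),
    }
```

Returning one flag per criterion, rather than a single pass/fail value, makes it easier to see which limit a rejected primer violates.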