Raw reads were produced through base calling and stored in fastq format. The raw data became clean after filteration by removing the adapter sequences, reads with unkown nucleotides (N) more than 10% and low quality sequences (base quality score<Q20). Clean data with high quality were the basis of following analyses.
De novo assembly of clean reads was carried out using Trinity software (http://trinityrnaseq.sf.net). Trinity was usually consisted of three independent software modules: Inchworm, Chrysalis and Butterfly. Using this software, sequencing data were partitioned into many single de Bruijn graphs (each represented transcriptional complexity for a given gene). Full-length splicing isoforms and transcripts from paralogous genes were obtained after the independent processing of graphs. At this period, the k-mer value was set to 25. The longest transcript from the same component was only preserved as a contig for excluding the interference from alternative splicing of transcripts. The assembled sequences were defined as unigenes.
The prediction of unigenes was performed by mapping to protein-coding sequences using GetORF of EMBOSS [27]. The predicted protein-coding sequences were annotated to the NCBI non-redundant (Nr) protein database and UniProtKB database, using BLASTp with algorithm with an E-value threshold of 1e-5. Gene encoding protein domains were identified by searching against Swiss-Prot and TrEMBL databases through BLASTp program. GO function for all unigenes was classified using Gene2go of GoPipe program [28]. KEGG (Kyoto Encyclopedia of Genes and Genomes) metabolic pathway annotation and COG (Clusters of Orthologous Groups) classification of unigenes were determined by searching against KEGG database and COG database using BLAST algorithm, respectively.
Differential expression of unigenes in the two turbot libraries was analyzed using the MA-plot-based method with Random Sampling model (MARS) in DEGseq R package [29]. P value was adjusted by means of q value. Q value<0.001&|log2 (fold change)| >1 was set as the threshold for significantly differential expression. GO enrichment analysis of the differentially expressed genes (DEGs) was performed using a hyper geometric distribution test. GO term with false discovery rate (FDR)≤0.01 was defined as the term of significantly enriching DEGs.
Molecular markers, including SSRs and SNPs, were detected after mapping all clean reads to the assembled transcripts. The set of unique sequences was searched for SSR markers using MISA (http://pgrc.ipk-gatersleben.de/misa/misa.html). The minimum repeat number used for this search was eight for dinucleotide, five for tri-, four for tetra- and three for penta- and hexanucleotide microsatellites. SSR-containing ESTs were identified as candidates for marker development if they presented enough flanking sequences on either side of the repeats for primer design using Primer 3 (http://primer3.sourceforge.net/releases.php). Putative SNP detection was performed with SOAPsnp software. For identification of potential SNPs, various parameters such as base quality score and read depth were optimized. The following criteria were selected as the final SNP sets: read depth of four and the minimum variant frequency of two, variations compared to the consensus sequence were counted as SNPs. Furthermore, they were considered statistically significant at FDR/tested p-value<0.1.
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.
Tips for asking effective questions
+ Description
Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.