Reads processing and identification of differentially expressed genes (DEGs)

Yongyao Fu; Liping Yang; Haihong Gao; Xu Wenji; Qiang Li; Hongqun Li; Jian Gao

This method is extracted from research article: PLoS One, Oct 2020

Comparative transcriptome analysis reveals heat stress-responsive genes and their signalling pathways in lilies (Lilium longiflorum vs. Lilium distichum)

DOI: 10.1371/journal.pone.0239605

Clean reads were obtained with Trimmomatic by removing adaptor-only reads, reads containing more than 5% unknown nucleotides, and low-quality reads, judged by their percentage of low-quality bases (base quality ≤ 10). We subsequently aligned the clean, high-quality reads to SSU and LSU rRNA sequences using BWA. After rRNA removal, the clean reads from the heat-tolerant L. longiflorum were assembled de novo with Trinity, using default settings except for the K-mer value [28]. For reads present in multiple copies, only one copy was kept for assembly and the redundant duplicates were eliminated. Trinity then joined overlapping nucleotide sequences into contigs. To obtain unigenes, the paired-end reads were realigned to the contigs and the paired-end information was used to construct scaffolds. Trinity then assembled the contigs belonging to one transcript, and sequences that could not be extended at either end were defined as unigenes.
To obtain as much descriptive information as possible for the assembled sequences, all unigenes were annotated by BLASTx searches against the Swiss-Prot protein database and the National Center for Biotechnology Information non-redundant protein (Nr) database, with thresholds of E-value < 1e-10, identity > 70%, and query coverage ≥ 80%; all other parameters were left at their defaults. In general, we used BLAST alignment of transcripts (or translations of predicted ORFs from transcripts) to reference protein sets as a means of assessing coding transcript completeness. Transcripts with ≥ 80% sequence coverage (i.e. a significant alignment between a transcript sequence from our assembly and a target protein sequence, where the alignment covers at least 80% of the target protein sequence) were thus considered “full or near-full length”.
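
The annotation thresholds and the "full or near-full length" criterion can be illustrated with a minimal Python sketch. It assumes each BLASTx hit is a single alignment described by its start and end coordinates on the unigene and on the target protein. The `BlastHit` class, its field names, and the helper functions are hypothetical and are not part of the original pipeline.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """One BLASTx alignment; class and field names are illustrative only."""
    query_id: str
    subject_id: str
    percent_identity: float  # percent identical positions in the alignment
    evalue: float
    query_start: int         # 1-based alignment start on the unigene (nt)
    query_end: int           # 1-based alignment end on the unigene (nt)
    query_length: int        # unigene length (nt)
    subject_start: int       # 1-based alignment start on the protein (aa)
    subject_end: int         # 1-based alignment end on the protein (aa)
    subject_length: int      # protein length (aa)


def query_coverage(hit: BlastHit) -> float:
    """Percent of the unigene spanned by the alignment."""
    aligned = abs(hit.query_end - hit.query_start) + 1
    return 100.0 * aligned / hit.query_length


def target_coverage(hit: BlastHit) -> float:
    """Percent of the target protein spanned by the alignment."""
    aligned = abs(hit.subject_end - hit.subject_start) + 1
    return 100.0 * aligned / hit.subject_length


def passes_annotation_filter(hit: BlastHit,
                             max_evalue: float = 1e-10,
                             min_identity: float = 70.0,
                             min_query_cov: float = 80.0) -> bool:
    """Thresholds from the text: E-value < 1e-10, identity > 70%, query coverage >= 80%."""
    return (hit.evalue < max_evalue
            and hit.percent_identity > min_identity
            and query_coverage(hit) >= min_query_cov)


def is_near_full_length(hit: BlastHit, min_target_cov: float = 80.0) -> bool:
    """'Full or near-full length': alignment covers >= 80% of the target protein."""
    return target_coverage(hit) >= min_target_cov


# Made-up example hit: 900 of 1000 nt aligned to 300 of 320 aa.
example = BlastHit("unigene_1", "protein_1", 85.0, 1e-50, 1, 900, 1000, 1, 300, 320)
print(passes_annotation_filter(example), is_near_full_length(example))
```

Coverage is computed here from a single alignment span; hits split over several alignment segments would need their spans merged first, which this sketch does not attempt.
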
In addition, conserved ortholog content was assessed using the nematode Benchmarking Universal Single-Copy Orthologs (BUSCOs, v4.0.6). To further predict their functions, the unigenes were annotated against the Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.kegg.jp/) and Gene Ontology (GO, http://www.geneontology.org/) databases on the basis of the Nr and Swiss-Prot BLAST results. For the Nr annotations, the BLAST2GO program was used to assign GO annotations (comprising biological processes, molecular functions, and cellular components) to the unique assembled transcripts [29]. Subsequently, WEGO (Web Gene Ontology Annotation Plot) software was used for GO functional classification, to understand the distribution of gene functions at the macroscopic level [30].
Bowtie2 was then used to map the clean reads from the two libraries (treatment and control) to the de novo assembled transcriptome reference sequences [31]. Based on this mapping, RSEM was used to estimate transcript abundances, calculated as FPKM (fragments per kilobase of transcript per million mapped reads). The FPKM values were analysed further using RSEM [32] and PoissonDis [33] to identify differentially expressed genes (DEGs) between the control and treatment groups [(LL_T24h vs. LL_CK and LL_T2h vs. LL_CK for L. longiflorum) and (LD_T24h vs. LD_CK and LD_T2h vs. LD_CK for L. distichum)]. A false discovery rate (FDR) was used to determine the p-value threshold in multiple tests. Differences were considered significant when the FDR was < 0.05 and the FPKM values showed at least a two-fold difference between the two samples.
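
As a rough illustration of the quantities named above, the sketch below computes a textbook FPKM value, adjusts p-values for multiple testing, and applies the FDR < 0.05 and two-fold criteria. Several points are assumptions: RSEM's own abundance model is more involved than this formula, the text does not say which FDR procedure was used (Benjamini-Hochberg is assumed here), and all function names are hypothetical.

```python
def fpkm(fragments: float, transcript_length_bp: float,
         total_mapped_fragments: float) -> float:
    """Textbook FPKM: fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def fold_change_at_least(fpkm_a: float, fpkm_b: float, min_fold: float = 2.0) -> bool:
    """True if the larger FPKM is at least min_fold times the smaller one."""
    low, high = sorted((fpkm_a, fpkm_b))
    if high == 0:
        return False   # not expressed in either sample
    if low == 0:
        return True    # expressed in only one sample (a choice made for this sketch)
    return high / low >= min_fold


def is_deg(fpkm_control: float, fpkm_treated: float, fdr: float,
           max_fdr: float = 0.05, min_fold: float = 2.0) -> bool:
    """Criteria from the text: FDR < 0.05 and at least a two-fold FPKM difference."""
    return fdr < max_fdr and fold_change_at_least(fpkm_control, fpkm_treated, min_fold)


# Made-up numbers for illustration only.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(fpkm(500, 2000, 10_000_000), adjusted, is_deg(10.0, 40.0, adjusted[0]))
```

Treating a gene with zero FPKM in one sample as passing the fold-change test is a design choice of this sketch; a pseudocount would be an equally defensible alternative.
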
Furthermore, DEGs related to heat-stress responsiveness were clustered by the Neighbour-Joining method and plotted using an in-house R script. These genes were selected using a coefficient-of-variation threshold of 1 across the different samples, and their FPKM values were then log2-transformed.
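
The selection and transformation step might look like the following Python sketch. The original analysis used an R script that is not shown, so everything here is an assumption: the text does not state the direction of the threshold (genes with a coefficient of variation of at least 1 are kept here), the sample standard deviation is used, and a pseudocount of 1 is added before the log2 transform so that zero FPKM values are defined. Clustering is not included.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (0.0 if the mean is 0)."""
    mean = statistics.mean(values)
    if mean == 0:
        return 0.0
    return statistics.stdev(values) / mean


def select_variable_genes(fpkm_by_gene, min_cv=1.0):
    """Keep genes whose CV across samples is >= min_cv (direction is assumed)."""
    return {gene: values for gene, values in fpkm_by_gene.items()
            if coefficient_of_variation(values) >= min_cv}


def log2_transform(values, pseudocount=1.0):
    """log2(FPKM + pseudocount); the pseudocount is an assumption of this sketch."""
    return [math.log2(v + pseudocount) for v in values]


# Made-up FPKM values across four samples; gene names are hypothetical.
fpkm_by_gene = {
    "gene_variable": [1.0, 1.0, 10.0, 100.0],
    "gene_stable": [10.0, 11.0, 9.0, 10.0],
}
selected = select_variable_genes(fpkm_by_gene)
transformed = {gene: log2_transform(values) for gene, values in selected.items()}
print(sorted(transformed))
```
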

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
