Mapping and Analysis of Illumina Reads for Transcriptome of Medicago Truncatula During the Early Organogenesis of the Nodule

下载 PDF 引用 收藏 提问与回复 分享您的反馈 Cited by



Plant Physiology
Jan 2013


Medicago truncatula serves as a model plant for legume genetics and genomics. We used RNA-Seq to characterize the transcriptome during the early organogenesis of the nodule and during its functioning. We generated approximately 135.5 million high-quality 36-bp reads, which were then aligned with the M. truncatula genome sequence (Mt3.0 version) and with sequences from a custom splice-junction database, for the detection of transcribed regions and splicing sites. Mapping and analysis of the reads conducted to the detection of 37,333 expressed transcription units (TUs), 1,670 had never been described before and were functionally annotated. We identified 7,595 new transcribed regions, mostly corresponding to 5’ and 3’ UTR extensions and new exons associated with 5,264 previously annotated genes. We also assessed the complexity in the nodulation transcriptome by performing a Cufflinks analysis to determine the frequency of the various alternatively spliced forms. Thus, we identified 23,164 different transcripts derived from 6,587 genes. Finally, we carried out a differential expression analysis, which provided a comprehensive view of transcriptional reprogramming during nodulation.

All Illumina sequence data have been deposited in the NCBI short-read archive, and Sanger-sequenced PCR products have been deposited in GenBank (SRA048731). Assembled contigs longer than 200 bp have been deposited at TSA (JR366937-JR375780). Coverage data are available at http://ddlab.sci.univr.it/cgi-bin/gbrowse/medicago/.

Keywords: Mapping illumina reads (Illumina读取映射), Medicago transcriptome (苜蓿转录), Early organogenesis (早期器官), Nodule (结节), Medicago truncatula (蒺藜苜蓿)

Materials and Reagents

  1. mRNA-Seq 8 sample prep kit (Illumina, catalog number: RS-100-0801 )
  2. QIAquick Gel Extraction Kit (QIAGEN, catalog number: 28704 )
  3. RNeasy Plant Mini Kit (QIAGEN, catalog number: 74903 )
  4. Standart agarose for electrophoresis (Sigma-Aldrich, catalog number: A9539 )


  1. Bioanalyzer Chip DNA 1000 series II (Agilent)
  2. Gel electrophoresis apparatus
  3. Illumina Genome Analyzer II (Illumina, model: SY-301-1201 )


  1. Bowtie (Langmead et al., 2009)
  2. BEDTools suite (Quinlan and Hall, 2010)
  3. TopHat (Trapnell et al., 2009)
  4. Cufflinks (Roberts et al., 2011)
  5. Velvet (Zerbino and Birney, 2008)
  6. CAP3 (Huang and Madan, 1999)
  7. GMAP (version 2012-04-21) (Wu and Watanabe, 2005)
  8. Cistematic 2.5 (http://cistematic.caltech.edu/)
  9. ERANGE software (3.1) (Mortazavi et al., 2008)
  10. Medicago3.py: script to import Medicago annotation into cystematic (Boscari et al., 2012)
  11. Gff2knowngene.pl: script to convert from General Feature Format (GFF) format to UCSC knowngene format (Zenoni et al., 2010)
  12. CASAVA (Illumina)
  13. GenomeStudio (Illumina)


  1. Poly(A) mRNA was isolated from the extracted RNA to prepare a nondirectional Illumina RNA-Seq library with mRNA-Seq 8 sample prep kit. We modified the gel extraction step by dissolving excised gel slices in QG buffer of QIAquick Gel Extraction Kit at room temperature to avoid under representation of AT-rich sequences.
  2. Quality control and quantification of each library of 200 bp was performed with a Bioanalyzer Chip DNA 1000 series II.
  3. 36 to 44 bp sequences were generated on an Illumina genome analyzer II. A total of 135 million of reads were obtained for the different conditions.
  4. M. truncatula Genome and the Splice Database Sequence alignments were generated with Bowtie (http://bowtie-bio.sourceforge.net).
  5. Alignment of the reads was made on the Mt3.0 version of the M. truncatula genome sequence (www.medicago.org). For our analysis we allowed up to 2 mismatches, and sequences that matched with more than 10 different loci were discarded. Genome index is built with command “bowtie-build –f medicago.fasta genome” where medicago.fasta is the complete Mt3.0 fasta file. Reads are then aligned with command “bowtie -v 2 –m 10 –k 11 -S genome reads.fastq output.sam”, where reads.fastq are the raw sequencing reads of a sample, and the output.sam was processed using the software suite BedTools (http://code.google.com/p/bedtools/) in order to assign each read to an exon, intron, UTR, or intergenic region. Reads mapped onto external exons fell within a 3-kb catchment from both ends of a gene, promoting the investigation of putative undiscovered exons. Intergenic reads represented those sequence reads that fell outside this catchment. The program ERANGE defined potentially novel clusters of expression on the basis of their alignment; they were categorized as novel sections (exons/UTRs) of a known gene if they fell inside a radius of 3,000 bps from them (average gene density/2). The remaining expressed clusters were marked as potential new genes.
  6. In order to identify potential new isoforms of known genes, we remapped all reads against the M. truncatula genome using TopHat (http://tophat.cbcb.umd.edu/) with a segment length of 16 due to the short length of our reads, and defined the new isoforms of known genes performing a Cufflinks (http://cufflinks.cbcb.umd.edu/) analysis on each sample, with standard parameters, followed by an analysis with Cuffcompare to merge transcripts identified on different samples. We used the latest genome sequence and annotations provided by the Medicago research community (Mt3.5, http://www.medicago.org/).
  7. To identify novel transcribed regions, we used the reads which had not been mapped against the Mt3.0 sequence from every sample to assemble separately our contigs, using the Velvet program (http://www.ebi.ac.uk/~zerbino/velvet/), using a sensitive hash length of 29 for the reads with a length of 44 bps and of 21 for the rest. The contigs were subsequently clustered together using the software CAP3 (http://bioinformatics.ca/links_directory/tool/9319/cap3-sequence-assembly-program), with a minimum overlap of 90%, requiring an overlap identity of 80%. Contigs mapping against the reference genome with identity ≥ 90% and coverage ≥ 90% after BLAT alignment were discarded from further analysis. All the contigs were also mapped against the accompanying RNA-Seq data of the Mt3.5 version with GMAP (version 2012-04-21). The contigs, with an alignment coverage on the sequence length >= 90% and on the identity >= 90%, were merged together using the program mergeBed from the BEDTools suite (http://code.google.com/p/bedtools/).
  8. The evaluation of gene expression was performed with the ERANGE software (3.1), available at http://woldlab.caltech.edu/RNA-Seq. ERANGE requires Cistematic 2.5 to execute RunStandardanalysis.sh. Therefore, a Python script (medicago3.py) was developed to import M. truncatula reference sequence (Mt3.0) and annotation in General Feature Format (GFF) into Cistematics Genomes sqlite database, and a Perl script (gff2knowngene.pl) was used to convert the GFF annotation file to the knowngene.txt file used by RunStandardanalysis.sh. ERANGE reports the number of mapped reads per kilobase of exon per million mapped reads, measuring the transcriptional activity for each gene. To obtain an accurate measure of gene expression not biased by reads mapping to splice junctions in genes with many introns, ERANGE considers both reads mapping to genome or to the custom splice junctions database. ERANGE was preferred over commercial packages such as CASAVA and GenomeStudio platform from Illumina because of its open nature. This allowed us to adapt and reuse code for our own analysis with greater flexibility than a comparable closed source commercial package.
  9. Differential Gene Expression Statistic for RNA-Seq. ERANGE software computes the normalized gene locus expression level (named RPKM) by assigning reads to their site of origin and counting them. In the case of reads that match equally well to several sites, ERANGE assigns them proportionally to their most likely site(s) of origin (Mortazavi et al., 2008). The RPKM value for a given gene locus can be estimated as follows:
    RPKM = N/(L * NTot) * 109
    Where N = number of mapping reads at a given gene locus, L = estimated length (bp) of the gene locus, NTot = number of total mapping reads, and 109 is correspond to 1,000 bp transcript multiple 1 million reads. The null hypothesis of no differential gene expression for each gene was tested using the R package qvalue (Storey, 2002; Storey and Tibshirani, 2003; Dabney et al., 2010) on the R working environment. False Discovery Rates were calculated based on p-values obtained running a t test on the raw read counts using the basic R package stats.
    Variance = (1/RPKM1) + (1/RPKM2)
    Stat = (log(RPKM1/NTot1) - log(RPKM2/NTot2))/√(Variance)
    p.value = (1-pnorm(abs(Stat))) * 2
    The threshold value for the FDR was 0.001 and genes were first selected using this filter.
    Differentially expressed genes were then filtered again based on a Fold Change (FC) > 2.


This protocol is adapted from Boscari et al. (2013).


  1. Boscari, A., Del Giudice, J., Ferrarini, A., Venturini, L., Zaffini, A. L., Delledonne, M. and Puppo, A. (2013). Expression dynamics of the Medicago truncatula transcriptome during the symbiotic interaction with Sinorhizobium meliloti: which role for nitric oxide? Plant Physiol 161(1): 425-439.
  2. Huang, X. and Madan, A. (1999). CAP3: A DNA sequence assembly program. Genome Res 9(9): 868-877.
  3. Langmead, B., Trapnell, C., Pop, M. and Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3): R25.
  4. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5(7): 621-628.
  5. Quinlan, A. R. and Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6): 841-842.
  6. Roberts, A., Pimentel, H., Trapnell, C. and Pachter, L. (2011). Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics 27(17): 2325-2329.
  7. Trapnell, C., Pachter, L. and Salzberg, S. L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9): 1105-1111.
  8. Wu, T. D. and Watanabe, C. K. (2005). GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21(9): 1859-1875.
  9. Zenoni, S., Ferrarini, A., Giacomelli, E., Xumerle, L., Fasoli, M., Malerba, G., Bellin, D., Pezzotti, M. and Delledonne, M. (2010). Characterization of transcriptional complexity during berry development in Vitis vinifera using RNA-Seq. Plant Physiol 152(4): 1787-1795.
  10. Zerbino, D. R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18(5): 821-829.


Medic属truncatula 是豆科植物遗传学和基因组学的模型植物。我们使用RNA Seq表征转录组在结节的早期器官发生期间和其功能期间。我们产生了大约135.5百万高质量的36-bp读数,然后与M比对。 truncatula基因组序列(Mt3.0版本)和来自定制剪接结数据库的序列,用于检测转录区和剪接位点。对进行到37,333个表达的转录单位(TU)的检测的读取的映射和分析,1,670以前从未被描述过并且被功能性注释。我们确定7,595新转录区,大部分对应于5'和3'UTR扩展和新的外显子与5,264以前注释的基因相关联。我们还通过进行Cufflinks分析以确定各种可变剪接形式的频率来评估结瘤转录组中的复杂性。因此,我们确定23,164不同转录物来自6,587基因。最后,我们进行了差异表达分析,其提供了在结瘤期间转录重编程的全面视图。
所有Illumina序列数据已被保存在NCBI短读档案中,Sanger测序PCR产物已被保藏GenBank(SRA048731)。长于200bp的装配的重叠群已经保藏在TSA(JR366937-JR375780)。覆盖率数据可在 http://ddlab.sci.univr.it/cgi -bin/gbrowse/medicago/

关键字:Illumina读取映射, 苜蓿转录, 早期器官, 结节, 蒺藜苜蓿


  1. mRNA-Seq 8样品制备试剂盒(Illumina,目录号:RS-100-0801)
  2. QIAquick凝胶提取试剂盒(QIAGEN,目录号:28704)
  3. RNeasy Plant Mini Kit(QIAGEN,目录号:74903)
  4. 用于电泳的标准琼脂糖(Sigma-Aldrich,目录号:A9539)


  1. 生物分析芯片DNA 1000系列II(安捷伦)
  2. 凝胶电泳仪
  3. Illumina基因组分析仪II(Illumina,型号:SY-301-1201)


  1. Bowtie(Langmead等人,,2009)
  2. BEDTools套件(Quinlan和Hall,2010)
  3. TopHat(Trapnell等人,,2009)
  4. 袖扣(Roberts等人,2011)
  5. Velvet(Zerbino和Birney,2008)
  6. CAP3(Huang and Madan,1999)
  7. GMAP(版本2012-04-21)(Wu和Watanabe,2005)
  8. Cistematic 2.5( http://cistematic.caltech.edu/
  9. ERANGE软件(3.1)(Mortazavi等人,2008)
  10. Medicago3.py:将Medicago注释导入cystematic的脚本(Boscari等人,2012)
  11. Gff2knowngene.pl:将通用要素格式(GFF)格式转换为UCSC已知基因格式的脚本(Zenoni等人,2010年)
  12. CASAVA(Illumina)
  13. GenomeStudio(Illumina)


  1. 从提取的RNA中分离Poly(A)mRNA,以制备具有mRNA-Seq 8样品制备试剂盒的无方向Illumina RNA-Seq文库。我们通过在室温下将切下的凝胶切片溶解在QIAquick凝胶提取试剂盒的QG缓冲液中来修改凝胶提取步骤,以避免在富含AT序列的表现下。
  2. 使用Bioanalyzer Chip DNA 1000系列II进行200bp的每个文库的质量控制和定量
  3. 在Illumina基因组分析仪II上产生36至44bp序列。在不同条件下获得了总计1.35亿的读数
  4. M。 truncatula 基因组和剪接数据库序列比对是使用Bowtie生成的( http://bowtie-bio.sourceforge .net )。
  5. 在Mt的3.0版本上进行读取的比对。 truncatula 基因组序列( www.medicago.org )。对于我们的分析,我们允许最多2个错配,并且与超过10个不同基因座匹配的序列被丢弃。基因组索引是用命令"bowtie-build -f medicago.fasta基因组"构建的,其中medicago.fasta是完整的Mt3.0 fasta文件。然后读取与 命令"bowtie -v 2 -m 10 -k 11 -S genome reads.fastq output.sam",其中reads.fastq是样品的原始测序读数,output.sam使用软件套件BedTools(< a target ="_ blank"href ="http://code.google.com/p/bedtools/"> http://code.google.com/p/bedtools/),以便分配每个阅读到外显子,内含子,UTR或基因间区。映射到外部外显子上的读数落在来自基因两端的3-kb流域内,促进对推定的未发现的外显子的研究。基因间读数表示落在该流域外的那些序列读数。程序ERANGE基于它们的比对定义潜在的新的表达簇;如果它们落入距离它们的3,000bp半径内(平均基因密度/2),则它们被归类为已知基因的新区段(外显子/UTR)。剩余的表达簇被标记为潜在的新基因
  6. 为了鉴定已知基因的潜在的新同种型,我们将所有读取重新映射到m。 truncatula 基因组( http://tophat.cbcb.umd.edu/)由于我们的读取的短的长度,片段长度为16,并定义了已知基因的新同种型执行Cufflinks( http://cufflinks.cbcb.umd.edu/)分析每个样品,标准参数,然后用Cuffcompare分析合并不同样品上确定的转录物。我们使用了由Medicago研究机构提供的最新基因组序列和注释(Mt3.5, http://www.medicago.org)/)。
  7. 为了鉴定新的转录区,我们使用未针对来自每个样品的Mt3.0序列进行作图的读取,使用Velvet程序(),使用敏感的散列长度为29来读取长度的44 bps和其余的21。使用软件CAP3( http://生物信息学)将重叠群随后聚类在一起。 ca/links_directory/tool/9319/cap3-sequence-assembly-program ),具有90%的最小重叠,需要80%的重叠身份。从进一步分析中丢弃BLAT比对后相对于参照基因组的同一性≥90%和覆盖≥90%的重叠群。所有的重叠群也对照伴随的具有GMAP的Mt3.5版本的RNA-Seq数据(版本2012-04-21)。使用来自BEDTools套件的程序mergeBed(http://code.google.com/p/bedtools/)。
  8. 基因表达的评价用ERANGE软件(3.1)进行,可在 http://woldlab.caltech获得。 edu/RNA-Seq 。 ERANGE需要Cistematic 2.5来执行RunStandardanalysis.sh。因此,开发了一个Python脚本(medicago3.py)来导入 M。 truncatula 参考序列(Mt3.0)和通用特征格式(GFF)中的注释到Cistematics Genomes sqlite数据库中,并且使用Perl脚本(gff2knowngene.pl)将GFF注释文件转换为knowngene.txt文件使用RunStandardanalysis.sh。 ERANGE报告每百万个映射读取的每千碱基外显子的定位读取的数目,测量每个基因的转录活性。为了获得基因表达的精确测量,不偏向映射到具有许多内含子的基因中的剪接点的读数,ERANGE考虑了与基因组或定制剪接点数据库映射的两个读数。 ERANGE比商业包装更受欢迎,例如来自Illumina的CASAVA和GenomeStudio平台,因为其开放性。这使我们能够适应和重用代码,用于我们自己的分析,具有比类似的封闭源商业软件包更大的灵活性
  9. 差异基因表达统计RNA-Seq。 ERANGE软件通过将读数分配给它们的原位并对它们计数来计算标准化的基因座位表达水平(称为RPKM)。在读取与几个位点同样良好的情况下,ERANGE将它们与它们最可能的起始位点成比例地分配(Mortazavi等人,2008)。给定基因座的RPKM值可以如下估计:
    RPKM = N /(L * N <10> 9
    其中N =给定基因座位处的作图读数数目,L =基因座位的估计长度(bp),N sub =总映射读数数目,以及10 对应于1,000bp转录物多个1百万读数。使用R包q值(Storey,2002; Storey和Tibshirani,2003; Dabney等人,2010)对R工作环境测试每个基因无差异基因表达的零假设。基于使用基本R包统计对原始读取计数进行t检验所获得的p值计算假发现率。
    Stat =(log(RPKM1/N p.value =(1-pnorm(abs(Stat)))* 2
    FDR的阈值为0.001,并且首先使用该过滤器选择基因 然后基于倍数变化(FC)再次过滤差异表达的基因。 2.




  1. Boscari,A.,Del Giudice,J.,Ferrarini,A.,Venturini,L.,Zaffini,A.L.,Delledonne,M.and Puppo,A。(2013)。 苜蓿藜转录组在与中华根瘤菌共生相互作用期间的表达动力学 :一氧化氮的作用?植物生理学 161(1):425-439。
  2. Huang,X。和Madan,A。(1999)。 CAP3:DNA序列装配程序 Genome Res 9(9):868-877
  3. Langmead,B.,Trapnell,C.,Pop,M。和Salzberg,S.L。(2009)。 短DNA序列与人类基因组的超快速和记忆效率的比对 Genome Biol 10(3):R25。
  4. Mortazavi,A.,Williams,B.A.,McCue,K.,Schaeffer,L.and Wold,B.(2008)。 通过RNA-Seq映射和定量哺乳动物转录组 em> 5(7):621-628。
  5. Quinlan,A.R。和Hall,I.M。(2010)。 BEDTools:用于比较基因组特征的灵活的实用程序套件。生物信息学 26(6):841-842。
  6. Roberts,A.,Pimentel,H.,Trapnell,C.and Pachter,L。(2011)。 使用RNA-Seq在注释的基因组中鉴定新的转录物 /em> 27(17):2325-2329。
  7. Trapnell,C.,Pachter,L。和Salzberg,S.L。(2009)。 TopHat:使用RNA-Seq发现剪接点 生物信息学 25(9):1105-1111。
  8. Wu,T.D.and Watanabe,C.K。(2005)。 GMAP:mRNA和EST序列的基因组作图和比对程序。 Bioinformatics 21(9):1859-1875。
  9. Zenoni,S。,Ferrarini,A.,Giacomelli,E.,Xumerle,L.,Fasoli,M.,Malerba,G.,Bellin,D.,Pezzotti,M.and Delledonne, 使用RNA-Seq在葡萄酿酒酵母发育过程中转录复杂性的表征。 Plant Physiol 152(4):1787-1795。
  10. Zerbino,D.R。和Birney,E。(2008)。 天鹅绒:使用de Bruijn图表进行短时读取汇编的算法。 Genome Res 18(5):821-829。
  • English
  • 中文翻译
免责声明 × 为了向广大用户提供经翻译的内容,www.bio-protocol.org 采用人工翻译与计算机翻译结合的技术翻译了本文章。基于计算机的翻译质量再高,也不及 100% 的人工翻译的质量。为此,我们始终建议用户参考原始英文版本。 Bio-protocol., LLC对翻译版本的准确性不承担任何责任。
Copyright: © 2013 The Authors; exclusive licensee Bio-protocol LLC.
引用: Readers should cite both the Bio-protocol article and the original research article where this protocol was used:
  1. Boscari, A., Ferrarini, A., Giudice, J. d., Venturini, L., Delledone, M. and Puppo, A. (2013). Mapping and Analysis of Illumina Reads for Transcriptome of Medicago Truncatula During the Early Organogenesis of the Nodule. Bio-protocol 3(22): e966. DOI: 10.21769/BioProtoc.966.
  2. Boscari, A., Del Giudice, J., Ferrarini, A., Venturini, L., Zaffini, A. L., Delledonne, M. and Puppo, A. (2013). Expression dynamics of the Medicago truncatula transcriptome during the symbiotic interaction with Sinorhizobium meliloti: which role for nitric oxide? Plant Physiol 161(1): 425-439.