Experimental Pipeline for SNP and SSR Discovery and Genotyping Analysis of Mango (Mangifera indica L.)




BMC Plant Biology
Dec 2015


Establishing a reservoir of polymorphic markers is key to marker-assisted breeding, yet many crops still lack such genomic infrastructure. Single nucleotide polymorphisms (SNPs) and simple sequence repeats (SSRs) are useful as markers because they are widespread across the genome and many technologies have been developed for genotyping them at high throughput. We present here a pipeline for developing a reservoir of SNP and SSR markers for Mangifera indica L., as an example of a fruit tree crop with no genomic information available. The pipeline includes (i) de novo assembly of a reference transcriptome with MIRA and CAP3, based on reads produced by 454-GS FLX technology; (ii) discovery of polymorphic loci by aligning Illumina resequencing reads to the transcriptome reference; and (iii) identification of a subset of loci that are polymorphic across the entire germplasm collection, for downstream diversity analysis by genotyping with Fluidigm technology.

Keywords: SNP discovery, Diversity, Marker-assisted selection, SSR


Background

Considerations of high-throughput sequencing: This pipeline does not include RNA/DNA extraction or other molecular biology lab protocols for next generation sequencing (NGS), because NGS is commonly outsourced. It therefore covers DNA preparation for genotyping only. Before describing the pipeline, we comment on the considerations regarding sequencing.

Assumption: In this pipeline, we assume a non-model organism with no genomic infrastructure at all. Marker discovery requires a reference sequence, plus resequencing data in which to detect polymorphisms. The ultimate reference is a genome. However, because a good draft or complete reference genome is still expensive to produce, we recommend sequencing a reference transcriptome from a pool of tissues. Pooling tissues compensates for the unequal gene representation that results from tissue-specific expression.

Technology: For sequencing the reference transcriptome, 454-GS FLX Titanium or any other long-read NGS technology is preferred. For marker discovery by resequencing, a pool of genomic DNA (gDNA) from the population under study is a cost-effective solution. Polymorphic loci in such a pool are a representative sample of the polymorphic loci in the population. The important factor here is read depth, which should aim for an average coverage of 50x and be no less than 20x. For large genomes, gDNA resequencing to 50x coverage might be too expensive. In that case, sequencing mRNA extracted from a pool of tissues and individuals of the population is a cheaper option.
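The coverage targets above translate into a sequencing budget with simple arithmetic: mean depth is roughly the total number of sequenced bases divided by the size of the target (genome or transcriptome). The sketch below illustrates this; the function names and the example figures (a 450 Mb target and 100 bp reads) are illustrative assumptions, not values taken from this protocol.

```python
import math


def mean_coverage(n_reads: int, read_length: int, target_size: int) -> float:
    """Expected mean depth = total sequenced bases / target size (in bases)."""
    return n_reads * read_length / target_size


def reads_needed(coverage: float, read_length: int, target_size: int) -> int:
    """Number of reads required to reach a given mean depth."""
    return math.ceil(coverage * target_size / read_length)


# Hypothetical example values (assumptions, not from the protocol).
TARGET_SIZE = 450_000_000  # bases in the sequenced target
READ_LENGTH = 100          # bases per read

print("reads for 50x:", reads_needed(50, READ_LENGTH, TARGET_SIZE))
print("reads for 20x:", reads_needed(20, READ_LENGTH, TARGET_SIZE))
```

This ignores reads lost to trimming and to failed alignment, so a real order should include a safety margin.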

The aim of this protocol is to provide a pipeline (Figure 1) for the bioinformatics and genomics support unit that assists the breeder of a crop lacking genomic information in establishing a set of polymorphic SNP and SSR markers. This set can be used for marker-assisted breeding studies as well as for exploring the diversity of the crop’s germplasm collection.

Figure 1. Flowchart of a pipeline for marker discovery. The reference transcriptome here (represented as a database shape) is the link connecting function annotation with genetic variation.

Materials and Reagents

  1. 50 ml Falcon tube
  2. Young leaf tissue
  3. Tris (Amresco, catalog number: 77-86-1 )
  4. EDTA (Sigma-Aldrich, catalog number: E5134 )
  5. NaCl (Sigma-Aldrich, catalog number: S3014 )
  6. 3% CTAB (hexadecyltrimethylammonium bromide) (Sigma-Aldrich, catalog number: H5882 )
  7. 2% polyvinylpyrrolidone (PVP) (MW 40,000) (Amresco, catalog number: 9003-39-8 )
  8. 1% β-mercaptoethanol (Sigma-Aldrich, catalog number: M3148 )
  9. 5 M ammonium acetate (Sigma-Aldrich, catalog number: A1542 )
  10. Chloroform:isoamyl alcohol mix [24:1 (v:v)]
  11. Isopropanol (stored at -20 °C)
  12. Ethanol
  13. RNase A (> 70 Kunit/mg protein, > 20 mg protein/ml) (Sigma-Aldrich, catalog number: R4642 )
  14. Extraction buffer (see Recipes)
  15. TE buffer (see Recipes)


Equipment

  1. 65 °C water bath
  2. 37 °C water bath/block
  3. IKA-A11 analytical grinding mill (IKA®-Werke GmbH & Co. KG)
  4. Cooled centrifuge (Sorvall RC5plus) with Fixed Angle Rotor (Fiberlite™ F13-14 x 50cy) (Thermo Fisher Scientific, catalog number: 096-1450 )
  5. Agarose gel apparatus
  6. Nanodrop spectrophotometer
  7. Recommended hardware specifications (for bioinformatics pipeline)
    1. CPU
      Architecture: x86_64
      CPU op-mode(s): 64-bit, 8 cores, Thread(s) per core: 2
      Vendor ID: GenuineIntel
      CPU MHz: 1596.000
    2. Memory
      MemTotal: 48 GB
      SwapTotal: 4GB


Software

  1. “Sff_extract” (https://bioinf.comav.upv.es/sff_extract/) – Converts 454-GS FLX raw files to text formats (fasta and quality) and preprocesses them, e.g., adapter removal and base-call clipping.
    Note: Sff_extract is now part of the tool set seq_crumbs (https://bioinf.comav.upv.es/seq_crumbs/)
  2. MIRA (https://sourceforge.net/projects/mira-assembler/) – A multi-pass DNA sequence data assembler/mapper for whole genome and/or transcriptome projects. MIRA is a multi-platform assembler capable of assembling reads from a combination of platforms or from each platform separately.
  3. CAP3 (http://seq.cs.iastate.edu/cap3.html) – CAP3 is for small-scale assembly of sequences with or without quality values.
  4. Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic) – Trimmomatic is a fast, multi-threaded command line tool that can be used to trim and crop Illumina (FASTQ) data as well as to remove adapters.
  5. FASTX (http://hannonlab.cshl.edu/fastx_toolkit/) – Preprocessing of short reads (fastq files), e.g., adapter removal and base-call clipping.
    Note: FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.
  6. Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) – Alignment of short reads to a reference genome/transcriptome.
  7. Samtools (http://www.htslib.org/) – SAM (Sequence Alignment/Map) is a generic format for storing large nucleotide sequence alignments. Samtools provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
  8. Getorf (http://emboss.sourceforge.net/download/) – EMBOSS tool for identifying open reading frames (ORFs) in mRNA sequences.
  9. MIcroSAtellite (MISA) identification tool (http://pgrc.ipk-gatersleben.de/misa) – This tool allows the identification and localization of perfect microsatellites as well as compound microsatellites which are interrupted by a certain number of bases.
  10. SciRoKo (http://kofler.or.at/bioinformatics/SciRoKo/index.html) – A tool for fast whole-genome microsatellite search. For example, the whole rice genome may be searched in 55 sec.
  11. VarScan (http://varscan.sourceforge.net/) – VarScan is a platform-independent mutation caller for targeted, exome, and whole-genome resequencing data generated on Illumina, SOLiD, Life/PGM, Roche/454, and similar instruments.

Data analysis

  1. Data
    1. 454-GS FLX Titanium mRNA contigs were deposited in the Transcriptome Shotgun Assembly (TSA) repository of NCBI: accession No. GBJO00000000 (Sherman et al., 2015).
    2. Illumina short reads were deposited in the Sequence Read Archive (SRA) of NCBI: experiment accession No. SRX651793 (Sherman et al., 2015).

  2. De novo transcriptome assembly
    1. Raw sequence reads from the 454-GS FLX Titanium platform were pre-processed with “Sff_extract” (https://bioinf.comav.upv.es/sff_extract/), using its arguments for removing adaptors and clipping poly-A tails.

    2. De novo assembly with MIRA 3.2 (Chevreux et al., 2004)

    3. Reduction of contig variability (merging transcript variants) by running Cap3 and creating super-contigs
      Note: Cap3 is downloaded separately from MIRA (see Software list section).

      Note: mango.fasta and mango.qual are output files of MIRA, created in the mango_d_results directory.

    4. Filtering out contigs with length less than 200 bp
      Refer to the fasta file from here on as reference.transcriptome.contigs.fasta
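The length filter in step 4 can be done with a few lines of plain Python. This is a minimal sketch, not the authors' script: the helper names are hypothetical, and the input file name in the usage note is an assumption (only reference.transcriptome.contigs.fasta is named in the protocol).

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(handle: TextIO) -> Iterator[Record]:
    """Yield (header, sequence) pairs from a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Record], min_len: int = 200) -> Iterator[Record]:
    """Keep contigs of at least min_len bases, i.e., drop contigs shorter than 200 bp."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq


def write_fasta(records: Iterable[Record], handle: TextIO, width: int = 60) -> None:
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Small in-memory demonstration with made-up contigs.
demo = io.StringIO(">short\n" + "A" * 150 + "\n>long\n" + "C" * 120 + "\n" + "G" * 120 + "\n")
out = io.StringIO()
write_fasta(filter_by_length(read_fasta(demo)), out)
print(out.getvalue().splitlines()[0])
```

On real data, the same three functions would be applied to the CAP3 output (for example a hypothetical supercontigs.fasta opened for reading) and written to reference.transcriptome.contigs.fasta.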

  3. Functional annotation 
    1. Identifying the coding region in order to annotate each marker's position as inside or outside the coding sequence. Open reading frames (ORFs) were found with the “getorf” program of the EMBOSS package (Rice et al., 2000). For each contig, the longest ORF with start and stop codons (-find 1) was chosen, with a minimum cutoff of 50 amino acids (-minsize 150).
      Note: The -minsize argument is given in nucleotides (50 amino acids x 3 = 150 nt).
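getorf reports every ORF that passes the size cutoff, so choosing the longest ORF per contig is a separate post-processing step. The sketch below shows one way to do it. It assumes that the getorf FASTA headers have the form `contigname_N [start - end]` (with start greater than end for reverse-strand ORFs) and that the records have already been parsed into (header, protein) pairs; both the header pattern and the function name are assumptions to verify against your own output.

```python
import re
from typing import Dict, Iterable, Tuple

# Assumed getorf header layout: "<contig>_<orf number> [<start> - <end>] ..."
HEADER_RE = re.compile(r"^(?P<contig>\S+)_(?P<n>\d+) \[(?P<start>\d+) - (?P<end>\d+)\]")


def longest_orf_per_contig(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Tuple[int, int, str]]:
    """Map each contig to (start, end, protein) of its longest ORF.

    `records` are (header, protein_sequence) pairs. On ties, the ORF seen
    first is kept. Headers that do not match the assumed pattern are skipped.
    """
    best: Dict[str, Tuple[int, int, str]] = {}
    for header, protein in records:
        match = HEADER_RE.match(header)
        if match is None:
            continue
        contig = match.group("contig")
        start, end = int(match.group("start")), int(match.group("end"))
        if contig not in best or len(protein) > len(best[contig][2]):
            best[contig] = (start, end, protein)
    return best


# Made-up records for illustration only.
demo_records = [
    ("c1_1 [1 - 150]", "M" * 50),
    ("c1_2 [400 - 200] (REVERSE SENSE)", "M" * 67),
    ("c2_1 [10 - 189]", "M" * 60),
]
for contig, (start, end, protein) in longest_orf_per_contig(demo_records).items():
    print(contig, start, end, len(protein))
```

The stored start and end coordinates are what later allow a SNP position to be labelled as inside or outside the coding sequence.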

    2. Annotation of the reference transcriptome contigs with Blast2GO (Gotz et al., 2008), to connect variability with functionality.
      Blast2GO GUI options:
      1. Start → load sequences (e.g., fasta)
      2. Blast → Run Blast Description Annotator
      3. Mapping → Run mapping
      4. Annot → Run annotation
      5. InterPro → Run interproscan

  4. SNP and SSR discovery
    Adapter removal and low-quality base pairs clipping are performed by Trimmomatic (Bolger et al., 2014) and FASTX (http://hannonlab.cshl.edu/fastx_toolkit/).
    Note: Optional but highly recommended if the alignment is performed on RNA-Seq.

    1. Use R1 and R2 of trimmed pair files, i.e., sample_name_pair_L001_R1.fastq, sample_name_pair_L001_R2.fastq for downstream analysis.
    2. Alignment of the Illumina HiSeq-2000 resequencing reads to the transcriptome reference with bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml).
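The exact bowtie2 arguments used by the authors are not given in this text. As a hedged illustration, the sketch below only assembles the two basic command lines (index building and paired-end alignment) from standard bowtie2 options (`-x`, `-1`, `-2`, `-S`, `-p`). The index prefix, the output file name and the use of default alignment settings are assumptions.

```python
import shlex
from typing import List


def bowtie2_commands(
    reference_fasta: str,
    index_prefix: str,
    r1_fastq: str,
    r2_fastq: str,
    sam_out: str,
    threads: int = 8,
) -> List[List[str]]:
    """Return the bowtie2-build and bowtie2 command lines as argument lists."""
    build = ["bowtie2-build", reference_fasta, index_prefix]
    align = [
        "bowtie2",
        "-p", str(threads),   # number of alignment threads
        "-x", index_prefix,   # index built from the reference transcriptome
        "-1", r1_fastq,       # trimmed R1 reads
        "-2", r2_fastq,       # trimmed R2 reads
        "-S", sam_out,        # SAM output file
    ]
    return [build, align]


commands = bowtie2_commands(
    "reference.transcriptome.contigs.fasta",
    "mango_ref",                          # hypothetical index prefix
    "sample_name_pair_L001_R1.fastq",
    "sample_name_pair_L001_R2.fastq",
    "sample_name.sam",                    # hypothetical output name
)
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where bowtie2 is installed. Building the argument lists separately makes the commands easy to log and review before running them.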

    3. Running samtools (http://www.htslib.org/) and VarScan (Koboldt et al., 2009) for SNP discovery
      Note: The criteria for selecting an SNP subset are dependent on the project. However, a few criteria are advisable to ensure confident SNP loci in any project:
      1. No other SNPs in the flanking regions (100 bp on each side), to enable primer design for further analyses (the absolute difference between the SNP's value in the ‘Position’ column and both the previous and the next values should be > 100).
      2. Only one SNP per reference-transcriptome contig (unique value at ‘Chrom’ column).
      3. Bi-allelic confidence (‘SamplesHet’ value > 0).
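The three criteria above can be applied to the SNP table programmatically. The sketch below is one possible reading, not the authors' script: it takes rows with the ‘Chrom’, ‘Position’ and ‘SamplesHet’ columns named above, evaluates the flanking distance against all called SNPs on the same contig, and keeps the first qualifying SNP per contig. Choosing the first qualifying SNP is an assumption; a project could equally rank candidates by coverage or quality.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def select_snps(rows: Iterable[Dict[str, str]], flank: int = 100) -> List[Dict[str, str]]:
    """Return at most one SNP per contig that satisfies the three criteria."""
    by_contig: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for row in rows:
        by_contig[row["Chrom"]].append(row)

    selected = []
    for snps in by_contig.values():
        snps.sort(key=lambda r: int(r["Position"]))
        positions = [int(r["Position"]) for r in snps]
        for i, snp in enumerate(snps):
            # Criterion 1: nearest neighbours must be more than `flank` bp away.
            prev_ok = i == 0 or positions[i] - positions[i - 1] > flank
            next_ok = i == len(snps) - 1 or positions[i + 1] - positions[i] > flank
            # Criterion 3: at least one heterozygous sample.
            het_ok = int(snp["SamplesHet"]) > 0
            if prev_ok and next_ok and het_ok:
                selected.append(snp)
                break  # Criterion 2: only one SNP per contig.
    return selected


# Made-up rows for illustration only.
demo_rows = [
    {"Chrom": "c1", "Position": "50", "SamplesHet": "1"},
    {"Chrom": "c1", "Position": "120", "SamplesHet": "1"},
    {"Chrom": "c1", "Position": "400", "SamplesHet": "1"},
    {"Chrom": "c2", "Position": "10", "SamplesHet": "0"},
    {"Chrom": "c2", "Position": "500", "SamplesHet": "2"},
]
for snp in select_snps(demo_rows):
    print(snp["Chrom"], snp["Position"])
```

Rows in this shape can be produced by reading a tab-delimited variant table with `csv.DictReader`. Note that the sketch does not check that a SNP lies at least 100 bp from the contig ends, which primer design may also require.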

    4. SSR discovery within the contigs of the reference transcriptome.
      MIcroSAtellite (MISA) identification tool (http://pgrc.ipk-gatersleben.de/misa) and SciRoKo (Kofler et al., 2007) are run with default parameters.

    5. Find the intersection between the two SSR tables (MISA and SciRoKo output) by importing them into MS-Access or SQLite and running an SQL inner join on contig name, motif, and start position.
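The inner join described in this step can be run with Python's built-in sqlite3 module. The sketch below assumes both tool outputs have already been reduced to three columns (contig, motif, start); the table and column names are illustrative, and the native MISA and SciRoKo column headers differ and must be mapped first.

```python
import sqlite3
from typing import Iterable, List, Tuple

SsrRow = Tuple[str, str, int]  # (contig, motif, start position)


def ssr_intersection(misa_rows: Iterable[SsrRow], sciroko_rows: Iterable[SsrRow]) -> List[SsrRow]:
    """Return SSRs reported by both tools, matched on contig, motif and start."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE misa (contig TEXT, motif TEXT, start INTEGER)")
        con.execute("CREATE TABLE sciroko (contig TEXT, motif TEXT, start INTEGER)")
        con.executemany("INSERT INTO misa VALUES (?, ?, ?)", misa_rows)
        con.executemany("INSERT INTO sciroko VALUES (?, ?, ?)", sciroko_rows)
        query = """
            SELECT m.contig, m.motif, m.start
            FROM misa AS m
            INNER JOIN sciroko AS s
              ON m.contig = s.contig
             AND m.motif = s.motif
             AND m.start = s.start
            ORDER BY m.contig, m.start
        """
        return con.execute(query).fetchall()
    finally:
        con.close()


# Made-up rows for illustration only.
misa = [("c1", "AG", 10), ("c2", "ATT", 55)]
sciroko = [("c1", "AG", 10), ("c2", "ATT", 60)]
print(ssr_intersection(misa, sciroko))
```

An exact match on motif and start position is strict: the two tools may report the same repeat with a rotated motif (e.g., AG vs. GA) or a start offset by a base, so normalizing motifs before the join may be worthwhile.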

    6. Genotyping with Fluidigm
      1. Large-scale extraction of genomic DNA from young leaves for sample genotyping
        1. Young developing Mango (Mangifera indica L.) leaves were collected from the orchard, frozen in liquid nitrogen and stored at -80 °C until used.
        2. β-mercaptoethanol was added to extraction buffer which was pre-heated to 65 °C in a pre-warmed water bath.
        3. 2 g of young leaf tissue was ground to a fine powder using IKA-A11 analytical grinding mill or with a mortar and pestle.
        4. Ground tissue was transferred to a 50 ml Falcon tube and extracted with 15 ml pre-warmed extraction buffer. Extraction was performed by incubation for 30 min at 65 °C, with occasional mixing of the tube.
        5. 15 ml of chloroform:isoamyl alcohol mix (24:1, v:v) was added to tubes. Samples were mixed and centrifuged at 17,000 x g for 10 min at 4 °C.
        6. The aqueous phase was transferred to a new 50 ml tube, and re-extracted with 15 ml of chloroform:isoamyl alcohol mix (24:1, v:v). Centrifugation was performed as above.
        7. The aqueous phase was transferred to a new tube. 1 volume of ice-cold isopropanol was added, tubes were mixed and DNA was precipitated by centrifugation at 17,000 x g for 20 min at 4 °C.
        8. Supernatant was carefully disposed of. Pellet was washed with 70% ice-cold ethanol, and centrifuged at 17,000 x g for 10 min at 4 °C.
        9. Supernatant was carefully disposed of. Pellet was left to dry at room temperature (until it turned translucent), and suspended in 3 ml of TE buffer.
        10. DNA solution was treated with 3 μl RNase A, and incubated for 30 min at 37 °C.
        11. DNA was precipitated by adding 1/10 volume of 5 M ammonium acetate, and 2/5 volumes of cold 100% ethanol. Tubes were mixed and centrifuged at 17,000 x g for 10 min at 4 °C.
        12. The supernatant was carefully disposed of. Pellet was washed with 70% ice-cold ethanol, and centrifuged at 17,000 x g for 10 min at 4 °C.
        13. Pellet was air dried and the final DNA was suspended in 200 μl of double distilled water or TE buffer.
        14. DNA concentration and quality were analyzed on a 0.7% TAE agarose gel and with a Nanodrop spectrophotometer.
      2. Genotyping on Fluidigm – Genotyping followed the standard Fluidigm EP1 protocols for the FR96.96 chip, with four no-template controls (NTCs) instead of one.
        Briefly, the protocol is divided into two major sub-protocols, pre-amplification and the assay itself:
        1. First, the specific target amplification (STA) protocol is performed to obtain an approximately equal proportion of each target, by running the following steps:
          1) Preparing the 10x SNPtype STA primer pool for 96 assays.
          2) Performing STA on a PCR machine.
          3) Dilution of samples (the diluted product is used in step 4 of the second part).
        2. Second, genotyping with specific target primers is performed in a Fluidigm 96.96 dynamic genotyping array on the EP1 platform as follows:
          1) Priming the 96.96 Dynamic Array™ IFC.
          2) Preparing SNPtype assay mixes.
          3) Preparing 10x Assays.
          4) Preparing Sample Pre-Mix and Samples.
          5) Loading the Chip.
          6) Using the FC1™ Cycler.
          7) Using the EP1™ Reader Data Collection Software.
          8) Extracting the data for downstream bioinformatics analysis.
        The full protocol description can be found at (http://www.mscience.com.au/upload/pages/fluidigmtech/fluidigm-snp-genotyping-user-guide-151112.pdf).
    7. Filtering qualified SNPs for diversity analysis
      Fluidigm genotype calls for each locus are assigned to one of four quality categories by visual inspection (Table 1). SNPs and samples are then filtered as follows:
      1. Filtering out SNPs with a Category ≥ 2 (Table 1).
      2. Filtering out SNPs with more than 10% no calls.
      3. Filtering out samples with more than 33% no calls.
      4. Filtering out markers with PIC < 0.1.
      5. Filtering out markers in which more than 90% of the samples have the same call, i.e., segregate in exactly the same way.
      6. Filtering out markers with fewer than 2 samples in each genotype.
      7. Leaving only one marker from each pair of linked markers (r^2 > 0.7).
      8. Leaving only one sample from each group of samples with identical genotypes (identity ≥ 0.95).
      Notes:
      1. These steps can be performed with any programming language, e.g., R, Python, Perl or C, or with SQL.
      2. PIC is calculated as PIC = 1 - ∑ pi^2; i = a, A, where pi is the frequency of allele i.
      3. r^2 is calculated as r^2 = D^2/(p1*p2*q1*q2), where D = (p11*p22) - (p12*p21); p11, p22, p12, p21 are the proportions of all possible allele combinations at two bi-allelic loci, and p1, q1 and p2, q2 are the allele frequencies at the first and second locus, respectively.
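The formulas in the notes above can be written out directly. The sketch below is a minimal illustration under stated assumptions: genotypes are encoded as the number of copies of allele A (0, 1 or 2) with None for a no call, and the two-locus combination frequencies p11, p12, p21, p22 are supplied by the caller, because estimating them from unphased genotype calls is a separate problem not covered here. The function names and the encoding are hypothetical.

```python
from typing import List, Optional, Sequence

Call = Optional[int]  # copies of allele A (0, 1, 2); None = no call


def no_call_rate(calls: Sequence[Call]) -> float:
    """Fraction of samples without a genotype call for one marker."""
    return sum(c is None for c in calls) / len(calls)


def pic(calls: Sequence[Call]) -> float:
    """PIC = 1 - sum(p_i^2) over the two alleles (a, A) of a bi-allelic marker."""
    called = [c for c in calls if c is not None]
    p = sum(called) / (2 * len(called))  # frequency of allele A
    q = 1.0 - p                          # frequency of allele a
    return 1.0 - (p * p + q * q)


def r_squared(p11: float, p12: float, p21: float, p22: float) -> float:
    """r^2 = D^2 / (p1*q1*p2*q2) with D = p11*p22 - p12*p21."""
    d = p11 * p22 - p12 * p21
    p1 = p11 + p12  # allele frequency at the first locus
    p2 = p11 + p21  # allele frequency at the second locus
    denom = p1 * (1.0 - p1) * p2 * (1.0 - p2)
    return d * d / denom if denom > 0 else float("nan")


def marker_passes(calls: Sequence[Call], max_no_call: float = 0.10, min_pic: float = 0.1) -> bool:
    """Apply the no-call and PIC thresholds from the filtering list."""
    return no_call_rate(calls) <= max_no_call and pic(calls) >= min_pic


# Made-up genotype calls for illustration only.
marker: List[Call] = [0, 1, 2, 1, 1, 2, 0, 1, 1, 1]
print(round(pic(marker), 3), marker_passes(marker))
print(r_squared(0.5, 0.0, 0.0, 0.5))
```

`marker_passes` assumes at least one sample has a call; a marker with no calls at all should be dropped before computing PIC.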

      Table 1. Quality scores of each locus genotyping calls given by visual inspection.


Recipes

  1. Extraction buffer
    100 mM Tris, pH 8
    20 mM EDTA
    1.5 M NaCl
    3% hexadecyltrimethylammonium bromide (CTAB)
    2% polyvinylpyrrolidone (PVP)
    1% β-mercaptoethanol
    All components except β-mercaptoethanol are dissolved by stirring for several hours, and the solution is autoclaved. β-mercaptoethanol is added just prior to tissue extraction.
  2. TE buffer
    10 mM Tris, pH 8.0
    1 mM EDTA


Acknowledgments

This protocol was developed in a study supported by the Chief Scientist of the Ministry of Agriculture and Rural Development [Grant No.: 203-0859-12].


References

  1. Bolger, A. M., Lohse, M. and Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15): 2114-2120.
  2. Bowtie - An ultrafast memory-efficient short read aligner. Johns Hopkins University.
  3. Chevreux, B., Pfisterer, T., Drescher, B., Driesel, A. J., Muller, W. E., Wetter, T. and Suhai, S. (2004). Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res 14(6): 1147-1159.
  4. Fluidigm SNP Genotyping Guide. Fluidigm.
  5. FASTX-Toolkit. Hannonlab.
  6. Gotz, S., Garcia-Gomez, J. M., Terol, J., Williams, T. D., Nagaraj, S. H., Nueda, M. J., Robles, M., Talon, M., Dopazo, J. and Conesa, A. (2008). High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic Acids Res 36(10): 3420-3435.
  7. Koboldt, D. C., Chen, K., Wylie, T., Larson, D. E., McLellan, M. D., Mardis, E. R., Weinstock, G. M., Wilson, R. K. and Ding, L. (2009). VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25(17): 2283-2285.
  8. Kofler, R., Schlotterer, C. and Lelley, T. (2007). SciRoKo: a new tool for whole genome microsatellite search and investigation. Bioinformatics 23(13): 1683-1685.
  9. MIcroSAtellite identification tool.
  10. Reading/writing/editing/indexing/viewing SAM/BAM/CRAM format. Samtools.
  11. Rice, P., Longden, I. and Bleasby, A. (2000). EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16(6): 276-277.
  12. Sff_extract. Bioinformatics at COMAV.
  13. Sherman, A., Rubinstein, M., Eshed, R., Benita, M., Ish-Shalom, M., Sharabi-Schwager, M., Rozen, A., Saada, D., Cohen, Y. and Ophir, R. (2015). Mango (Mangifera indica L.) germplasm diversity based on single nucleotide polymorphisms derived from the transcriptome. BMC Plant Biol 15: 277.


Copyright: © 2016 The Authors; exclusive licensee Bio-protocol LLC.
Citation: Sharabi-Schwager, M., Rubinstein, M., Ish shalom, M., Eshed, R., Rozen, A., Sherman, A., Cohen, Y. and Ophir, R. (2016). Experimental Pipeline for SNP and SSR Discovery and Genotyping Analysis of Mango (Mangifera indica L.). Bio-protocol 6(16): e1910. DOI: 10.21769/BioProtoc.1910.