Gene Networks Based on the Graphical Gaussian Model




Genome Research
Nov 2007


This protocol describes how to build a gene network based on the graphical Gaussian model (GGM) from large-scale microarray data. A GGM uses the partial correlation coefficient (pcor) to infer co-expression relationships between genes. Compared with the traditional Pearson's correlation coefficient, partial correlation is a better measure of direct dependency between genes. However, calculating pcor directly requires the number of observations (microarray slides) to greatly exceed the number of variables (genes). This protocol uses a regularized method to circumvent this obstacle and is capable of building a network for ~20,000 genes from ~2,000 microarray slides. For more details, see Ma et al. (2007). For help regarding the script, please contact the author.
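
To illustrate why partial correlation is a better indicator of direct dependency, here is a minimal Python sketch on simulated data. It is not part of the published protocol: the three "genes", the noise levels, and the function names are assumptions made for illustration, and it uses the textbook first-order formula for three variables rather than the regularized estimator in GeneNet. Gene A drives gene B and gene B drives gene C, so A and C are strongly Pearson-correlated even though they are linked only through B.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Simulated chain gene_a -> gene_b -> gene_c (illustrative data only).
rng = random.Random(0)
n_slides = 5000
gene_a = [rng.gauss(0, 1) for _ in range(n_slides)]
gene_b = [a + rng.gauss(0, 0.5) for a in gene_a]
gene_c = [b + rng.gauss(0, 0.5) for b in gene_b]

r_ac = pearson(gene_a, gene_c)                           # indirect link, still high
pcor_ac = partial_correlation(gene_a, gene_c, gene_b)    # close to zero
pcor_ab = partial_correlation(gene_a, gene_b, gene_c)    # direct link, clearly non-zero

print("Pearson(A, C)  = %.3f" % r_ac)
print("pcor(A, C | B) = %.3f" % pcor_ac)
print("pcor(A, B | C) = %.3f" % pcor_ab)
```

With thousands of genes the conditioning set is every other gene, which is where the requirement for many observations, and the regularized estimate used in this protocol, come in.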

Data and Software

  1. Data
    Large-scale microarray data:
    The microarray data should be derived from the same platform, preferably from Affymetrix slides. Good examples are the Affymetrix Arabidopsis ATH1 Genome Array, the Affymetrix Human Genome U133 Plus 2.0 Array, and the Affymetrix Mouse Genome 430 2.0 Array. A recommended place to search for this type of data is the Gene Expression Omnibus (GEO) at NCBI (http://www.ncbi.nlm.nih.gov/geo/). The number of slides should be larger than 1,000.
  2. Software
    1. R (http://www.r-project.org/)
    2. The GeneNet package for R (http://www.uni-leipzig.de/~strimmer/lab/software/genenet/index.html)
    3. Cytoscape (http://www.cytoscape.org/)
    4. Perl and C++ software environment


Equipment

  1. Personal computer: Intel Core2 E6420 processor (or similar processing capability)


Procedure

  1. Preparation of the microarray data
    1. Download the microarray data from your preferred database and format it into a single table of expression intensities, with every row representing a gene and every column representing a microarray experiment. A good example for Arabidopsis transcriptomes can be found here: http://affy.arabidopsis.info/narrays/help/usefulfiles.html. You can use the file titled "super bulk gene download".
    2. Remove any columns (experiments) containing a large number of ‘null’ measurements, and then remove any genes that still contain ‘null’ measurements.
    3. Normalize the expression intensities between experiments using the quantile normalization method.
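
The preparation steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the table is held as a dictionary from gene ID to a list of intensities, ‘null’ is represented by `None`, and the threshold `max_null_fraction` is a hypothetical parameter, since the protocol does not say how many ‘null’ values count as a large number. The quantile normalization is the basic rank-mean version and breaks ties by row order; the function names are not from any published script.

```python
def drop_sparse_columns(table, max_null_fraction=0.1):
    """Remove experiments (columns) whose fraction of None entries exceeds the limit."""
    rows = list(table.values())
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols)
            if sum(row[j] is None for row in rows) / len(rows) <= max_null_fraction]
    return {gene: [row[j] for j in keep] for gene, row in table.items()}


def drop_incomplete_genes(table):
    """Remove genes (rows) that still contain any None entry."""
    return {gene: row for gene, row in table.items() if None not in row}


def quantile_normalize(table):
    """Give every experiment the same intensity distribution (rank-mean method)."""
    genes = list(table)
    n_cols = len(table[genes[0]])
    columns = [[table[g][j] for g in genes] for j in range(n_cols)]
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the k-th smallest intensity across all experiments.
    rank_means = [sum(col[k] for col in sorted_cols) / n_cols
                  for k in range(len(genes))]
    normalized = {g: [0.0] * n_cols for g in genes}
    for j, col in enumerate(columns):
        order = sorted(range(len(genes)), key=lambda i: col[i])
        for rank, i in enumerate(order):
            normalized[genes[i]][j] = rank_means[rank]
    return normalized


# Toy table: 4 genes x 4 experiments (made-up numbers).
raw_table = {
    "g1": [5.0, 4.0, 3.0, None],
    "g2": [2.0, 1.0, 4.0, None],
    "g3": [3.0, 5.0, 6.0, 1.0],
    "g4": [4.0, 2.0, None, 2.0],
}
filtered = drop_incomplete_genes(drop_sparse_columns(raw_table, max_null_fraction=0.25))
normalized = quantile_normalize(filtered)
for gene, row in normalized.items():
    print(gene, ["%.3f" % v for v in row])
```

Columns are filtered before rows so that a few poor experiments do not cause most genes to be discarded, which matches the order given in step 2.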

  2. Random sampling and partial correlation calculation
    1. Randomly pick 2,000 genes from the large expression table and make a small expression table for these 2,000 genes. A Perl script can be written to do this step.
    2. Use the GeneNet package to calculate the partial correlations between these 2,000 randomly selected genes. GeneNet should be launched within the R environment; the function to use is ‘ggm.estimate.pcor’ with its default settings.
    3. Save the resulting partial correlation matrix, together with the gene IDs for the 2,000 genes.
    4. Repeat steps 1 to 3 at least 1,999 more times; the more, the better. After these calculations, most gene pairs should have been sampled >10 times, each time with a calculated pcor.
    5. Determine the final pcor value for every gene pair by keeping the pcor with the smallest absolute value among all the rounds in which the pair was sampled. This is done by consolidating the resulting pcor matrices and should be done with a C++ program.
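
The sampling and consolidation logic of steps 1 and 5 can be sketched as below. The protocol suggests Perl and C++ for these steps; Python is swapped in here purely for illustration, and all function names and the toy matrices are assumptions. The pcor estimation itself (step 2) stays in R with ‘ggm.estimate.pcor’ and is not reproduced. The last helper estimates how often a given gene pair is expected to be sampled together, which for ~20,000 genes, 2,000 genes per round, and 2,000 rounds comes to about 20, consistent with most pairs being sampled more than 10 times.

```python
import random


def sample_genes(gene_ids, sample_size, rng):
    """Pick sample_size distinct gene IDs at random (step 1)."""
    return rng.sample(gene_ids, sample_size)


def consolidate_min_abs(rounds):
    """Keep, for every gene pair, the pcor with the smallest absolute value (step 5).

    rounds: iterable of (gene_ids, pcor_matrix), where pcor_matrix is a square
    list of lists ordered like gene_ids.
    Returns (final_pcor, times_sampled), both keyed by a sorted gene-ID pair.
    """
    final_pcor, times_sampled = {}, {}
    for gene_ids, matrix in rounds:
        for i in range(len(gene_ids)):
            for j in range(i + 1, len(gene_ids)):
                pair = tuple(sorted((gene_ids[i], gene_ids[j])))
                value = matrix[i][j]
                times_sampled[pair] = times_sampled.get(pair, 0) + 1
                if pair not in final_pcor or abs(value) < abs(final_pcor[pair]):
                    final_pcor[pair] = value
    return final_pcor, times_sampled


def expected_pair_coverage(n_genes, sample_size, n_rounds):
    """Expected number of rounds in which a given gene pair is sampled together."""
    p_pair = (sample_size * (sample_size - 1)) / (n_genes * (n_genes - 1))
    return n_rounds * p_pair


# Two toy rounds with made-up pcor values; g1 and g3 appear together in both.
rounds = [
    (["g1", "g2", "g3"], [[1.0, 0.30, -0.05],
                          [0.30, 1.0, 0.20],
                          [-0.05, 0.20, 1.0]]),
    (["g3", "g1", "g4"], [[1.0, 0.02, 0.40],
                          [0.02, 1.0, -0.10],
                          [0.40, -0.10, 1.0]]),
]
final_pcor, times_sampled = consolidate_min_abs(rounds)
coverage = expected_pair_coverage(20000, 2000, 2000)
subset = sample_genes(["g%d" % k for k in range(1, 21)], 5, random.Random(1))
print(final_pcor[("g1", "g3")], times_sampled[("g1", "g3")])
print("expected coverage per pair: %.1f" % coverage)
```

Keeping the smallest absolute value is the conservative choice: a pair retains a large pcor only if its association stays strong in every sampled gene set.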

  3. Network building and analysis
    1. To test the significance of the resulting pcors, use the function ‘ggm.test.edges’ in GeneNet. About 2,000,000 pcors can be randomly selected from the full set and fed into the function to calculate a p-value for significance.
    2. Depending on the p-values, set a cutoff for the pcors. Reasonable cutoffs to try are 0.1, 0.08, and 0.05. Retain any pcor whose absolute value is larger than the chosen cutoff.
    3. Apply a Pearson's correlation coefficient filter: remove gene pairs whose Pearson's correlation coefficient lies between -0.3 and 0.3.
    4. After the pcor selection and the Pearson's correlation coefficient filter, the remaining gene pairs are considered to interact with each other and can be used to build a gene network in Cytoscape. The network analysis can be done within Cytoscape itself.
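
The two filters in steps 2 and 3 can be combined as in the following sketch. It assumes the consolidated pcors and the Pearson's correlation coefficients are already available as dictionaries keyed by gene pair; the function names, the toy values, and the column headers are hypothetical. Whether a value exactly equal to a cutoff is kept is not specified in the protocol, so strict inequalities are an assumption here. The output is a plain tab-delimited edge table, one common way to hand an edge list to Cytoscape's network import.

```python
def filter_edges(pcor, pearson, pcor_cutoff=0.05, pearson_cutoff=0.3):
    """Keep pairs with |pcor| > pcor_cutoff and |Pearson r| > pearson_cutoff."""
    edges = []
    for pair, value in pcor.items():
        r = pearson[pair]
        if abs(value) > pcor_cutoff and abs(r) > pearson_cutoff:
            edges.append((pair[0], pair[1], value, r))
    return sorted(edges)


def to_edge_table(edges):
    """Format edges as tab-delimited text with a header row."""
    lines = ["source\ttarget\tpcor\tpearson"]
    lines += ["%s\t%s\t%.4f\t%.4f" % edge for edge in edges]
    return "\n".join(lines) + "\n"


# Made-up values for four gene pairs.
pcor = {("g1", "g2"): 0.12, ("g1", "g3"): 0.02,
        ("g2", "g3"): -0.09, ("g1", "g4"): 0.15}
pearson = {("g1", "g2"): 0.65, ("g1", "g3"): 0.70,
           ("g2", "g3"): -0.45, ("g1", "g4"): 0.10}

edges = filter_edges(pcor, pearson, pcor_cutoff=0.08)
table = to_edge_table(edges)
print(table)
```

In this toy input the pair g1/g3 is dropped because its pcor is below the cutoff despite a high Pearson's correlation, and g1/g4 is dropped by the Pearson filter despite a high pcor.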


Acknowledgments

This protocol was developed by the author in Hans Bohnert’s lab, Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA. The work was supported by grants from the National Science Foundation Plant Genome Project (DBI-0223905) and University of Illinois at Urbana-Champaign institutional grants.


References

  1. Ma, S., Gong, Q. and Bohnert, H. J. (2007). An Arabidopsis gene network based on the graphical Gaussian model. Genome Res 17(11): 1614-1625.
  2. Schafer, J. and Strimmer, K. (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol 4: Article 32.


Copyright: © 2012 The Authors; exclusive licensee Bio-protocol LLC.
Citation: Ma, S. (2012). Gene Networks Based on the Graphical Gaussian Model. Bio-protocol 2(4): e119. DOI: 10.21769/BioProtoc.119.



Prashanth Suravajhala
Birla Institute of Scientific Research
This was a very useful protocol indeed. Yes, to a larger extent! While proposing a six-point classification scoring schema for predicting the function of hypothetical proteins, we wondered whether two interacting proteins shown in our proposed hypothome (interactOME of HYPOTHetical proteins) could be co-expressed. We checked the transcriptomic profiles, although we used GUI-based web tools to draw inferences from this protocol.
5/12/2015 2:07:30 AM