C. Network identification with weighted gene correlation network analysis

James W. Bogenpohl
Kristin M. Mignogna
Maren L. Smith
Michael F. Miles

While multiple methods exist for clustering or network analysis of genomic data, Weighted Gene Correlation Network Analysis (WGCNA) is a very widely used algorithm, implemented as an R software package, for identifying groups of correlated genes, known as modules, within microarray or other suitable data (Zhang and Horvath 2005). WGCNA is based on scale-free network topology, a model that assumes a network contains a small number of highly connected nodes. For transcriptomic data these nodes are referred to as ‘hub genes’. Because of their high connectivity, hub genes represent potential therapeutic targets for altering ethanol-responsive gene expression in the brain and, potentially, ethanol-related behaviors. WGCNA involves multiple analysis steps that are outlined in detail in a series of R tutorials produced by the Horvath laboratory. These extensive tutorials, and the associated papers applying WGCNA, are a major reason for the popularity of this approach. Instructions for obtaining the WGCNA R package and its dependencies can be found at: https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/. A brief overview of a sample WGCNA of microarray data from ethanol-treated mouse brain follows; the actual commands and options for the analysis can be found in the tutorials on the Horvath web site. Quite often, copying and pasting those commands and changing file names or parameters is sufficient for carrying out an initial analysis.

An appropriate dataset must be chosen for WGCNA. As with any clustering technique, it is essential to have substantial biological variation across samples and enough samples for the correlation networks to have sufficient statistical power. Although firm thresholds are difficult to define, Iancu and colleagues, using a large microarray dataset from mouse striatum, determined that n ≥ 35 appeared optimal for defining network topology [4].

To identify networks with the most meaningful correlations, a variance filter is often applied first, for example to a ComBat-corrected microarray dataset. This filter eliminates genes showing minimal variation in expression across samples. The median absolute deviation (MAD) is our preferred method for quantifying variance: for each gene, it is the median of the absolute differences between each sample's expression value and that gene's median expression across samples.

A histogram of MAD values is plotted to identify the lower tail of the variance distribution. The number of genes that can be included in network analysis may be limited by computing power. Ideally, however, the proportion of total variance retained should be calculated at regular intervals of MAD to find a threshold above which the majority of the variation in the data is included. Genes below this threshold may be assumed to represent noise and are excluded from further analysis.
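As a rough illustration of the variance filter described above, the following Python sketch computes the MAD for each gene and keeps genes at or above a chosen threshold. It is not part of the WGCNA package or the authors' pipeline; the function names, the toy expression values, and the threshold of 0.5 are assumptions made only for this example.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    center = median(values)
    return median(abs(v - center) for v in values)


def mad_filter(expression, threshold):
    """Return (kept genes, MAD per gene) for a {gene: [values per sample]} dict.

    Genes whose MAD across samples is below `threshold` are dropped.
    """
    scores = {gene: mad(values) for gene, values in expression.items()}
    kept = {g: v for g, v in expression.items() if scores[g] >= threshold}
    return kept, scores


# Hypothetical log2 expression values (genes x samples), for illustration only.
expression = {
    "GeneA": [7.0, 7.1, 6.9, 7.0, 7.1],    # nearly flat across samples
    "GeneB": [5.0, 8.0, 6.5, 9.0, 4.0],    # strongly variable
    "GeneC": [10.0, 10.0, 10.1, 9.9, 10.0],  # nearly flat across samples
}

# The threshold is an assumed value; in practice it is read off the histogram.
kept, scores = mad_filter(expression, threshold=0.5)
for gene, score in scores.items():
    print(gene, round(score, 3), "kept" if gene in kept else "dropped")
```

In a real analysis the threshold would come from inspecting the MAD histogram rather than being fixed in advance.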

The next step in WGCNA involves loading the gene expression data and modeling to determine the soft-thresholding power at which the data structure best fits scale-free topology. This is done using the pickSoftThreshold() function in the WGCNA package, which outputs a table of scale-free fit metrics at various soft-thresholding powers (Table 1). We use the scale-free fit index (SFT.R.sq, the R^2 of the scale-free model fit) as our primary measure of scale-free fit. A scale-free fit index of 0.9 or greater is ideal; however, a value of 0.75 or greater can be acceptable.

Table 1. Scale-free fit metrics produced by the pickSoftThreshold() function in WGCNA. The results show that a power of ≥ 5 yields acceptable scale-free fit index values (SFT.R.sq).
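The selection rule described above (prefer a fit index of at least 0.9, otherwise accept at least 0.75) can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the (power, SFT.R.sq) pairs are invented rather than taken from Table 1, and choosing the lowest power that meets the cutoff is an assumption based on common WGCNA practice, not a rule stated in this protocol.

```python
def choose_power(fit_table, ideal=0.9, acceptable=0.75):
    """Pick a soft-thresholding power from (power, fit_index) pairs.

    Returns (power, level) for the lowest power reaching the ideal cutoff,
    falling back to the acceptable cutoff; (None, None) if neither is met.
    """
    for cutoff, level in ((ideal, "ideal"), (acceptable, "acceptable")):
        for power, fit in sorted(fit_table):
            if fit >= cutoff:
                return power, level
    return None, None


# Hypothetical (power, SFT.R.sq) pairs; NOT the values from Table 1.
fit_table = [
    (1, 0.12), (2, 0.35), (3, 0.58), (4, 0.70),
    (5, 0.78), (6, 0.84), (7, 0.88), (8, 0.91),
]

power, level = choose_power(fit_table)
print("chosen power:", power, "(" + level + " fit)")
```

If no tested power reaches 0.9, the same call falls back to the lowest power with a fit index of at least 0.75.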

Network construction is the next step in WGCNA. Both manual and automatic network construction and module identification are outlined in detail in the WGCNA tutorials. During module identification, genes are hierarchically clustered by topological overlap distance, and the result can be visualized as a cluster dendrogram. One parameter that deserves particular attention during module construction is “deep-split” (the deepSplit argument in WGCNA). Deep-split fine-tunes the sensitivity of module detection by adjusting the branch-cutting threshold within the dendrogram. Multidimensional scaling plots, with the first and second principal components as the x- and y-axes, are another useful way to visualize modules and identify an optimal deep-split value. We consider a deep-split value optimal when there is minimal spatial overlap between modules.
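To make "topological overlap distance" more concrete, the sketch below computes an unsigned soft-threshold adjacency, a_ij = |cor(x_i, x_j)|^power, and then the topological overlap dissimilarity, 1 - (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), where l_ij sums a_iu * a_uj over shared neighbors u and k_i is the connectivity of gene i (Zhang and Horvath 2005). It is written in plain Python for readability and is not the WGCNA implementation. The four toy expression profiles and the power of 6 are assumptions for illustration, and every profile is assumed to be non-constant.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def adjacency(profiles, power):
    """Unsigned soft-threshold adjacency: a_ij = |cor(x_i, x_j)| ** power."""
    n = len(profiles)
    adj = [[0.0] * n for _ in range(n)]  # diagonal stays 0 (no self-links)
    for i in range(n):
        for j in range(i + 1, n):
            a = abs(pearson(profiles[i], profiles[j])) ** power
            adj[i][j] = adj[j][i] = a
    return adj


def tom_dissimilarity(adj):
    """Topological overlap dissimilarity (1 - TOM) for a symmetric adjacency."""
    n = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each gene
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            tom = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
            dist[i][j] = 1 - tom
    return dist


# Hypothetical expression profiles (4 genes x 5 samples), illustration only.
profiles = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 11],   # closely tracks the first gene
    [5, 1, 4, 2, 3],
    [3, 5, 1, 4, 2],
]

adj = adjacency(profiles, power=6)  # the power of 6 is an assumed value
dist = tom_dissimilarity(adj)
for row in dist:
    print([round(d, 3) for d in row])
```

Genes with tightly correlated profiles end up with a dissimilarity near 0, so hierarchical clustering on this matrix places them on the same branch of the dendrogram.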

Both modules and individual genes can then be correlated with phenotype data. Individual gene correlations are performed using gene expression measures such as RMA values. Modules are correlated with phenotype through their first principal component, known in WGCNA as the module eigengene, which summarizes the largest share of gene expression variance within each module. Phenotype data can include many variables, from behavioral measures to technical variables such as the RNA quality index of each microarray sample. This network correlation analysis is one of the most powerful features of WGCNA. Ideally, phenotypic and genomic data are derived from the same individual animals. Because of the limited sample sizes often seen in microarray studies, we recommend Spearman rank correlation rather than Pearson correlation to minimize the influence of outliers. An example of network correlations with phenotypic data is shown in Figure 2.
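The reason for preferring Spearman rank correlation in small samples can be seen in a short Python sketch. The eigengene and phenotype values below are invented for illustration, and the helper names are hypothetical. The example computes both coefficients for the same data with and without a single extreme phenotype value.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical module eigengene values for six animals, illustration only.
eigengene = [-0.4, -0.3, -0.1, 0.0, 0.2, 0.6]
# Hypothetical phenotype scores; the second list has one extreme value.
phenotype_clean = [1.0, 1.2, 1.1, 1.4, 1.5, 1.6]
phenotype_outlier = [1.0, 1.2, 1.1, 1.4, 1.5, 9.0]

r_clean = pearson(eigengene, phenotype_clean)
r_outlier = pearson(eigengene, phenotype_outlier)
rho_clean = spearman(eigengene, phenotype_clean)
rho_outlier = spearman(eigengene, phenotype_outlier)

print("Pearson :", round(r_clean, 3), "vs", round(r_outlier, 3))
print("Spearman:", round(rho_clean, 3), "vs", round(rho_outlier, 3))
```

Because the extreme value does not change the rank order of the animals, the Spearman coefficient is identical in both cases, whereas the Pearson coefficient shifts.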

Module membership correlation with gene significance. WGCNA plot of the member genes from one module. The x-axis shows the module membership score, where higher values represent genes with greater connectivity. The y-axis shows gene significance, measured as the correlation of expression values with a trait of interest. Genes whose expression is more highly correlated with the trait of interest and that show higher connectivity (top right corner) are high-value “hub genes”. A high correlation between module membership and gene significance strongly suggests that this module is involved in the biological mechanisms of the trait.

The identification of hub genes within modules is one of the final steps in WGCNA. Connectivity is usually the primary variable by which we identify hub genes. WGCNA produces several connectivity metrics with the function intramodularConnectivity(). For each gene, this function outputs its total connectivity within the network (kTotal), its connectivity within its assigned module (kWithin), the difference between kTotal and kWithin (kOut), and the difference between kWithin and kOut (kDiff). The function also offers an option (scaleByMax) to scale within-module connectivity (kWithin) relative to the maximum kWithin in each module, which makes values comparable across modules of different sizes. We recommend scaling, as this eases identification of the most highly connected genes within their respective modules independent of module size.
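The four connectivity measures defined above can be written out directly from an adjacency matrix and a module assignment. The Python sketch below is an illustration of those definitions, not the WGCNA function itself. The adjacency values and module labels are invented, and reporting the scaled value as a separate field (kWithin divided by the largest kWithin in the same module) is a choice made for this example.

```python
def intramodular_connectivity(adj, modules):
    """Compute kTotal, kWithin, kOut, kDiff and a scaled kWithin per gene.

    adj     : symmetric adjacency matrix as a list of lists
    modules : module label for each gene, in the same order as adj
    """
    n = len(adj)
    rows = []
    for i in range(n):
        k_total = sum(adj[i][j] for j in range(n) if j != i)
        k_within = sum(adj[i][j] for j in range(n)
                       if j != i and modules[j] == modules[i])
        k_out = k_total - k_within
        rows.append({
            "module": modules[i],
            "kTotal": k_total,
            "kWithin": k_within,
            "kOut": k_out,
            "kDiff": k_within - k_out,
        })
    # Scale kWithin by the module maximum so modules of any size are comparable.
    for module in set(modules):
        members = [r for r in rows if r["module"] == module]
        top = max(r["kWithin"] for r in members)
        for r in members:
            r["kWithinScaled"] = r["kWithin"] / top if top > 0 else 0.0
    return rows


# Hypothetical adjacency matrix and module labels, for illustration only.
adj = [
    [0.0, 0.8, 0.6, 0.1],
    [0.8, 0.0, 0.4, 0.2],
    [0.6, 0.4, 0.0, 0.0],
    [0.1, 0.2, 0.0, 0.0],
]
modules = ["blue", "blue", "blue", "brown"]

rows = intramodular_connectivity(adj, modules)
blue = [i for i, r in enumerate(rows) if r["module"] == "blue"]
hub = max(blue, key=lambda i: rows[i]["kWithinScaled"])
print("most connected gene in the blue module: index", hub)
```

After scaling, the most connected gene in each module has a value of 1, so candidate hub genes can be compared across modules regardless of how many genes each module contains.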

Modules identified by WGCNA can be further interrogated for biological function enrichment using a wide variety of standard bioinformatics tools, such as those for Gene Ontology or pathway analysis. Our laboratory frequently uses free web-based resources such as DAVID (http://david.abcc.ncifcrf.gov), ToppGene (https://toppgene.cchmc.org), WebGestalt (http://bioinfo.vanderbilt.edu/webgestalt/) or REVIGO (http://revigo.irb.hr) to identify and display over-represented functional categories. Additionally, module gene lists can be submitted to resources that construct networks based upon other biological information, such as protein-protein interactions, published biological interactions, or transcription factor binding site analyses. Such tools include GeneMANIA (http://genemania.org) and the subscription-based Ingenuity Pathway Analysis (http://www.ingenuity.com). These tools can, in effect, validate the network structure of WGCNA-derived expression correlation networks.

WGCNA networks can be further validated and ranked by quantifying their overlap with other user-defined gene lists obtained from differing biological contexts or from expression genetics datasets. For example, we frequently interrogate WGCNA modules for overlap with public gene sets or our own within the GeneWeaver web-based application (www.geneweaver.org). Finally, WGCNA module genes can be interrogated for correlations in other expression datasets, phenotypic correlations, and conserved genetic regulators using the rich resources available within GeneNetwork (www.genenetwork.org).
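One simple way to quantify overlap between a module and another gene list is sketched below in Python. It reports the shared genes, the Jaccard index, and a one-sided hypergeometric p-value for seeing at least that many shared genes by chance. These particular statistics are an assumption made for illustration; the protocol does not state which measure is used, and this is not a description of how GeneWeaver works. The gene labels and the background size of 20 are invented, and both lists are assumed to be drawn from the same background.

```python
from math import comb


def overlap_stats(module_genes, gene_set, background_size):
    """Return (shared genes, Jaccard index, hypergeometric p-value).

    The p-value is P(X >= observed overlap) when len(module_genes) genes are
    drawn at random from a background containing len(gene_set) set members.
    """
    module, other = set(module_genes), set(gene_set)
    shared = module & other
    jaccard = len(shared) / len(module | other)

    n_bg, n_set, n_mod, n_hit = background_size, len(other), len(module), len(shared)
    tail = sum(comb(n_set, i) * comb(n_bg - n_set, n_mod - i)
               for i in range(n_hit, min(n_set, n_mod) + 1))
    p_value = tail / comb(n_bg, n_mod)
    return shared, jaccard, p_value


# Hypothetical gene labels and background size, for illustration only.
module_genes = ["g1", "g2", "g3", "g4", "g5"]
gene_set = ["g3", "g4", "g5", "g9"]

shared, jaccard, p_value = overlap_stats(module_genes, gene_set, background_size=20)
print(sorted(shared), round(jaccard, 2), round(p_value, 4))
```

With many modules and many gene lists, the resulting p-values would need correction for multiple testing before modules are ranked.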
