2.3. Weighted Gene Co-Expression Analysis (WGCNA) of Datasets

Jon Patrick T. Garcia; Lemmuel L. Tayo

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

2.3. Weighted Gene Co-Expression Analysis (WGCNA) of Datasets

JG Jon Patrick T. Garcia

LT Lemmuel L. Tayo

This method is extracted from research article: Genes (Basel), Mar 2024

Theoretical Studies of DNA Microarray Present Potential Molecular and Cellular Interconnectivity of Signaling Pathways in Immune System Dysregulation

DOI: 10.3390/genes15040393

Request a Protocol

Ask a question

Favorite

Using Bioconductor v3.18 [https://www.bioconductor.org/ (accessed on 10 January 2024)], which is an open software equipped with features for computational biology studies, each dataset was loaded into the R program v4.3.2 [https://cran.r-project.org/bin/windows/base/ (downloaded on 24 January 2024)] for pre-processing. Along with that, the “affy” and “biomaRt” packages were also installed, the former of which is intended for oligonucleotide array analysis and the latter of which is used for the easy and uniform retrieval of large data without the need for complex database schemas [22]. Each raw data point obtained from the datasets was normalized using the Robust Microarray Analysis (RMA) function in the program. Then, the expression data of the four datasets were extracted, unnecessary extensions found in the samples were eliminated, and control probes were removed. After which, the mean and variance of the expression data were calculated and the 20% percentile of each was determined. The expression data were filtered such that only those genes that obtained a mean and variance higher than the 20% cutoff were retained. IDs of common probes among the expression data were converted to gene symbols, and those entries without gene symbols were removed. Finally, the expression data were subjected to log-2 transformation using the “goodSamplesGenesMS” function to filter out genes that contained missing values. Initial weighted gene co-expression networks were also generated with the aid of scattered plot diagrams to check the comparability of the datasets before modularization. Each dataset was compared to the others, with plots showing the following dataset comparisons: T1D vs. PSR, T1D vs. SSc, T1D vs. SLE, PSR vs. SSc, PSR vs. SLE, and SSc vs. SLE.

The soft-thresholding power (β) was determined by plotting the network topology of each dataset, as well as the mean and median connectivity fits to confirm the result, using the “pickSoftThreshold” function from the WGCNA package. The value of β at which the plots begin to flatten out was chosen as the soft-thresholding power in constructing the respective scale-free topologies of the datasets. To further evaluate the β of choice, scatter plots were generated, and the correlation coefficient of each was calculated. The dataset with the highest fit at a particular value of β and the highest correlation coefficient was used as the reference dataset for succeeding analyses.

The chosen value of β was then utilized to calculate the adjacency matrices of the reference dataset using the “adjacency” function through a “signed” network type via Pearson’s correlation. After this, the results were subjected to topological overlap measure (TOM) dissimilarity using the “1-TOMsimilarity” function through a “signed” TOM type. This was to perform hierarchical clustering of the genes in the reference dataset to identify co-expression networks or functional modules within the clusters. A dendrogram was then plotted to show the tips of the branches, which are associated with those highly correlated genes found in the reference dataset. These tips denote the clusters from which modules were identified. Aside from that, a dynamic tree-cutting algorithm was also utilized using the hybrid tree cut. A deep-split parameter ranging from 0 to 3 was evaluated to control the sensitivity of the algorithm, as well as to better visualize the distribution of the modules per deep split.

Using the “modulepreservation” function from the WGCNA package through a “signed” network type, the weighted gene co-expression network preservation of the reference dataset was calculated relative to the other three datasets. The number of permutations was set to 100, and the maximum module size was set to 10,658, which is the total number of genes obtained from the grey module. Those modules that were found to be highly preserved across the datasets, that is, modules that obtained a z-score >10 in the other three datasets, were used for succeeding analyses. Afterward, the modules were further processed by calculating the eigengene-based connectivity (kME) of each gene using the “moduleEigengenes” function from the WGCNA package to quantify their respective connectivity among the other genes within the module. The module membership of each dataset was then programmed by correlating the eigengene and expression profile of each gene, and these were ranked from highest to lowest within the modules. Scatter plot diagrams under a set p-value < 0.05 were drawn to visualize these correlations.

Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol