Quantification and statistical analysis

Samuel Morabito; Fairlie Reese; Negin Rahimzadeh; Emily Miyoshi; Vivek Swarup

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

Quantification and statistical analysis

SM Samuel Morabito

FR Fairlie Reese

NR Negin Rahimzadeh

EM Emily Miyoshi

VS Vivek Swarup

This method is extracted from research article: Cell Rep Methods, Jun 2023

hdWGCNA identifies co-expression networks in high-dimensional transcriptomics data

DOI: 10.1016/j.crmeth.2023.100498

Ask a question

Favorite

The key resources table details the different datasets used throughout this manuscript. We used several published datasets generated by our own group,¹²^,²⁸^,⁵⁶ and sequencing data was not re-downloaded for these studies. For all human snRNA-seq datasets, we applied a uniform processing pipeline to process each dataset starting from the raw sequencing data and resulting in an anndata object⁷² containing UMI counts, normalized gene expression, cluster identities, and cell type annotations. Parameters used throughout this processing pipeline vary slightly between different datasets, and all parameters are noted in the data processing scripts in our github repository. For each biological replicate, we used the kb count function from kallisto $|$ bustools⁸¹ to psuedoalign raw sequencing reads to the reference transcriptome and quantify gene expression attributed to each cell barcode. The human reference transcriptome (GRch38) was obtained from the $10 \times$ Genomics website (version 2020-A, July 2020), and was re-formatted for use with kallisto $|$ bustools using the kb ref function. For each of the UMI counts matrices, we used the remove-background function from cellbender⁷⁴ to simultaneously identify which barcodes corresponded to cells and to remove counts attributed to ambient RNA. We then used scrublet⁷³ to compute ”doublet scores”, the likelihood of each barcode mapping to more than one cell. Counts matrices from each biological replicate in a given dataset are then merged into a single anndata object, and any relevant sample level meta-data (age, sex, disease status) was stored in the adata.obs table. We performed a percentile filtering of cells that were outliers from each dataset based on the number of UMI per cell the percentage of UMI attributed to mitochondrial genes per cell, and the doublet score. Filtering based on these criteria was performed in each sample, as well as dataset-wide. After filtering, downstream data processing steps were carried out with SCANPY.⁷² The UMI counts matrix was normalized with $\ln (CPM)$ using the functions sc.pp.normalize_total and sc.pp.log1p. Highly variable genes were identified using the function sc.pp.highly_variable_genes, and these genes are used as the features for downstream analysis steps such as principal component analysis (PCA). The normalized expression matrix was then scaled to unit variance and centered at zero using the function sc.pp.scale. PCA was performed on the scaled expression matrix using the function sc.tl.pca. Harmony²² was used to correct the PCA matrix for batch effects using the function sc.external.pp.harmony_integrate. The harmonized PCA matrix was then used to construct a cell neighborhood graph using the function sc.pp.neighbors. The cell neighborhood graph was then used to compute a two-dimensional representation of the data with uniform manifold approximation and projection¹⁹ using the function sc.tl.umap, and to group cells into clusters with Leiden clustering⁸² using the function sc.tl.leiden. We inspected the gene expression signatures in each Leiden cluster for a panel of canonical cell-type marker genes in order to assign a cell-type label to each cluster, and to identify additional doublet clusters that may have escaped the previous filtering steps. The distribution of quality control metrics was inspected in each cluster. We filtered out cells belonging to clusters that displayed conflicting expression of cell-type marker genes, or were outliers in their quality control metrics. After filtering these low-quality clusters, we ran UMAP and Leiden clustering again, resulting in the final processed dataset. We used a custom script to convert the datasets from anndata to SeuratObject by saving the individual components (counts matrix, cell meta-data, gene meta-data, dimensionality reductions, etc.) in Python and then loading them back into R to create a SeuratObject.

We performed an iterative co-expression network analysis of the major cell types (ASC, EX, INH, MG, ODC, OPC) in the human PFC snRNA-seq dataset from Zhou et al.,¹¹ only including samples from control brains (36,671 cells and 36,601 genes). We retained genes that were expressed in at least 5% of cells for downstream analysis. Metacells were computed separately for each major cell type and each sample using the hdWGCNA function MetacellsByGroups, aggregating 25 cells per metacell. Further, we ran MetacellsByGroups while varying the $K$ parameter in order to asses the resulting metacell expression matrix sparsity. For each cell type, we applied the following hdWGCNA commands with default arguments to perform network analysis: TestSoftPowers, ConstructNetwork, ModuleEigengenes, ModuleConnectivity, and RunModuleUMAP. We performed module preservation analysis²⁰ of the ODC co-expression modules in an external snRNA-seq dataset of the human PFC.¹² Modules were projected from the reference to query dataset using the hdWGCNA function ProjectModules, and the module preservation test was performed using ModulePreservation with 100 permutations.

For the purpose of co-expression network analysis, we compared our metacell aggregation approach (Algorithm 1) with two alternative approaches, namely Metacell2¹⁴ and SEACells.¹⁶ We ran the three metacell approaches using the recommended settings on the same dataset, and then ran hdWGCNA on each of the resulting metacell expression matrices. We used a scRNA-seq of 6,800 CD34⁺ hematopoietic stem and progenitor stem cells included with the SEACells package, and we used the cluster annotations from the original study. Notably, SEACells and Metacell2 do not account for cell labels in their aggregation procedures, which may result in a number of metacells containing transcriptomes from differently labeled cells. For the hdWGCNA metacell algorithm, we aggregated 50 cells per metacell. For the three metacell expression matrices derived from the different algorithms, we performed co-expression network analysis with the standard hdWGCNA pipeline by sequentially running the following functions with default parameters: TestSoftPowers, ConstructNetwork, ModuleEigengenes, ModuleConnectivity, and RunModuleUMAP. With the same cluster settings, Dynamic Tree Cut recovered a different number of co-expression modules for the three methods (hdWGCNA: 16 modules; MC2: 13 modules; SEACells: 20 modules). We performed pairwise comparisons between the gene modules detected with each metacell approach using Fisher’s exact test to test module overlaps. Additionally, we performed rank-rank hypergeometric overlap⁸³ (RRHO) tests using the RRHO function from the R package RRHO (version 1.13.0) to compare the kME ranking between modules across methods. To compare MEs and Seurat module scores, we ran the AddModuleScore function, and computed Pearson correlations between each ME and each module score.

We obtained a publicly available scRNA-seq dataset from Parse Biosciences of 1M peripheral blood mononuclear cells (PBMCs) from twelve healthy donors and twelve Type-1 diabetic donors generated using the Evercode Whole Transcriptome Mega protocol. This analysis was performed on a compute cluster with 200 GB of memory and eight CPU cores. The UMI counts matrix and sample meta data was downloaded from Parse Biosciences’ Website. We processed the counts matrix using SCANPY using a similar pipeline as described in the reprocessing published dataset section. For quality control, we excluded cells with greater than 25% mitochondrial reads, greater than 5,000 genes, and greater than 25,000 counts. After dimensionality reduction with PCA, Harmony²² batch correction, and Leiden clustering⁸² (resolution = 1), we annotated cell populations using PBMC marker genes obtained from Azimuth.⁷ We excluded clusters with conflicting cell-type markers as potential doublet populations, retaining a total of 965,363 cells and 26,862 genes for downstream analysis. The major cell compartments recovered in this analysis were similar to those reported by Parse Biosciences in their analysis, including as T-cells, B-cells, monocytes, dendritic cells, basophils, and plasmablasts. Following the SCANPY data processing, we wrote the individual components (counts matrix, cell meta-data, gene meta-data, dimensionality reductions, etc.) to disk so they could be loaded into R and assembled into a Seurat object.

We performed co-expression network analysis iteratively for the plasmablast, T-cell, B-cell, monocyte, and dendritic cell compartments using an hdWGCNA pipeline for each group (Figure S5). Metacells were constructed separately for each sample and each cell cluster with the hdWGCNA function MetacellsByGroups, aggregating 50 cells per metacell. The metacell aggregation step had a runtime of 85 min and 59 s. For each cell population, we first subset the Seurat object for the cell population of interest and then performed the standard hdWGCNA pipeline by sequentially running the following functions with default parameters: TestSoftPowers, ConstructNetwork, ModuleEigengenes, ModuleConnectivity, and RunModuleUMAP. We note that for the largest cell population (T-cells, 555,417 cells), the runtime for the network construction step was 186 s.

We tested the runtime and memory usage of the primary co-expression network analysis functions in hdWGCNA using the Velmeshev et al. 2019⁹ dataset. We selected the neuronal cell population from the dataset for network analysis, and downsampled the dataset at different sizes ranging from 1,000 to 50,000 cells to test the runtime and memory usage as a function of the number of cells in the input dataset. The following functions were tested: SetupForWGCNA, MetacellsByGroups, TestSoftPowers, ModuleEigengenes, and ModuleConnectivity. We tested ModuleEigengenes with and without Harmony correction. All of these tests were done using eight parallel threads, and hdWGCNA can be sped up further by increasing the number of parallel threads. Importantly, the number of input genes and other network analysis parameters also have an effect on runtime and memory usage.

We tested the functional coherence of hdWGCNA co-expression networks using the Extending ’Guilt-by-Association’ by Degree (EGAD)²³ algorithm. Connected genes in biological networks are potentially involved in the same processes, and EGAD evaluates this network property given a set of gene-process annotations. We performed functional coherence testing with the EGAD R package (version 1.18.0) using the six cell-type-specific co-expression networks from the Zhou et al. 2020¹¹ human PFC dataset. We downloaded a table of gene ontology associations for each gene from ensembl biomart, and formatted this table using the EGAD function make_annotations. We then ran the functional coherence test with EGAD using the function run_GBA, using the TOM as the input network, and we report the distributions of area under the receiver operating characteristic curve (AUC) values for each tested biological process in the six co-expression networks.

We used the xgboost R package²⁵ (version 1.7.3.1) to perform XGBoost regularized regression analysis to predict a given gene’s expression based on the expression of the top ten module hub genes for the module each gene was assigned to. This analysis was done using the six cell-type-specific co-expression networks described in the iterative network analysis of major cell types in the human cortex section. We performed 5-fold cross validation, and measured the performance of the model as a test set root-mean-square error (RMSE) averaged across the 5-folds. We ran XGBoost for 100 iterations for each individual test, with a maximum tree depth of 3 and regularization alpha of 0.5.

We collected the publicly available $10 \times$ Genomics Visium mouse brain dataset using the SeuratData R package. This dataset consists of an anterior and a posterior slice from a sagittal brain section, which we merged into a single Seurat object comprising 6,049 ST spots and 31,053 genes. We processed this dataset using the standard Seurat pipeline by sequentially running the following commands: NormalizeData, FindVariableFeatures, ScaleData, RunPCA, FindNeighbors, FindClusters, and RunUMAP. The top thirty PCs were used for Louvain clustering⁸⁴ and UMAP. While ST spots were clustered based on transcriptomic information alone, we were able to annotate them based on anatomical features.

Neighboring ST spots were aggregated into metaspots in the anterior and posterior slices using the hdWGCNA function MetaspotsByGroups. We retained genes expressed in 5% of spots for downstream analysis, totaling 12,355 genes. We tested for the optimal soft-power threshold $β$ based on the the fit to a scale-free topology using the hdWGCNA function TestSoftPowers. The co-expression network was constructed using all ST spots spanning both the anterior and posterior slices using the hdWGCNA function ConstructNetwork with the following parameters: networkType = “signed”, TOMType = “signed”, soft_power = 5, deepSplit = 4, detectCutHeight = 0.995, minModuleSize = 50, mergeCutHeight = 0.2. Module eigengenes and eigengene-based connectivities were computed using the ModuleEigengenes and ModuleConnectivity functions respectively. This approach identified 12 spatial co-expression modules, and we visualized the spatial distributions of these modules by plotting their MEs directly onto the biological coordinates for each spot. The co-expression network was projected into two dimensions using UMAP with the hdWGCNA function RunModuleUMAP, and we used the top five hub genes (ranked by kMEs) as the input features for UMAP. We used the R package enrichR⁸⁵ (version 3.0) to perform enrichment analysis on the top 100 genes in each module ranked by kME using the following databases: GO_Biological_Process_2021, GO_Cellular_Component_2021, GO_Molecular_Function_2021, WikiPathway_2021_Mouse, and KEGG_2021_Mouse. We assessed the overlap between genes from these spatial co-expression modules and differentially expressed genes in each cluster from a recent snRNA-seq study of the whole mouse brain using Fisher’s exact test implemented in the R package GeneOverlap (version 1.26.0). Finally, we performed a separate network analysis on a subset of the ST dataset only containing the cortical layers 2–6, and we followed an identical hdWGCNA analysis pipeline to the full ST dataset for the cortical analysis.

We performed isoform co-expression network analysis in radial glia lineage cells (radial glia, astrocytes, ependymal cells, and neural intermediate progenitor cells) from mouse hippocampus ScISOrSeq dataset from Joglekar et al.⁸ using the hdWGCNA R package. The gene-level counts matrix for this dataset was obtained from the Gene Expression Omnibus database (GEO: GSE15845), and the isoform-level counts matrix was obtained directly from the authors of the original study. We formatted this dataset as a Seurat object with an isoform-level expression assay and a gene-level expression assay. The standard Seurat processing pipeline was used on the gene-level expression assay, where we sequentially ran the functions NormalizeData, FindVariableFeatures, ScaleData, and RunPCA with default parameters. The dataset was projected into two dimensions by running UMAP on the PCA matrix with 30 components using the RunUMAP function. For all downstream purposes, the cell-type annotations from the original study were used.

Radial glia cells were selected for network analysis, and isoforms expressed in fewer than 1% of these cells were excluded, yielding a set of 2,190 cells and 10,375 isoforms from 4,770 genes. We constructed metacells separately for each cell type on the isoform-level expression assay using the hdWGCNA function MetacellsByGroups with $k = 30$ . We performed a parameter sweep for the soft-power threshold $β$ using the function TestSoftPowers. The isoform co-expression network was constructed using the ConstructNetwork function with the following parameters: networkType = “signed”, TOMType = “signed”, soft_power = 5, deepSplit = 4, detectCutHeight = 0.995, minModuleSize = 50, mergeCutHeight = 0.5. This approach identified 11 isoform co-expression modules. Isoform-level module eigenisoforms were computed using the ModuleEigengenes function, and eigenisoform-based connectivity was computed using the ModuleConnectivity function with default parameters. We computed a semi-supervised UMAP projection of the co-expression network using the hdWGCNA function RunModuleUMAP, with the module labels and the top six hub isoforms (by kMEiso) per module as the input features. We used the enrichR to identify enriched pathways in each module ranked by using the following databases: GO_Biological_Process_2021, GO_Cellular_Component_2021, GO_Molecular_Function_2021, WikiPathway_2021_Mouse, and KEGG_2021_Mouse.

To assess isoform co-expression network dynamics throughout the cellular trajectories within the radial glia lineage, we performed pseudotime analysis using Monocle 3³⁶ (version 1.0.0). We computed a UMAP of just radial glia lineage cells using the Monocle 3 function run_umap. A trajectory graph was built on this UMAP representation using the function learn_graph, and pseudotime was calculated with the function order_cells using the radial glia cells as the starting point. We split the pseudotime trajectory into three lineages based on the distinct cell fates (astrocyte, neuronal, and ependymal). We grouped cells into 50 evenly-sized bins throughout each trajectory, and we applied loess regression to the average module eigenisoform of each module in these bins to inspect the dynamics of each module throughout development. We wrote a custom script to generate a GTF of isoform models output from the ScISOrSeq pipeline. To visualize expressed isoforms, we plotted isoforms from this GTF on the UCSC genome browser as well as in Swan.³⁸

We selected inhibitory neurons from the Velmeshev et al.⁹ human autism spectrum disorder (ASD) snRNA-seq dataset for co-expression network analysis. Of the 121,451 cells in this dataset, 20,249 were labeled as inhibitory neurons based on marker gene expression profiles. We retained 11,194 genes which were expressed in at least 10% of cells from any cluster, and had non-zero variance in the inhibitory neuron population. Metacell transcriptomic profiles were constructed separately for each of the 54 samples and each cell type using the hdWGCNA function MetacellsByGroups, aggregating 50 cells into one metacell. We selected a soft-power threshold $β = 9$ based on the parameter sweep performed with the TestSoftPowers function. The co-expression network was computed with the ConstructNetwork function with the following parameters: networkType = “signed”, TOMType = “signed”, soft_power = 9, deepSplit = 4, detectCutHeight = 0.995, minModuleSize = 50, mergeCutHeight = 0.2. Module eigengenes were computed using the ModuleEigengenes function, and we applied Harmony²² to correct MEs based on sequencing batch. Eigengene-based connectivity for each gene was computed using ModuleConnectivity. The co-expression network was embedded in two dimensions using UMAP with the RunModuleUMAP function with the top five genes (ranked by kMEs) per module as the input features. Distributions of MEs were compared between ASD and control samples for each inhibitory neuron subpopulation using a two-sided Wilcoxon rank-sum test with the R function wilcox.test. We used the enrichR⁸⁵ to perform enrichment analysis on the top 100 genes in each module ranked by kME using the following databases: GO_Biological_Process_2021, GO_Cellular_Component_2021, GO_Molecular_Function_2021, WikiPathway_2021_Human, and KEGG_2021_Human. Furthermore, we computed the overlap between co-expression modules and ASD-associated genes from the SFARI Gene database using the R package GeneOverlap, which calculates the overlap between sets of genes using Fisher’s exact test.

We performed consensus co-expression network analysis of microglia in Alzheimer’s disease (AD) using three published snRNA-seq datasets.¹⁰^,¹¹^,¹² The individually processed datasets were merged into a single Seurat object comprising 189,127 nuclei, and the datasets were integrated into a common dimensionally-reduced space using PCA and Harmony.²² We retained all nuclei labeled microglia for network analysis based on expression of canonical marker genes such as CSF1R (9,904 nuclei), and genes expressed in at least 5% of microglia from any of the three studies were retained (7,900 genes). Metacells were constructed in groups of cells based on AD diagnosis status and study of origin, aggregating 25 cells per metacell. Within hdWGCNA, we used the SetMultiExpr function to create a list of expression matrices containing the selected genes and metacells for the three studies. We performed a separate parameter sweep for the three expression matrices using the hdWGCNA function TestSoftPowerConsensus, ensuring that we used an appropriate $β$ value for each dataset (Mathys et al.: $β = 6$ , Zhou et al.: $β = 8$ , Morabito & Miyoshi et al.: $β = 6$ ). The consensus co-expression network was contructed using the hdWGCNA function ConstructNetwork using the consensus = TRUE option. Individual TOMs were computed for each dataset, and they were scaled based on the 80th percentile in order to alleviate different statistical properties specific to each dataset rather than the underlying biology. A consensus TOM was computed by taking the element-wise minimum of the individual TOMs from each dataset. Therefore, large topological overlap values between two genes, which indicate a strong co-expression relationship, are supported across all three datasets in the consensus TOM. We performed hierarchical clustering on the consensus TOM, and we used the Dynamic Tree Cut algorithm³ was used to identify consensus co-expression modules based on the hierarchy. Module eigengenes were computed using the ModuleEigengenes function, and we applied Harmony²² to correct MEs based on the dataset of origin. Eigengene-based connectivity for each gene was computed using ModuleConnectivity. We visualized the network using UMAP with the top ten hub genes (ranked by kMEs) per module as the input features, annotating the hub genes and known disease-associated microglia genes.⁵⁰ We used the enrichR⁸⁵ to perform enrichment analysis on the top 100 genes in each module ranked by kME using the following databases: GO_Biological_Process_2021, GO_Cellular_Component_2021, GO_Molecular_Function_2021, WikiPathway_2021_Human, and KEGG_2021_Human.

We sought to model the transcriptional dynamics governing the shift between homeostatic and activated microglia in AD, therefore we performed pseudotime analysis using Monocle 3³⁶ to build a continuous trajectory of microglia cell states. A trajectory graph was built on the microglia UMAP using the function learn_graph, and pseudotime was calculated with the function order_cells. We oriented the start of pseudotime based on the expression of homeostatic microglia marker genes, such as P2RY12, CX3CR1, and CSF1R. We grouped cells into 50 evenly-sized bins throughout each trajectory, and we applied loess regression to the average module eigengene of each module in these bins to inspect the dynamics of each module throughout the microglia trajectory.

To link the integrated microglia snRNA-seq dataset with polygenic risk of disease for individual cells, we used the scDRS python package (version 1.0.0).⁵² This pipeline takes 1) a set of putative disease genes derived from GWAS summary statistics and 2) a scRNA-seq dataset as inputs, and outputs disease enrichment statistics for a given disease (raw and normalized disease scores, cell-level scDRS p value, and $Z$ -scores converted from the p values). GWAS summary statistics of 74 diseases and complex traits supplied by scDRS were utilized as gene sets, among which a gene set by Jansen et al.⁴⁶ provided the set of genes associated with AD. We then visualized the AD scDRS $Z$ -scores in the integrated AD microglia trajectory, and we correlated the scDRS score with the trajectory using a Pearson correlation.

We performed module preservation²⁰ analysis in a variety of external datasets from human and mouse¹¹^,¹²^,²⁸^,⁵⁶^,⁵⁷^,⁵⁸ to test for the reproducibility of the consensus AD microglia modules in the microglia population from each dataset. We used the hdWGCNA function ProjectModules to compute module eigengenes for the consensus AD microglia modules for each query dataset. The module preservation test was performed using the hdWGCNA function ModulePreservation with 100 permutations, and we reported the preservation $Z$ -summary statistics in a heatmap. For the Morabito & Miyoshi et al. snATAC-seq dataset, we used the gene activity⁵⁹ representation as a gene-level summary of chromatin accessibility in order to assess the module preservation at the epigenomic level.

We projected gene co-expression modules from two bulk RNA-seq studies of AD⁵⁶^,⁶³ into a published snRNA-seq study of AD to assess their expression patterns within various cell populations. While both of these studies used the samples from the same bulk RNA-seq cohort, the set of modules from Morabito et al. 2020⁵⁶ was based on a consensus network analysis across six brain regions while the other set of modules from the AMP-AD study⁶³ were constructed separately for seven different brain regions. Module eigengenes were computed for each of these bulk RNA-seq modules in the snRNA-seq dataset using the hdWGCNA function ProjectModules, using Harmony to correct MEs based on sequencing batch. We visualized the MEs of the projected modules in the snRNA-seq dataset using the Seurat function DotPlot.

This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol