Multiomics data integration: MOFA

Kami Pekayvaz; Corinna Losert; Viktoria Knottenberg; Christoph Gold; Irene V van Blokland; Roy Oelen; Hilde E Groot; Jan Walter Benjamins; Sophia Brambs; Rainer Kaiser; Adrian Gottschlich; Gordon Victor Hoffmann; Luke Eivers; Alejandro Martinez-Navarro; Nils Bruns; Susanne Stiller; Sezer Akgöl; Keyang Yue; Vivien Polewka; Raphael Escaig; Markus Joppich; Aleksandar Janjic; Oliver Popp; Sebastian Kobold; Tobias Petzold

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

Multiomics data integration: MOFA

KP Kami Pekayvaz

CL Corinna Losert

VK Viktoria Knottenberg

CG Christoph Gold

IB Irene V van Blokland

RO Roy Oelen

HG Hilde E Groot

JB Jan Walter Benjamins

SB Sophia Brambs

RK Rainer Kaiser

AG Adrian Gottschlich

GH Gordon Victor Hoffmann

LE Luke Eivers

AM Alejandro Martinez-Navarro

NB Nils Bruns

SS Susanne Stiller

SA Sezer Akgöl

KY Keyang Yue

VP Vivien Polewka

RE Raphael Escaig

MJ Markus Joppich

AJ Aleksandar Janjic

OP Oliver Popp

SK Sebastian Kobold

TP Tobias Petzold

This method is extracted from research article: Nat Med, May 2024

Multiomic analyses uncover immunological signatures in acute and chronic coronary syndromes

DOI: 10.1038/s41591-024-02953-4

Request a Protocol

Ask a question

Favorite

For the integrated and combined analysis of all the different data sources (single-cell data, cytokines, neutrophils, plasma proteomics and clinical values), we applied several preprocessing and normalization steps separately on the features of the different types of data to make them comparable and adjust the distributions.

We applied the pseudo-bulk approach to summarize single-cell data on the level of cell-type (cluster) specific gene expression per sample because all other omics data were measured on the bulk level. To this end, we calculated for each of the identified 18 cell-type clusters for each sample (=patient and timepoint) the mean counts across all cells. Afterward, we adjusted the gene counts of each sample in each cluster with a scaling factor so that each sample has the same amount of counts across all genes to account for technical differences in sequencing depth between the samples.

To ensure that we only consider reliably expressed genes, we applied additional filtering steps on the genes and clusters:

We excluded clusters 14–18, which had only very low numbers of cells per patient and timepoint (mostly less than 10 cells).

We filtered out genes based on the total number and percentage of cells that expressed those genes in the corresponding cluster, keeping only genes that fulfill one of the criteria below:

Percentage of cells expressing gene > 50 ∩ total number of cells expressing gene > 1,200

Percentage of cells expressing gene > 40 ∩ total number of cells expressing gene > 3,000

The thresholds were chosen considering that genes should be detectable in a high number of cells and in several samples, but at the same time, a considerable number of genes for each cluster should be kept.

After filtering, we log transformed the count values and applied quantile normalization as further normalization steps to align the distributions of gene expression levels between the samples.

This resulted in 11,831 features (which correspond to genes) across all the different clusters (ranging from 315 for cluster 13 and 2,159 for cluster 4), which we used as input for the different cell-type cluster dimensions from the single-cell data for the MOFA (Supplementary Fig. 3a).

To prepare the cytokine data for integration with the other datasets we set ‘OOR’ values to 0 and log transformed the values to adjust the distributions after adding a pseudo-count of 1 to all values. Furthermore, we excluded cytokines which have valid measured values in less than 20% of the samples. In total this resulted in 65 different cytokines which have been used as input features for the integrated analysis (Supplementary Fig. 3a).

As input features from the neutrophil dimension, we took the umi exon counts from the prime-seq and applied the processing steps below to align the reads with the scRNA-seq data, adjust for potential technical effects and strictly remove samples and genes with low-quality reads.

In a first step, we adjusted the gene names and mapped them from ‘ENSEMBL’ gene ids to ‘SYMBOL’ gene ids. Then, we filtered out ribosomal and mitochondrial genes as we also excluded them in the scRNA-seq data preprocessing, and they are not relevant for the analysis. Furthermore, we excluded genes that are not expressed in at least 80% of the samples and removed samples that do not have reads in at least 90% of the remaining genes. In the next step, we adjusted for differences in sequencing depth between the samples and normalized the counts with a scaling factor so that the sums of reads for each sample across all genes are the same. Then, we logarithmized the resulting counts. Finally, we decided to keep only highly variable genes, so we removed all genes whose variance lies below the 25% quantile of the variance distribution. This resulted in a total of 892 genes measured on 92 samples that are considered as input features for the neutrophil dimension. As for the scRNA-seq data, we applied quantile normalization to the counts in a final normalization step (Supplementary Fig. 3a).

For proteomics, we used the same preprocessing and normalization steps as described in the previous ‘Plasma proteome analysis’ section and took the resulting normalized and median-centered intensities measured for 490 different proteins as input features of this dimension (Supplementary Fig. 3a).

As input features of the clinical data dimension, we used the measured CK, CRP, CK-MB and troponin values and log transformed them (Supplementary Fig. 3a).

After these individual preprocessing steps, we had in total 13,382 features across 18 different dimensions (referred to as views throughout the paper) resulting from (1) clinical values, (2) cell-type cluster 1–14 of the scRNA-seq data, (3) cytokines, (4) plasma proteomics and (5) neutrophils (Supplementary Fig. 3a). We applied feature-wise quantile normalization onto the quantiles of the standard normal distribution for all data types.

Then we trained the MOFA model using the R/Bioconductor package MOFA2 (version 1.2.2) with maxiter parameter 50,000 to ensure convergence and 20 factors and the remaining default parameters. The number of estimated factors was chosen to balance the trade-off between explained variance and low number of factors. Also, we tested the influence of the specified number of factors on the model results by running alternative MOFA models with 5, 10, 15 and 25 factors and comparing them with the 20-factor model. We found that especially the first five inferred factors are not substantially affected by the choice of the number of factors (Supplementary Figs. 3a and 14d and Supplementary Table 16).

To evaluate the effect of the clinical features, we trained a second MOFA model excluding the 4 clinical features with in total 13,378 features and compared the resulting factor values and feature weights to the original model (Supplementary Figs. 7 and 8a,b).

On the feature weight matrix resulting from our trained MOFA model, we conducted pathway enrichment analysis for the first five inferred factors of the MOFA model using the gene set annotations from the REACTOME¹⁹ and KEGG⁶⁵ databases. We tested all pathways belonging to the ‘Immune System’ category in REACTOME (n = 191) and pathways that are classified as ‘Immune system’ or ‘Signal transduction’ pathways in KEGG (n = 52).

To test the enrichment of the pathways across all our data input views, we generated an extended pathway gene annotation set for those pathways in which a feature (consisting of data dimension, and gene and protein code) was considered to belong to the pathway if the gene and protein code maps to the genes annotated to the pathway. Features included in the MOFA but not within the pathway were taken as background set. To map the gene and protein codes to the gene-set annotations in KEGG and REACTOME (reacome.db, version 1.76.0; ReactomePA version 1.36.0), we used the bitr function from the clusterProfiler package (version 4.0.5) to convert them to ENTREZID.

We removed all the pathways for which we had included less than 20% of the total amount of genes annotated to the pathway in our feature set and ran the enrichment analysis using the ‘run_enrichment’ method implemented in the R/Bioconductor package MOFA2 (version 1.2.2) with set.statistic parameter ‘rank.sum’ and default parameters otherwise. We ran the enrichment separately for features with only positive or negative weights and jointly across all features.

Pathways with an adjusted P < 0.05 (Benjamini–Hochberg adjustment) have been considered to be significantly enriched.

To analyze the potential axes of cell–cell communication between different cell types, we used the previous knowledge about potential ligand–receptor–target interactions of the NicheNet²¹ resource collected in the nichenetr package (version 1.1.0) and loaded the provided ligand–receptor network and ligand–target matrix⁶⁶. Based on the classifications in those networks, we identified ligands, receptors and potential targets among the 13,382 features included in our integrated dataset resulting from the ‘data harmonization and integration step’. We calculated Spearman correlation between all identified ligand–target pairs within this dataset.

For the further analysis of the calculated ligand–target correlations in combination with the corresponding regulatory potential score, we only considered ligand–target pairs:

Between the ligands and targets of different cell types (for example, between monocytes and T cells) and different views (for example, between cytokines and the different cell-type clusters)

Subsequently, we focused on pairs with high correlation and regulatory potential scores where the target gene has a high feature weight on the analyzed MOFA factor.

For the lagged analysis of ligand–receptor target gene interactions, we applied the same analysis with the modification that we calculated Spearman correlation between the identified ligand–target pairs based on mapping lagged ligand expression to the corresponding target gene expression for ACS samples. This resulted in the following mapping across timepoints for each patient that was then used for the calculation of Spearman correlation and further processing as explained above:

To evaluate the prediction potential of our MOFA factors to distinguish our different patient groups, we calculated ROC curves contrasting the prediction power of the inferred factors to established clinical markers.

For Factor 4 predictions, we only considered factor values from samples measured at TP1M that could be classified to have a ‘good’ or ‘poor’ outcome. We compared the prediction potential of the factor values to the value of the clinical markers (CK, troponin, CRP) for those samples at TP1M. For the benchmarking against the prediction power across the complete time course of the clinical values, we took the maximum and mean values of those markers across all measured timepoints. In both cases, we scaled the clinical values as well as the factor values to be in a range between 0 and 1 and used them as input for the ROC curve calculation giving the probability of a sample being classified as ‘good’ versus ‘poor’ outcome.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol