Downstream computational analysis

T. Lohoff; S. Ghazanfar; A. Missarova; N. Koulena; N. Pierson; J. A. Griffiths; E. S. Bardot; C.-H. L. Eng; R. C. V. Tyser; R. Argelaguet; C. Guibentif; S. Srinivas; J. Briscoe; B. D. Simons; A.-K. Hadjantonakis; B. Göttgens; W. Reik; J. Nichols; L. Cai; J. C. Marioni

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

Downstream computational analysis

TL T. Lohoff

SG S. Ghazanfar

AM A. Missarova

NK N. Koulena

NP N. Pierson

JG J. A. Griffiths

EB E. S. Bardot

CE C.-H. L. Eng

RT R. C. V. Tyser

RA R. Argelaguet

CG C. Guibentif

SS S. Srinivas

JB J. Briscoe

BS B. D. Simons

AH A.-K. Hadjantonakis

BG B. Göttgens

WR W. Reik

JN J. Nichols

LC L. Cai

JM J. C. Marioni

This method is extracted from research article: Nat Biotechnol, Sep 2021

Integration of spatial and single-cell transcriptomic data elucidates mouse organogenesis

DOI: 10.1038/s41587-021-01006-2

Request a Protocol

Ask a question

Favorite

To lower the chance of counting cells multiple times in contiguous z slices, we selected two z slices (denoted 1 and 2 hereafter) for further analysis corresponding to two parallel tissue layers 12 µm apart. We then removed segmented regions most likely to correspond to empty space rather than cell-containing regions by testing whether a putative cell’s square root-transformed segmented area was larger than expected (Z test; FDR threshold of 0.01). Of the remaining segmented regions, we considered segments containing at least 10 detected mRNA molecules corresponding to at least five unique genes as true cells.

To construct a cell neighborhood network, for each cell within a given embryo and z slice, we extracted the polygon representation of the cell’s segmentation corresponding to a set of vertex coordinates. We then calculated an expanded segmentation by constructing a new polygon where each expanded vertex was lengthened along the line containing the original vertex and the center of the polygon. We performed a multiplicative expansion of 1.3 for each vertex. To construct the cell neighborhood network, we then identified the other cells in which segmentation vertices were found to be within the expanded polygon. Cell neighborhood networks were considered separately for each embryo and z slice combination.

We calculated normalized expression log counts for each cell using scran’s logNormCounts function¹⁰⁸, with size factors corresponding to the total number of mRNAs (excluding the sex-specific gene Xist) identified for each cell. Size factors were scaled to unity, and a pseudocount of 1 was added before the log counts were extracted. For the majority of downstream analyses, such as differential gene expression, we specifically included biological and technical variables (that is, z slice and FOV) as covariates. However, for the task of harmoniously visualizing gene expression in spatial coordinates, we extracted ‘batch-corrected expression’ values for each gene. This was done by first performing batch correction using the MNN method, implemented with fastMNN in the scran package¹⁰⁸, with batch variables corresponding to z slice and FOV. For interpretable visualization, for each gene, we extracted the reconstructed expression values following batch correction and rescaled these to correspond to the distribution of expression values before batch correction.

To identify unsupervised clusters, we first performed multibatch-aware principal component analysis (PCA) on the normalized log counts using the multiBatchPCA function in scran¹⁰⁸, with z slice and FOV as batch variables using all genes except Xist as input to extract 50 PCs. We then performed batch correction using the MNN approach, resulting in a corrected reduced dimension embedding of cells. To identify clusters, we estimated a shared nearest neighbor network, followed by Louvain network clustering. To further extract unsupervised subclusters, for each set of cells belonging to a given cluster, we performed highly variable gene selection to select genes with a non-zero estimated biological variance, excluding the sex-specific gene Xist. Using these selected genes, we performed batch-aware PCA to extract 50 PCs, followed by batch correction, shared nearest neighbor network construction and Louvain clustering similar to what was performed for all cells.

We downloaded the E8.5 Pijuan-Sala et al.⁶ 10x Genomics scRNA-seq dataset from the Bioconductor package MouseGastrulationData and performed batch-aware normalization using the multiBatchNorm function in the scran package¹⁰⁸ before extracting cells that correspond to a known cell type with at least 25 cells. Cell types associated with the somitic and paraxial mesoderm were further refined using labels assigned by Guibentif et al.¹¹⁸ (personal communication); blood subtypes (erythroid 1, erythroid 2 and erythroid 3 and blood progenitors 1 and 2) were collapsed to the two major groups, ExE mesoderm was renamed to lateral plate mesoderm and pharyngeal mesoderm was renamed to splanchnic mesoderm. Subsequently, only genes probed by both the scRNA-seq and seqFISH assays were kept for this analysis. We then jointly embedded the normalized log counts of each of the two datasets by performing batch-aware PCA with 50 components (excluding the sex-specific gene Xist) via the multiBatchPCA function in scran, with batch variables corresponding to sequencing runs in the Gastrulation atlas and FOV and z slice for the seqFISH data. We corrected for platform- and batch-specific effects using the MNN method (fastMNN⁵⁵), ensuring that merge ordering is such that Gastrulation atlas batches are merged first (ordered by decreasing number of cells). This joint embedding of the Gastrulation atlas and seqFISH dataset was further reduced in dimension using UMAP, implemented by calculate UMAP in scran¹⁰⁸, to allow the data to be visualized in two dimensions.

To assign a cell-type label to each seqFISH cell, we considered the Gastrulation atlas cells that it was closest to in the batch-corrected space. We considered the k-nearest cells, with the distance from the seqFISH cell to its Gastrulation atlas neighbors being computed as the Euclidean distance among all of the batch-corrected PC coordinates. We set the number of nearest neighbors, k, to 25. Ties were broken by favoring the cell type of those closest in distance to the query cell. We calculated a ‘mapping score’ for each query cell as the proportion of the majority cell type present among the 25 closest cells.

To further refine the predicted cell types, we performed joint clustering of the Gastrulation atlas and seqFISH cells by building a shared nearest neighbor network on the joint PCs followed by Louvain network clustering. Additionally, we further subclustered the output by building a shared nearest neighbor network on the cells corresponding to each cluster followed by Louvain network clustering. We then inspected the relative contribution of cells to each subcluster as well as the expression of marker genes to identify subclusters that potentially required manual reannotation, either due to small differences in composition in the reference atlas or in the gene expression profile (Extended Data Fig. Fig.3).3). We also identified a set of subclusters that were likely associated with low-quality cells, defined by lower total mRNA counts. Furthermore, we performed virtual dissection on regions corresponding anatomically to the developing gut tube and for these cells reclassified those that were ‘Surface ectoderm’ as ‘Gut tube’.

For the specific task of recovering cell-type identity, we investigated whether fewer genes would be sufficient. To do this, we randomly selected subsets of genes from the 351 gene set, corresponding to approximately 10, 20, 30, …, 90% of the genes, repeated five times for each subset. Because there is a lack of ground truth of the cell-type labels for the seqFISH data, we assessed the cell-type classification accuracy relative to the full probe set, that is, we made the assumption that the classified cell type using the 351 genes is the true label, thus measuring the degree of loss of accuracy from this labeling. While ground truth labels are available for the Gastrulation atlas dataset, for consistency we calculated the relative accuracy following resubstitution classification for these cells by also treating the classified cell type using the 351 genes as the true label.

Any difference in cell-type recovery accuracy between the seqFISH and Gastrulation atlas data could be attributed to the experimental strategy (scRNA-seq versus seqFISH) or to the fact that the Gastrulation atlas data was initially mined for these informative genes, and, as a result, the resubstitution classification accuracy may be inflated for these cells. Thus, we extracted the host WT cells of the E8.5 WT/WT control chimera from Pijuan-Sala et al.⁶, which served as an independent validation set, representing a dataset that was not mined for informative genes but also corresponds to the same experimental strategy as the Gastrulation atlas (scRNA-seq).

We performed joint integration of these three datasets using the randomly selected gene subsets and calculated the relative cell-type classification accuracy compared to the full gene set for each dataset.

To analyze the mixed mesenchymal mesoderm population, we performed highly variable gene selection for these cells only using the ‘modelGeneVar’ function in scran¹⁰⁸ and performed PCA (excluding the sex-specific gene Xist) on the normalized log counts followed by batch correction using MNN, with embryo and z slice as batch variables. We then further reduced these corrected PCs into two dimensions using UMAP for visualization purposes. To identify mixed mesenchymal mesoderm subclusters, we estimated a shared nearest neighbor network, followed by Louvain network clustering. We then performed differential expression analysis on the seqFISH genes and on the imputed gene expression values (described further below) using the ‘findMarkers’ function in scran¹⁰⁸ and Gene Ontology enrichment analysis as described below. To further identify the spatial context for the mixed mesenchymal mesoderm, for each cluster, we extracted the cells that appear as direct contact neighbors with any cell belonging to the cluster and recorded their corresponding cell type. To assess the relative association of each mixed mesenchymal mesoderm subcluster to the Gastrulation atlas⁶, we calculated a weighted score per Gastrulation atlas cell and mixed mesenchymal mesoderm subcluster, corresponding to the average ranking of the Gastrulation atlas cell among the top 25 nearest neighbors for each mixed mesenchymal mesoderm subcluster cell.

We identified genes with a spatially heterogeneous pattern of expression using a linear model with observations corresponding to each cell for a given cell type and with outcome corresponding to the gene of interest’s expression value. For each gene, we fit a linear model including the embryo and z slice information as covariates as well as an additional covariate corresponding to the weighted mean of all other cells’ gene expression values. The weight was computed as the inverse of the cell–cell distance in the cell–cell neighborhood network.

More formally, let x_ij be the expression of gene i for cell j. Calculate $x_{i j}^{*}$ as the weighted average of other K cells’ expression weighted by distance in the neighborhood network

where

is the path length in the neighborhood network between vertices corresponding to cells j and k. We then fit the linear model for each gene

Here, e and z correspond to the embryo and z slice identity of the cells, respectively, and ε represents random normally distributed noise. The t-statistic corresponding to β₁ is reported here as a measure of spatial heterogeneity for the given gene, a large value corresponding to a strong spatial expression pattern of the gene in space, especially among its neighbors.

To further subcluster the developing brain cells, we extracted the Gastrulation atlas cells corresponding to E8.5 that were classified as forebrain/midbrain/hindbrain. For these cells, we identified genes to further cluster by using the scran function modelGeneVar¹⁰⁸ to identify highly variable genes with non-zero biological variability, excluding the sex-specific gene Xist. For these genes, we extracted the cosine-standardized log counts, which were standardized against the entire transcriptome. We then performed batch correction using the MNN method on batch-aware PC coordinates, where batches corresponded to the sequencing samples. Using this batch-corrected embedding, we estimated a shared nearest neighborhood network and performed Louvain network clustering. To relate these brain subcluster labels to the seqFISH data, we extracted the nearest neighbor information (as described in Cell type identification) for seqFISH cells corresponding to forebrain/midbrain/hindbrain and classified their brain subcluster label using k-nearest neighbors with k = 25 and closest cells breaking ties. We then named these subclusters based on marker gene expression, including a class that may be technically driven (NA class).

We constructed cell–cell contact maps for multiple cell annotation labelings, including mapped cell types, subclusters within each cell type and mapped gut tube subtypes. To do this, for each embryo and z slice combination, we extracted the cell neighborhood network and cell-level annotation. We then generated cell–cell contact maps by first calculating the number of edges for which a particular pair of annotated groups was observed. We then randomly reassigned (500 times) the annotation by sampling without replacement and calculated the number of edges for all pairs of annotated groups. To construct the cell–cell contact map, we reported the proportion of times the randomly reassigned number of edges was larger than or equal to the observed number of edges. Small values correspond to the pair of annotation groups being more segregated, and large values correspond to them being more integrated in physical space than a random allocation. To combine these cell–cell contact maps for each embryo and z slice combination, we further calculated the element-wise mean for each pair of cell labels. We visualized this in a heat map, ordering the annotation groups using hierarchical clustering with Euclidean distance and complete linkage. In the case of the gut tube subtypes, we ordered these classes by the anterior–posterior ordering given by Nowotschin et al.². In the brain subtypes, we ordered these classes by their approximate anatomical location, from the forebrain to the hindbrain region.

To functionally annotate sets of gene clusters, we performed gene set enrichment analysis using mouse Gene Ontology terms with between 10 and 500 genes appearing in each dataset and at least 1 gene appearing from the testing scaffold¹¹⁹ using Fisher’s exact test to test for overrepresentation of genes and using all scHOT-tested genes as the gene universe. An FDR-adjusted P < 0.05 was considered to be statistically significant.

Below we outline the different elements of our strategy for imputing the spatially resolved expression of genes not profiled using seqFISH.

First, for each gene in the seqFISH library (excluding the sex-specific gene Xist), we performed an intermediate mapping to align each seqFISH cell with the most similar set of cells in the scRNA-seq dataset. To perform the mapping we excluded the gene of interest and used the remaining 349 genes (351 seqFISH genes – Xist – gene of interest), resulting in 350 intermediate mappings overall. The leave-one-gene-out mapping approach was used to assess whether the intermediate mapping strategy outlined below could be used to estimate the expression counts of the omitted gene.

Similar to the integration strategy described earlier for assigning cell-type labels, for each embryo and z slice, we concatenated the cosine-normalized seqFISH counts with the cosine-normalized expression values from the Gastrulation atlas scRNA-seq data⁶. We performed dimensionality reduction using ‘multibatchPCA’ (using 50 PCs) and performed batch correction using the ‘reducedMNN’ function implemented in scran¹⁰⁸. Next, for each cell in the seqFISH dataset that was assigned a cell-type identity in the earlier integration, we used the ‘queryKNN’ function in BiocNeighbors to identify its 25 nearest neighbors in the scRNA-seq data. Finally, for each seqFISH cell, the expression count of the gene of interest is estimated as the average expression of the corresponding gene across the cell’s 25 nearest neighbors.

For each mapped gene, its performance score is calculated as the Pearson correlation (across cells) between its imputed values and its measured seqFISH expression level. To estimate an upper bound on the performance score (that is, the maximum correlation we might expect to observe), we took advantage of the four independent batches of E8.5 cells that were processed in the scRNA-seq Gastrulation atlas. In particular, we treated one of the four batches as the query set and used the leave-one-out approach described above to impute the expression of genes of interest by mapping cells onto a reference composed of the remaining three batches. Additionally, to mimic the seqFISH imputation, we considered a subset of the Gastrulation atlas data consisting of only those genes that were probed in the seqFISH experiment. Moreover, due to the experimental procedure, some cell types present in the Gastrulation atlas (for example, extraembryonic cell types) were not probed in the seqFISH experiment. Accordingly, we considered only the subset of scRNA-seq profiled cells that were among the nearest neighbors of a seqFISH-mapped cells so that this subset most faithfully resembled the seqFISH data.

Subsequently, for each mapped gene, we computed its prediction score as the weighted Pearson correlation between its imputed expression level and its true expression level. The weights were proportional to the number of times each Gastrulation atlas cell was present among the neighbors of a seqFISH cell across all profiled genes.

Finally, for each gene probed in the seqFISH experiment, we define its normalized imputation performance score as the ratio of its performance score over its prediction score.

To perform imputation for all genes, we aggregated across the 350 intermediate mappings generated from each gene probed using seqFISH. Specifically, for each seqFISH cell, we considered the set of all Gastrulation atlas cells that were associated with it in any intermediate mapping. Subsequently, for every cell, we calculated each gene’s imputed expression level as the weighted average of the gene’s expression across the associated set of Gastrulation atlas cells, where weights were proportional to the number of times each Gastrulation atlas cell was present. Thus, the imputed expression profiles for all genes, including those in the overlapping gene set, are on the same scale as the scRNA-seq log count data.

To identify the MHB, we visually assessed the expression of the well-described mesencephalon and prosencephalon marker Otx2 and the rhombencephalon marker Gbx2 (Supplementary Fig. 13). We manually selected the physical region where both genes are expressed and defined this as the FOV (black rectangle, Supplementary Fig. 13). Subsequently, within the selected region, we performed a virtual dissection by manually choosing the boundary that best discriminates the expression of Otx2 and Gbx2 (Supplementary Fig. 13), and, based on the boundary, we assigned cells either a midbrain or hindbrain identity.

Differential expression analysis was performed between midbrain- and hindbrain-assigned cells using the scran function ‘findMarkers’ (with an LFC threshold of 0.2 and an FDR-adjusted P value threshold of 0.05; Supplementary Table 7).

To perform diffusion analysis of the MHB region, we performed batch correction of the FOVs and z slice using the MNN approach, with log counts of all genes excluding the sex-specific gene Xist as input. We then used the diffusion pseudotime (DPT) method implemented in the R package destiny⁷⁸ to build a diffusion map with 20 DCs using the cell with maximum value in DC1 as the root cell for DPT estimation. To visualize the DCs in space, we added an estimated vector field to the segmented spatial graphs, with arrow sizes corresponding to the magnitude of change among nearby cells and directions corresponding to the direction with the largest change in the diffusion component. We then identified imputed genes strongly correlated with DPT (absolute Spearman correlation of >0.5) among either midbrain or hindbrain region cells. For smooth expression estimation along the DPT, we split cells into either midbrain or hindbrain regions and extracted fitted values from local regression (loess) for each gene with DPT ranking as the explanatory variable. To further identify genes associated with spatial variation in expression, we performed scHOT⁸¹ analysis using weighted mean as the underlying higher-order function, with a weighting span of 0.1 on spatial coordinates and using the imputed gene expression values. We then identified the 500 top-ranked significantly spatially variable genes (ensuring also that the FDR-adjusted P value was <0.05), clustered their smoothed expression using hierarchical clustering (Supplementary Table 8) and selected the number of clusters using dynamicTreeCut¹²⁰. To visualize spatial expression profiles of clusters, we calculated the mean inferred gene expression value for the genes associated with each cluster.

We downloaded the Nowotschin et al. 10x Genomics scRNA-seq counts and associated annotations from the corresponding Shiny web application (https://endoderm-explorer.com/)². We then subset down to E8.75 cells, considering each 10x Genomics sequencing library as a batch variable. We performed highly variable gene (HVG) selection using ‘modelGeneVar’ from the scran package¹⁰⁸ using the library sample as the blocking variable. We then selected the intersection of these HVGs and the genes appearing in the seqFISH dataset for further analysis. We concatenated the normalized log counts for the Nowotschin et al. and seqFISH datasets and performed dimensionality reduction to 50 PCs using ‘multiBatchNorm’ as implemented in scran¹⁰⁸. We then performed batch correction using the MNN approach, where the merge order was fixed to first integrate batches from the Nowotschin et al. dataset (ordered by decreasing cell number). We then identified the 10 nearest neighbors of the seqFISH cells to the Nowotschin et al. cells in the corrected reduced dimensional space. Using these nearest neighbors, we classified seqFISH gut tube cells to a cell type defined by Nowotschin et al. A ‘mapping score’ was computed for each cell as the proportion of the nearest neighbors in the Nowotschin et al. data corresponding to the selected class. We performed differential gene expression analysis between the lung 1 and lung 2 groups using ‘findMarkers’ in scran¹⁰⁸ and also performed differential gene expression analysis between the associated mesodermal cells at most three steps away from the lung 1 or lung 2 cells in the cell–cell neighborhood network.

To calculate the relative position of developing gut tube cells along the anterior–posterior axis, for each embryo, we performed a virtual dissection to visually identify the dorsal and ventral regions of the gut tube. Then, for each embryo and each dorsal or ventral tissue region, we fit a single principal curve model using the R package princurve¹²¹, with explanatory variables corresponding to the physical coordinates. We then extracted anterior–posterior cell rankings by taking the rank of the fitted arc length from the beginning of the curve, ensuring that the curve always began at the anterior-most position.

To further understand the relationship between the endodermal and mesodermal layers in the gut tube, we performed a joint analysis between the Nowotschin et al. data (described above) as well as the E8.5 splanchnic mesoderm cells from Han et al.³. For the Han et al. data, we performed HVG selection using ‘modelGeneVar’ from the scran package¹⁰⁸ using the library sample as the blocking variable and then selected the genes that appeared in either the HVG list for Nowotschin et al. or Han et al. and genes that were also present in the seqFISH gene library. We then concatenated the normalized log counts of all three datasets and performed integration (dimensionality reduction, batch correction, further dimensionality reduction for visualization) and cell classification as described above. Thus, for each seqFISH cell, we obtained a classified cell class according to the labels provided by Han et al., including mesodermal subtypes in the splanchnic mesoderm. To further investigate the surrounding mesodermal cells of the gut tube, we used the cell–cell neighborhood network to identify mesodermal cells at most three steps away from a gut tube cell and, for each of these cells, we identified their position as either dorsal or ventral to the gut tube and calculated the mean position along the anterior–posterior axis.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol