The SAMap algorithm

Alexander J Tarashansky; Jacob M Musser; Margarita Khariton; Pengyang Li; Detlev Arendt; Stephen R Quake; Bo Wang

Concise Method

This method is extracted from research article: eLife, May 2021

Mapping single-cell atlases throughout Metazoa unravels cell type evolution

DOI: 10.7554/eLife.66747

The SAMap algorithm contains three major steps: preprocessing, mutual nearest neighborhood alignment, and gene-gene correlation initialization. The latter two are repeated for three iterations, by default, to balance alignment performance and computational runtime.

We first construct a gene-gene bipartite graph between two species by performing reciprocal BLAST of their respective transcriptomes using tblastx, or proteomes using blastp. When one species has a proteome and the other a transcriptome, tblastn and blastx are used instead. When a pair of genes shares multiple high-scoring segment pairs (HSPs), which are local regions of matching sequences, we use the HSP with the highest bit score to measure homology. Only pairs with E-value <10⁻⁶ are included in the graph.
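The HSP selection rule can be sketched as follows. This is an illustration only: the tuple layout and the function name `best_hits` are hypothetical, and it assumes the E-value cutoff is applied to each HSP before the highest bit score is chosen, which the text does not state explicitly.

```python
def best_hits(hsps, max_evalue=1e-6):
    """Keep, for each (query, subject) gene pair, the highest bit score
    among HSPs whose E-value is below max_evalue.

    hsps: iterable of (query_gene, subject_gene, bit_score, evalue) tuples
          (a hypothetical layout for parsed BLAST results).
    """
    best = {}
    for query, subject, bit_score, evalue in hsps:
        if evalue >= max_evalue:
            continue  # too weak to count as homology evidence
        key = (query, subject)
        if bit_score > best.get(key, 0.0):
            best[key] = bit_score
    return best


hsps = [
    ("g1", "h1", 50.0, 1e-10),
    ("g1", "h1", 120.0, 1e-30),  # stronger HSP for the same gene pair
    ("g1", "h2", 40.0, 1e-3),    # fails the E-value cutoff
]
edges = best_hits(hsps)
```

Running this once per BLAST direction would give the two sets of directed scores that are later combined into the homology graph.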

Although we define similarity using BLAST, SAMap is compatible with other protein homology detection methods (e.g. HMMER [Eddy, 2008]) or orthology inference tools (e.g. OrthoClust [Yan et al., 2014] and eggNOG [Huerta-Cepas et al., 2019]). While each of these methods has known strengths and limitations, BLAST is chosen for its broad usage, technical convenience, and compatibility with low-quality transcriptomes.

We encode the BLAST results into two triangular adjacency matrices, $A$ and $B$, each containing bit scores in one BLAST direction. We combine $A$ and $B$ to form a gene-gene adjacency matrix $G$. After symmetrizing $G$, we remove edges that only appear in one direction: $G = \mathrm{Recip}\left(\frac{1}{2}\left[(A + B) + (A + B)^{T}\right]\right) \in R^{(m_{1} + m_{2}) \times (m_{1} + m_{2})}$, where $\mathrm{Recip}$ only keeps reciprocal edges, and $m_{1}$ and $m_{2}$ are the numbers of genes in the two species, respectively. To filter out relatively weak homologies, we also remove edges where $G_{ab} < 0.25 \max_{b}(G_{ab})$. Edge weights are then normalized by the maximum edge weight for each gene and transformed by a hyperbolic tangent function to increase discriminatory power between low and high edge weights, $\hat{G}_{ab} = 0.5 + 0.5 \tanh\left(10\, G_{ab} / \max_{b}(G_{ab}) - 5\right)$.
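A minimal sketch of the three operations above (reciprocity filter, 25% pruning, tanh rescaling) on small dense matrices stored as lists of lists. The function names are hypothetical, and the reading of $\mathrm{Recip}$ as "keep an edge only if both directions have a nonzero score" is an assumption.

```python
import math


def recip_symmetrize(A, B):
    """G = Recip(0.5 * [(A + B) + (A + B)^T]) for square matrices A and B."""
    n = len(A)
    S = [[A[i][j] + B[i][j] for j in range(n)] for i in range(n)]
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Assumption: an edge is reciprocal when both directions scored a hit.
            if S[i][j] > 0 and S[j][i] > 0:
                G[i][j] = 0.5 * (S[i][j] + S[j][i])
    return G


def prune_and_rescale(G, frac=0.25):
    """Drop edges below frac * (row maximum), then apply the tanh rescaling."""
    n = len(G)
    out = [[0.0] * n for _ in range(n)]
    for a in range(n):
        row_max = max(G[a])
        if row_max <= 0:
            continue  # gene without any homology edge
        for b in range(n):
            w = G[a][b]
            if w > 0 and w >= frac * row_max:
                out[a][b] = 0.5 + 0.5 * math.tanh(10 * w / row_max - 5)
    return out


# Genes 0 and 1 belong to species 1, genes 2 and 3 to species 2.
A = [[0, 0, 100, 20], [0, 0, 0, 50], [0, 0, 0, 0], [0, 0, 0, 0]]
B = [[0, 0, 0, 0], [0, 0, 0, 0], [80, 0, 0, 0], [0, 50, 0, 0]]
G = recip_symmetrize(A, B)      # the one-directional hit 0 -> 3 is removed
G_hat = prune_and_rescale(G)
```

Because the pruning threshold is relative to each gene's own maximum, an edge can survive in one row and be dropped in the other, so the rescaled matrix is not necessarily symmetric.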

The single-cell RNAseq datasets are normalized such that each cell has a total number of raw counts equal to the median size of single-cell libraries. Gene expression values are then log-normalized with the addition of a pseudocount of 1. Genes expressed (i.e. $\log_{2}(D + 1) > 1$) in greater than 96% of cells are filtered out. SAM is run using the following parameters: preprocessing = ‘StandardScaler’, weight_PCs = False, k = 20, and npcs = 150. A detailed description of the parameters was provided previously (Tarashansky et al., 2019). SAM outputs $N_{1}$ and $N_{2}$, which are directed adjacency matrices that encode k-nearest neighbor graphs for the two datasets, respectively.
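The normalization and gene filter can be illustrated on a toy count matrix (rows are cells, columns are genes). This is a sketch rather than SAM's own preprocessing code; the function names are hypothetical.

```python
import math
from statistics import median


def normalize_counts(counts):
    """Scale each cell to the median library size, then apply log2(x + 1)."""
    sizes = [sum(row) for row in counts]
    target = median(sizes)
    return [
        [math.log2(x * target / size + 1) if size > 0 else 0.0 for x in row]
        for row, size in zip(counts, sizes)
    ]


def kept_genes(logged, max_frac=0.96, threshold=1.0):
    """Indices of genes expressed (value > threshold) in at most max_frac of cells."""
    n_cells = len(logged)
    keep = []
    for g in range(len(logged[0])):
        frac = sum(1 for row in logged if row[g] > threshold) / n_cells
        if frac <= max_frac:
            keep.append(g)
    return keep


counts = [[10, 0, 5], [20, 10, 0], [30, 0, 0]]
logged = normalize_counts(counts)
keep = kept_genes(logged)  # gene 0 is expressed in every cell and is dropped
```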

SAM only includes the top 3000 genes ranked by SAM weights and the first 150 principal components (PCs) in the default mode to reduce computational complexity. However, downstream mapping requires PC loadings for all genes. Thus, in the final iteration of SAM, we run PCA on all genes and take the top 300 PCs. This step generates a loading matrix for each species $i$ , $L_{i} \in R^{300 \times m_{i}}$ .

For the gene expression matrices $Z_{i} \in R^{n_{i} \times m_{i}}$, where $n_{i}$ and $m_{i}$ are the numbers of cells and genes in species $i$, respectively, we first zero the expression of genes that do not have an edge in $\hat{G}$ and standardize the expression matrices such that each gene has zero mean and unit variance, yielding $\tilde{Z}_{i}$. $\hat{G}$ represents a bipartite graph in the form of $\hat{G} = \begin{bmatrix} 0_{m_{1}, m_{1}} & H \\ H^{T} & 0_{m_{2}, m_{2}} \end{bmatrix}$, where $0_{m, m}$ is the $m \times m$ zero matrix and $H \in R^{m_{1} \times m_{2}}$ is the biadjacency matrix. Letting $H_{1} = H$ and $H_{2} = H^{T}$ encode directed edges from species 1 to 2 and from species 2 to 1, respectively, we normalize each biadjacency matrix $H_{i}$: $\hat{H}_{i} = \mathrm{SumNorm}(H_{i}) \in R^{m_{i} \times m_{j}}$, where the $\mathrm{SumNorm}$ function normalizes the rows to sum to 1. The feature spaces can be transformed between the two species via weighted averaging of gene expression, $\tilde{Z}_{i j} = \tilde{Z}_{i} \hat{H}_{i}$.

We project the expression data from two species into a joint PC space (Barkas et al., 2019), $P_{i} = {\tilde{Z}}_{i} {L_{i}}^{T}$ and $P_{i j} = {\tilde{Z}}_{i j} {L_{j}}^{T}$ . We then horizontally concatenate the principal components $P_{i}$ and $P_{i j}$ to form ${\hat{P}}_{i} \in R^{n_{i} \times 600}$ .
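The feature-space translation and joint projection can be sketched with plain Python lists. The helper names are hypothetical, the toy loadings use a single PC rather than 300, and a real dataset would need sparse matrices instead of nested lists.

```python
from statistics import mean, pstdev


def standardize_genes(Z, has_edge):
    """Zero genes without homology edges, then scale every remaining gene
    (column) to zero mean and unit variance."""
    columns = []
    for gene, keep in zip(zip(*Z), has_edge):
        mu, sd = mean(gene), pstdev(gene)
        if not keep or sd == 0:
            columns.append([0.0] * len(gene))
        else:
            columns.append([(x - mu) / sd for x in gene])
    return [list(row) for row in zip(*columns)]


def sum_norm(H):
    """Normalize each row of H to sum to 1 (all-zero rows stay zero)."""
    return [[x / sum(row) for x in row] if sum(row) > 0 else [0.0] * len(row)
            for row in H]


def matmul(X, Y):
    """Matrix product of two list-of-lists matrices."""
    Y_cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in Y_cols] for row in X]


def transpose(M):
    return [list(col) for col in zip(*M)]


def joint_pcs(Z_tilde_i, H_i, L_i, L_j):
    """Return P_hat_i, the horizontal concatenation of P_i and P_ij."""
    Z_ij = matmul(Z_tilde_i, sum_norm(H_i))   # n_i x m_j, translated features
    P_i = matmul(Z_tilde_i, transpose(L_i))   # projection onto own PCs
    P_ij = matmul(Z_ij, transpose(L_j))       # projection onto the other species' PCs
    return [a + b for a, b in zip(P_i, P_ij)]


Z_tilde = standardize_genes([[1, 4], [3, 2]], has_edge=[True, True])
H = [[1, 3], [0, 2]]          # toy biadjacency matrix, species 1 -> species 2
L1, L2 = [[1, 0]], [[0, 1]]   # toy loadings: 1 PC x 2 genes per species
P_hat = joint_pcs(Z_tilde, H, L1, L2)
```

With one PC per species, each cell ends up with two coordinates; with 300 PCs per species this gives the 600 columns of $\hat{P}_{i}$.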

Using the joint PCs, ${\hat{P}}_{i}$ , we identify for each cell the $k$ -nearest neighbors in the other dataset using cosine similarity ( $k = 20$ by default). Neighbors are identified using the hnswlib library, a fast approximate nearest-neighbor search algorithm (Malkov and Yashunin, 2020). This outputs two directed biadjacency matrices $C_{i} \in R^{n_{i} \times n_{j}}$ for $(i, j) = (1, 2)$ or $(2, 1)$ with edge weights equal to the cosine similarity between the PCs.
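For illustration, the cross-species neighbor search can be written as an exact brute-force cosine search. Note that this swaps the approximate hnswlib index named above for an exhaustive comparison, which is only practical for small inputs; the function name `cosine_knn` is hypothetical.

```python
import math


def cosine_knn(P_query, P_target, k=20):
    """For each query cell, return {target_index: cosine_similarity}
    for its k most similar cells in the other dataset."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm > 0 else list(v)

    Q = [unit(v) for v in P_query]
    T = [unit(v) for v in P_target]
    neighbors = []
    for q in Q:
        sims = [sum(a * b for a, b in zip(q, t)) for t in T]
        top = sorted(range(len(T)), key=lambda j: sims[j], reverse=True)[:k]
        neighbors.append({j: sims[j] for j in top})
    return neighbors


P1 = [[1, 0], [0, 1]]
P2 = [[2, 0], [0, 3], [1, 1]]
C1 = cosine_knn(P1, P2, k=2)   # rows of the directed biadjacency matrix C_1
C2 = cosine_knn(P2, P1, k=2)   # rows of C_2
```

Each returned dictionary corresponds to one sparse row of $C_{i}$, with the cosine similarity as the edge weight.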

To increase the stringency and confidence of mapping, we only rely on cells that are mutual nearest cross-species neighbors, which are typically defined as two cells reciprocally connected to one another (Haghverdi et al., 2018). However, due to the noise in cell-cell correlations and stochasticity in the kNN algorithms, cross-species neighbors are often randomly assigned from a pool of cells that appear equally similar, decreasing the likelihood of mutual connectivity between individual cells even if they have similar expression profiles. To overcome this limitation, we integrate information from each cell’s local neighborhood to establish more robust mutual connectivity between cells across species. Two cells are thus defined as mutual nearest cross-species neighbors when their respective neighborhoods have mutual connectivity.

Specifically, the nearest neighbor graphs $N_{i}$ generated by SAM are used to calculate the neighbors of cells $t_{i}$ hops away along outgoing edges: ${\bar{N}}_{i} = {N_{i}}^{t_{i}}$ , where ${\bar{N}}_{i}$ are adjacency matrices that contain the number of paths connecting two cells $t_{i}$ hops away, for $i = 1$ or 2. $t_{i}$ determines the length-scale over which we integrate incoming edges for species $i$ . Its default value is 2 if the dataset size is less than 20,000 cells and 3 otherwise. However, cells within tight clusters may have spurious edges connecting to other parts of the manifold only a few hops away. To avoid integrating neighborhood information outside this local structure, we use the Leiden algorithm (Traag et al., 2019) to cluster the graph and identify a local neighborhood size for each cell (the resolution parameter is set to 3 by default). If cell $a$ belongs to cluster $c_{a}$ , then its neighborhood size is $l_{a} = | c_{a} |$ . For each row $a$ in ${\bar{N}}_{i}$ we only keep the $l_{a}$ geodesically closest cells, letting the pruned graph update ${\hat{N}}_{i}$ .
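The hop expansion and the per-cell neighborhood size can be sketched as below. The sketch stops before the pruning step, since ranking cells by geodesic distance depends on details not given here; the function names and the use of precomputed cluster labels in place of a Leiden run are assumptions.

```python
from collections import Counter


def default_hops(n_cells):
    """Default t_i: 2 for datasets with fewer than 20,000 cells, 3 otherwise."""
    return 2 if n_cells < 20000 else 3


def matmul(X, Y):
    """Matrix product of two list-of-lists matrices."""
    Y_cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in Y_cols] for row in X]


def hop_paths(N, t):
    """N^t: entry (a, b) counts the directed paths of exactly t hops from a to b
    when N is a 0/1 adjacency matrix."""
    result = N
    for _ in range(t - 1):
        result = matmul(result, N)
    return result


def neighborhood_sizes(cluster_labels):
    """l_a = |c_a|: the size of the cluster that each cell belongs to."""
    counts = Counter(cluster_labels)
    return [counts[c] for c in cluster_labels]


N = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]    # directed chain 0 -> 1 -> 2
N_bar = hop_paths(N, default_hops(len(N)))
sizes = neighborhood_sizes(["x", "y", "x"])
```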

Edges outgoing from cell $a_{i}$ in species $i$ are encoded in the corresponding row in the adjacency matrix: $C_{i, a_{i}}$. We compute the fraction of the outgoing edges from each cell that target the local neighborhood of a cell in the other species: $\tilde{C}_{i, a_{i} b_{j}} = \sum_{c \in X_{j, b_{j}}} C_{i, a_{i} c}$, where $X_{j, b_{j}}$ is the set of cells in the neighborhood of cell $b_{j}$ in species $j$ and $\tilde{C}_{i, a_{i} b_{j}}$ is the fraction of outgoing edges from cell $a_{i}$ in species $i$ targeting the neighborhood of cell $b_{j}$ in species $j$.

To reduce the density of ${\tilde{C}}_{i}$ so as to satisfy computational memory constraints, we remove edges with weight less than 0.1. Finally, we apply the mutual nearest neighborhood criterion by taking the element-wise geometric mean of the two directed bipartite graphs: $\tilde{C} = \sqrt{{\tilde{C}}_{1} \circ {\tilde{C}}_{2}}$. This operation ensures that only bidirectional edges are preserved, as a small edge weight in either direction results in a small geometric mean.
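A sketch of the neighborhood aggregation, sparsification, and geometric mean on toy matrices. It assumes that $\tilde{C}_{2}$ is read in transposed orientation so that both factors are indexed as (species 1 cell, species 2 cell); the function names and the list-based neighborhood format are hypothetical.

```python
import math


def neighborhood_sum(C, neighborhoods):
    """C_tilde[a][b] = sum of C[a][c] over cells c in the neighborhood of b.

    neighborhoods[b] lists the cells in the local neighborhood X_b of
    target cell b (including b itself if desired).
    """
    return [[sum(row[c] for c in nbrs) for nbrs in neighborhoods] for row in C]


def sparsify(M, min_weight=0.1):
    """Remove edges whose weight is less than min_weight."""
    return [[x if x >= min_weight else 0.0 for x in row] for row in M]


def mutual_neighborhoods(C1_tilde, C2_tilde):
    """Element-wise geometric mean of the two directed graphs.

    C1_tilde is n1 x n2 and C2_tilde is n2 x n1, so entry (a, b) pairs
    C1_tilde[a][b] with C2_tilde[b][a] (an assumption about orientation).
    """
    n1, n2 = len(C1_tilde), len(C2_tilde)
    return [[math.sqrt(C1_tilde[a][b] * C2_tilde[b][a]) for b in range(n2)]
            for a in range(n1)]


C1 = [[0.5, 0.5], [0.0, 1.0]]          # species 1 -> species 2
C2 = [[1.0, 0.0], [0.05, 0.0]]         # species 2 -> species 1
C1_t = sparsify(neighborhood_sum(C1, [[0], [0, 1]]))
C2_t = sparsify(neighborhood_sum(C2, [[0], [1]]))
C_mutual = mutual_neighborhoods(C1_t, C2_t)
```

In this toy case only the pair (0, 0) has support in both directions, so it is the only nonzero entry of the result.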

Given the mutual nearest neighborhoods $\tilde{C} \in R^{n_{1} \times n_{2}}$, we select the $k$ nearest neighborhoods for each cell in both directions to update the directed biadjacency matrices $C_{1}$ and $C_{2}$: $C_{1} = \mathrm{KNN}(\tilde{C}, k)$ and $C_{2} = \mathrm{KNN}(\tilde{C}^{T}, k)$, with $k = 20$ by default.
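The $\mathrm{KNN}$ operator can be read as keeping the $k$ largest entries in each row and zeroing the rest. A minimal sketch under that reading, with hypothetical function names:

```python
def knn(M, k=20):
    """Keep the k largest positive entries in each row; zero everything else."""
    out = []
    for row in M:
        candidates = [j for j, w in enumerate(row) if w > 0]
        top = set(sorted(candidates, key=lambda j: row[j], reverse=True)[:k])
        out.append([w if j in top else 0.0 for j, w in enumerate(row)])
    return out


def transpose(M):
    return [list(col) for col in zip(*M)]


C_tilde = [[0.9, 0.1, 0.5], [0.0, 0.2, 0.0]]   # n1 = 2 cells, n2 = 3 cells
C1 = knn(C_tilde, k=2)                         # species 1 -> species 2
C2 = knn(transpose(C_tilde), k=1)              # species 2 -> species 1
```

Because the selection is made per row in each direction, $C_{1}$ and $C_{2}^{T}$ can differ even though both derive from the same $\tilde{C}$.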

We use $C_{1}$ and $C_{2}$ to combine the manifolds $N_{1}$ and $N_{2}$ into a unified graph. We first weight the edges in $N_{1}$ and $N_{2}$ to account for the number of shared cross-species neighbors by computing the one-mode projections of $C_{1}$ and $C_{2}$. In addition, for cells with strong cross-species alignment, we attenuate the weight of their within-species edges. For cells with little to no cross-species alignment, the within-species edges are kept the same to ensure that the local topological information around cells with no alignment is preserved.

Specifically, we use $N_{1}$ and $N_{2}$ to mask the edges in the one-mode projections, $\tilde{N}_{1} = U(N_{1}) \circ (\mathrm{Norm}(C_{1})\, \mathrm{Norm}(C_{2}))$ and $\tilde{N}_{2} = U(N_{2}) \circ (\mathrm{Norm}(C_{2})\, \mathrm{Norm}(C_{1}))$, where $U(E)$ sets all edge weights in graph $E$ to 1 and $\mathrm{Norm}$ normalizes the outgoing edges from each cell to sum to 1. The minimum edge weight is set to be 0.3 to ensure that neighbors in the original manifolds with no shared cross-species neighbors still retain connectivity: $\tilde{N}_{1, i j} = \min(0.3, \tilde{N}_{1, i j})$ and $\tilde{N}_{2, i j} = \min(0.3, \tilde{N}_{2, i j})$ for all edges $(i, j)$. We then scale the within-species edges from cell $i$ by the total weight of its cross-species edges: $\tilde{N}_{1, i} = \left(1 - \frac{1}{k} \sum_{j = 1}^{n_{2}} C_{1, i j}\right) \tilde{N}_{1, i}$ and $\tilde{N}_{2, i} = \left(1 - \frac{1}{k} \sum_{j = 1}^{n_{1}} C_{2, i j}\right) \tilde{N}_{2, i}$. Finally, the within- and cross-species graphs are stitched together to form the combined nearest neighbor graph $N$: $N = [\tilde{N}_{1} \oplus C_{1}] \oplus [C_{2} \oplus \tilde{N}_{2}]$. The overall alignment score between species 1 and 2 is defined as $S = \frac{1}{n_{1} + n_{2}} \left(\sum_{i = 1}^{n_{1}} \sum_{j = 1}^{n_{2}} C_{1, i j} + \sum_{i = 1}^{n_{2}} \sum_{j = 1}^{n_{1}} C_{2, i j}\right)$.
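Two of the formulas above, the attenuation of within-species edges and the alignment score $S$, translate directly into a few lines. This sketch covers only those two steps; the function names are hypothetical, and the toy example uses $k = 2$ instead of the default 20.

```python
def attenuate_within_species(N_tilde, C, k=20):
    """Scale row i of N_tilde by (1 - (1/k) * sum_j C[i][j])."""
    scaled = []
    for n_row, c_row in zip(N_tilde, C):
        factor = 1 - sum(c_row) / k
        scaled.append([w * factor for w in n_row])
    return scaled


def alignment_score(C1, C2):
    """S = (sum of all entries of C1 and C2) / (n1 + n2)."""
    n1, n2 = len(C1), len(C2)
    total = sum(sum(row) for row in C1) + sum(sum(row) for row in C2)
    return total / (n1 + n2)


C1 = [[1.0, 1.0], [0.0, 0.0]]   # cell 0 is fully aligned, cell 1 is not
C2 = [[1.0, 0.0], [1.0, 0.0]]
N1_tilde = [[0.0, 0.5], [0.3, 0.0]]

N1_scaled = attenuate_within_species(N1_tilde, C1, k=2)
S = alignment_score(C1, C2)
```

Cell 0 loses its within-species edges because its cross-species weights sum to $k$, while the unaligned cell 1 keeps its original edge weight.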

To compute correlations between gene pairs, we first transfer expression values from one species to the other: $\bar{Z}_{i, n_{i} m_{j}} = C_{i, n_{i}} Z_{j, m_{j}}$, where $\bar{Z}_{i, n_{i} m_{j}}$ is the imputed expression of gene $m_{j}$ from species $j$ for cell $n_{i}$ in species $i$, and $C_{i, n_{i}}$ is row $n_{i}$ of the biadjacency matrix encoding the cross-species neighbors of cell $n_{i}$ in species $i$, all for $(i, j) = (1, 2)$ and $(2, 1)$. We similarly use the manifolds constructed by SAM to smooth the within-species gene expression using kNN averaging: $\bar{Z}_{j, m_{j}} = N_{j, m_{j}} Z_{j, m_{j}}$, where $N_{j}$ is the nearest-neighbor graph for species $j$. We then concatenate the within- and cross-species gene expression such that the expression of gene $m_{j}$ from species $j$ in both species is $\bar{Z}_{m_{j}} = \bar{Z}_{i, m_{j}} \oplus \bar{Z}_{j, m_{j}}$.

For all gene pairs in the initial unpruned homology graph, $\hat{G}$, we compute their correlations, $\hat{G}_{ab} := \theta(0)\, \mathrm{Corr}(\bar{Z}_{a}, \bar{Z}_{b})$, where $\theta(0)$ is a Heaviside step function centered at 0 to set negative correlations to zero. We then use the expression correlations to update the corresponding edge weights in $\hat{G}$, which are again normalized through $\hat{G}_{ab} = 0.5 + 0.5 \tanh\left(10\, \hat{G}_{ab} / \max_{b}(\hat{G}_{ab}) - 5\right)$.
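The edge-weight update can be sketched as follows, assuming $\mathrm{Corr}$ denotes the Pearson correlation (the text does not name the coefficient). The function names and the dictionary that maps each gene to its concatenated expression vector $\bar{Z}$ are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def update_edge_weights(edges, Z_bar):
    """Replace each homology edge weight by the clipped expression correlation,
    then apply the tanh rescaling relative to each gene's maximum."""
    corr = {(a, b): max(0.0, pearson(Z_bar[a], Z_bar[b])) for a, b in edges}
    row_max = {}
    for (a, _), w in corr.items():
        row_max[a] = max(row_max.get(a, 0.0), w)
    updated = {}
    for (a, b), w in corr.items():
        if row_max[a] > 0:
            updated[(a, b)] = 0.5 + 0.5 * math.tanh(10 * w / row_max[a] - 5)
        else:
            updated[(a, b)] = 0.0  # no positively correlated partner at all
    return updated


Z_bar = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]}
weights = update_edge_weights([("a", "b"), ("a", "c")], Z_bar)
```

Here the positively correlated pair keeps a weight close to 1, while the anticorrelated pair is clipped to zero correlation and ends up with a weight close to 0.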

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
