The SAMap algorithm contains three major steps: preprocessing, mutual nearest neighborhood alignment, and gene-gene correlation initialization. The latter two are repeated for three iterations, by default, to balance alignment performance and computational runtime.
We first construct a gene-gene bipartite graph between two species by performing reciprocal BLAST of their respective transcriptomes using tblastx, or proteomes using blastp. tblastn and blastx are used for BLAST between proteome and transcriptome. When a pair of genes share multiple High Scoring Pairs (HSPs), which are local regions of matching sequences, we use the HSP with the highest bit score to measure homology. Only pairs with E-value <10−6 are included in the graph.
Although we define similarity using BLAST, SAMap is compatible with other protein homology detection methods (e.g. HMMER [Eddy, 2008]) or orthology inference tools (e.g. OrthoClust [Yan et al., 2014] and eggNOG [Huerta-Cepas et al., 2019]). While each of these methods has known strengths and limitations, BLAST is chosen for its broad usage, technical convenience, and compatibility with low-quality transcriptomes.
We encode the BLAST results into two triangular adjacency matrices, and , each containing bit scores in one BLAST direction. We combine and to form a gene-gene adjacency matrix . After symmetrizing , we remove edges that only appear in one direction: , where only keeps reciprocal edges, and and are the number of genes of the two species, respectively. To filter out relatively weak homologies, we also remove edges where . Edge weights are then normalized by the maximum edge weight for each gene and transformed by a hyperbolic tangent function to increase discriminatory power between low and high edge weights, .
The single-cell RNAseq datasets are normalized such that each cell has a total number of raw counts equal to the median size of single-cell libraries. Gene expressions are then log-normalized with the addition of a pseudocount of 1. Genes expressed (i.e. ) in greater than 96% of cells are filtered out. SAM is run using the following parameters: preprocessing = ‘StandardScaler’, weight_PCs = False, k = 20, and npcs = 150. A detailed description of parameters is provided previously (Tarashansky et al., 2019). SAM outputs and , which are directed adjacency matrices that encode k-nearest neighbor graphs for the two datasets, respectively.
SAM only includes the top 3000 genes ranked by SAM weights and the first 150 principal components (PCs) in the default mode to reduce computational complexity. However, downstream mapping requires PC loadings for all genes. Thus, in the final iteration of SAM, we run PCA on all genes and take the top 300 PCs. This step generates a loading matrix for each species , .
For the gene expression matrices , where and are the number of cells and genes respectively, we first zero the expression of genes that do not have an edge in and standardize the expression matrices such that each gene has zero mean and unit variance, yielding . represents a bipartite graph in the form of , where is zero matrix and is the biadjacency matrix. Letting and encoding directed edges from species 1 to 2 and 2 to 1, respectively, we normalize the biadjacency matrix such that each row sums to 1: , where the function normalizes the rows to sum to 1. The feature spaces can be transformed between the two species via weighted averaging of gene expression, .
We project the expression data from two species into a joint PC space (Barkas et al., 2019), and . We then horizontally concatenate the principal components and to form .
Using the joint PCs, , we identify for each cell the -nearest neighbors in the other dataset using cosine similarity ( by default). Neighbors are identified using the hnswlib library, a fast approximate nearest-neighbor search algorithm (Malkov and Yashunin, 2020). This outputs two directed biadjacency matrices for or with edge weights equal to the cosine similarity between the PCs.
To increase the stringency and confidence of mapping, we only rely on cells that are mutual nearest cross-species neighbors, which are typically defined as two cells reciprocally connected to one another (Haghverdi et al., 2018). However, due to the noise in cell-cell correlations and stochasticity in the kNN algorithms, cross-species neighbors are often randomly assigned from a pool of cells that appear equally similar, decreasing the likelihood of mutual connectivity between individual cells even if they have similar expression profiles. To overcome this limitation, we integrate information from each cell’s local neighborhood to establish more robust mutual connectivity between cells across species. Two cells are thus defined as mutual nearest cross-species neighbors when their respective neighborhoods have mutual connectivity.
Specifically, the nearest neighbor graphs generated by SAM are used to calculate the neighbors of cells hops away along outgoing edges: , where are adjacency matrices that contain the number of paths connecting two cells hops away, for or 2. determines the length-scale over which we integrate incoming edges for species . Its default value is 2 if the dataset size is less than 20,000 cells and 3 otherwise. However, cells within tight clusters may have spurious edges connecting to other parts of the manifold only a few hops away. To avoid integrating neighborhood information outside this local structure, we use the Leiden algorithm (Traag et al., 2019) to cluster the graph and identify a local neighborhood size for each cell (the resolution parameter is set to 3 by default). If cell belongs to cluster , then its neighborhood size is . For each row in we only keep the geodesically closest cells, letting the pruned graph update .
Edges outgoing from cell in species are encoded in the corresponding row in the adjacency matrix: . We compute the fraction of the outgoing edges from each cell that target the local neighborhood of a cell in the other species: , where is the set of cells in the neighborhood of cell in species and is the fraction of outgoing edges from cell in species targeting the neighborhood of cell in species .
To reduce the density of so as to satisfy computational memory constraints, we remove edges with weight less than 0.1. Finally, we apply the mutual nearest neighborhood criterion by taking the element-wise, geometric mean of the two directed bipartite graphs: . This operation ensures that only bidirectional edges are preserved, as small edge weights in either direction results in small geometric means.
Given the mutual nearest neighborhoods , we select the k nearest neighborhoods for each cell in both directions to update the directed biadjacency matrices and : and , with by default.
We use and to combine the manifolds and into a unified graph. We first weight the edges in and to account for the number of shared cross-species neighbors by computing the one-mode projections of and . In addition, for cells with strong cross-species alignment, we attenuate the weight of their within-species edges. For cells with little to no cross-species alignment, their within-species are kept the same to ensure that the local topological information around cells with no alignment is preserved.
Specifically, we use and to mask the edges in the one-mode projections, and , where sets all edge weights in graph to 1 and normalizes the outgoing edges from each cell to sum to 1. The minimum edge weight is set to be 0.3 to ensure that neighbors in the original manifolds with no shared cross-species neighbors still retain connectivity: and for all edges . We then scale the within-species edges from cell by the total weight of its cross-species edges: and . Finally, the within- and cross-species graphs are stitched together to form the combined nearest neighbor graph : . The overall alignment score between species 1 and 2 is defined as .
To compute correlations between gene pairs, we first transfer expressions from one species to the other: , where is the imputed expressions of gene from species for cell in species , and is row of the biadjacency matrix encoding the cross-species neighbors of cell in species , all for and . We similarly use the manifolds constructed by SAM to smooth the within-species gene expressions using kNN averaging: , where is the nearest-neighbor graph for species . We then concatenate the within- and cross-species gene expressions such that the expression of gene from species in both species is .
For all gene pairs in the initial unpruned homology graph, , we compute their correlations, , where is a Heaviside step function centered at 0 to set negative correlations to zero. We then use the expression correlations to update the corresponding edge weights in , which are again normalized through .
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.