The SAMap algorithm

AT Alexander J Tarashansky
JM Jacob M Musser
MK Margarita Khariton
PL Pengyang Li
DA Detlev Arendt
SQ Stephen R Quake
BW Bo Wang
request Request a Protocol
ask Ask a question
Favorite

The SAMap algorithm contains three major steps: preprocessing, mutual nearest neighborhood alignment, and gene-gene correlation initialization. The latter two are repeated for three iterations, by default, to balance alignment performance and computational runtime.

We first construct a gene-gene bipartite graph between two species by performing reciprocal BLAST of their respective transcriptomes using tblastx, or proteomes using blastp. tblastn and blastx are used for BLAST between proteome and transcriptome. When a pair of genes share multiple High Scoring Pairs (HSPs), which are local regions of matching sequences, we use the HSP with the highest bit score to measure homology. Only pairs with E-value <10−6 are included in the graph.

Although we define similarity using BLAST, SAMap is compatible with other protein homology detection methods (e.g. HMMER [Eddy, 2008]) or orthology inference tools (e.g. OrthoClust [Yan et al., 2014] and eggNOG [Huerta-Cepas et al., 2019]). While each of these methods has known strengths and limitations, BLAST is chosen for its broad usage, technical convenience, and compatibility with low-quality transcriptomes.

We encode the BLAST results into two triangular adjacency matrices, A and B, each containing bit scores in one BLAST direction. We combine A and B to form a gene-gene adjacency matrix G. After symmetrizing G, we remove edges that only appear in one direction: G=Recip(12A+B+A+BT)Rm1+m2×m1+m2, where Recip only keeps reciprocal edges, and m1 and m2 are the number of genes of the two species, respectively. To filter out relatively weak homologies, we also remove edges where Gab<0.25maxb(Gab). Edge weights are then normalized by the maximum edge weight for each gene and transformed by a hyperbolic tangent function to increase discriminatory power between low and high edge weights, G^ab=0.5+0.5tanh(10Gab/maxb(Gab)-5).

The single-cell RNAseq datasets are normalized such that each cell has a total number of raw counts equal to the median size of single-cell libraries. Gene expressions are then log-normalized with the addition of a pseudocount of 1. Genes expressed (i.e. log2(D+1)>1) in greater than 96% of cells are filtered out. SAM is run using the following parameters: preprocessing = ‘StandardScaler’, weight_PCs = False, k = 20, and npcs = 150. A detailed description of parameters is provided previously (Tarashansky et al., 2019). SAM outputs N1 and N2, which are directed adjacency matrices that encode k-nearest neighbor graphs for the two datasets, respectively.

SAM only includes the top 3000 genes ranked by SAM weights and the first 150 principal components (PCs) in the default mode to reduce computational complexity. However, downstream mapping requires PC loadings for all genes. Thus, in the final iteration of SAM, we run PCA on all genes and take the top 300 PCs. This step generates a loading matrix for each species i, LiR300×mi.

For the gene expression matrices ZiRni×mi, where n and m are the number of cells and genes respectively, we first zero the expression of genes that do not have an edge in G^ and standardize the expression matrices such that each gene has zero mean and unit variance, yielding Z~i. G^ represents a bipartite graph in the form of G^=0m1,m1HRm1×m2HTRm2×m10m2,m2, where 0m,m is m×m zero matrix and H is the biadjacency matrix. Letting H1=H and H2=HT encoding directed edges from species 1 to 2 and 2 to 1, respectively, we normalize the biadjacency matrix Hi such that each row sums to 1: H^i=SumNorm(Hi)Rmi×mj, where the SumNorm function normalizes the rows to sum to 1. The feature spaces can be transformed between the two species via weighted averaging of gene expression, Z~ij=Z~iH^i.

We project the expression data from two species into a joint PC space (Barkas et al., 2019), Pi=Z~iLiT and Pij=Z~ijLjT. We then horizontally concatenate the principal components Pi and Pij to form P^iRni×600.

Using the joint PCs, P^i, we identify for each cell the k-nearest neighbors in the other dataset using cosine similarity (k=20 by default). Neighbors are identified using the hnswlib library, a fast approximate nearest-neighbor search algorithm (Malkov and Yashunin, 2020). This outputs two directed biadjacency matrices CiRni×nj for (i,j)=(1,2) or (2,1) with edge weights equal to the cosine similarity between the PCs.

To increase the stringency and confidence of mapping, we only rely on cells that are mutual nearest cross-species neighbors, which are typically defined as two cells reciprocally connected to one another (Haghverdi et al., 2018). However, due to the noise in cell-cell correlations and stochasticity in the kNN algorithms, cross-species neighbors are often randomly assigned from a pool of cells that appear equally similar, decreasing the likelihood of mutual connectivity between individual cells even if they have similar expression profiles. To overcome this limitation, we integrate information from each cell’s local neighborhood to establish more robust mutual connectivity between cells across species. Two cells are thus defined as mutual nearest cross-species neighbors when their respective neighborhoods have mutual connectivity.

Specifically, the nearest neighbor graphs Ni generated by SAM are used to calculate the neighbors of cells ti hops away along outgoing edges: N¯i=Niti, where N¯i are adjacency matrices that contain the number of paths connecting two cells ti hops away, for i=1 or 2. ti determines the length-scale over which we integrate incoming edges for species i. Its default value is 2 if the dataset size is less than 20,000 cells and 3 otherwise. However, cells within tight clusters may have spurious edges connecting to other parts of the manifold only a few hops away. To avoid integrating neighborhood information outside this local structure, we use the Leiden algorithm (Traag et al., 2019) to cluster the graph and identify a local neighborhood size for each cell (the resolution parameter is set to 3 by default). If cell a belongs to cluster ca, then its neighborhood size is la=|ca|. For each row a in N¯i we only keep the la geodesically closest cells, letting the pruned graph update N^i.

Edges outgoing from cell ai in species i are encoded in the corresponding row in the adjacency matrix: Ci,ai. We compute the fraction of the outgoing edges from each cell that target the local neighborhood of a cell in the other species: C~i,aibj=cXj,bjCi,aic, where Xj,bj is the set of cells in the neighborhood of cell bj in species j and C~i,aibj is the fraction of outgoing edges from cell ai in species i targeting the neighborhood of cell bj in species j.

To reduce the density of C~i so as to satisfy computational memory constraints, we remove edges with weight less than 0.1. Finally, we apply the mutual nearest neighborhood criterion by taking the element-wise, geometric mean of the two directed bipartite graphs: C~=C~1C~2. This operation ensures that only bidirectional edges are preserved, as small edge weights in either direction results in small geometric means.

Given the mutual nearest neighborhoods C~Rn1×n2, we select the k nearest neighborhoods for each cell in both directions to update the directed biadjacency matrices C1 and C2: C1=KNN(C~,k) and C2=KNN(C~T,k), with k=20 by default.

We use C1 and C2 to combine the manifolds N1 and N2 into a unified graph. We first weight the edges in N1 and N2 to account for the number of shared cross-species neighbors by computing the one-mode projections of C1 and C2. In addition, for cells with strong cross-species alignment, we attenuate the weight of their within-species edges. For cells with little to no cross-species alignment, their within-species are kept the same to ensure that the local topological information around cells with no alignment is preserved.

Specifically, we use N1 and N2 to mask the edges in the one-mode projections, N~1=U(N1)(Norm(C1)Norm(C2)) and N~2=U(N2)(Norm(C2)Norm(C1)), where U(E) sets all edge weights in graph E to 1 and Norm normalizes the outgoing edges from each cell to sum to 1. The minimum edge weight is set to be 0.3 to ensure that neighbors in the original manifolds with no shared cross-species neighbors still retain connectivity: N1,ij=min(0.3,N1,ij) and N2,ij=min(0.3,N2,ij) for all edges (i,j). We then scale the within-species edges from cell i by the total weight of its cross-species edges: N~1,i=(1-1kj=1n2C1,ij)N~1,i and N~2,i=(1-1kj=1n1C2,ij)N~2,i. Finally, the within- and cross-species graphs are stitched together to form the combined nearest neighbor graph N: N=[N~1C1][C2N~2]. The overall alignment score between species 1 and 2 is defined as S=1n1+n2(i=1n1j=1n2C1,ij+i=1n2j=1n1C2,ij).

To compute correlations between gene pairs, we first transfer expressions from one species to the other: Z¯i,nimj=Ci,niZj,mj, where Z¯i,nimj is the imputed expressions of gene mj from species j for cell ni in species i, and Ci,ni is row ni of the biadjacency matrix encoding the cross-species neighbors of cell ni in species i, all for (i,j)=(1,2) and (2,1). We similarly use the manifolds constructed by SAM to smooth the within-species gene expressions using kNN averaging: Z¯j,mj=Nj,mjZj,mj, where Nj is the nearest-neighbor graph for species j. We then concatenate the within- and cross-species gene expressions such that the expression of gene mj from species j in both species is Z¯mj=Z¯i,mjZ¯j,mj.

For all gene pairs in the initial unpruned homology graph, G^, we compute their correlations, G^ab:=θ(0)Corr(Z¯a,Z¯b), where θ(0) is a Heaviside step function centered at 0 to set negative correlations to zero. We then use the expression correlations to update the corresponding edge weights in G^, which are again normalized through G^ab=0.5+0.5tanh(10G^ab/maxb(G^ab)-5).

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A