SnapATAC pipeline workflow

Rongxin Fang; Sebastian Preissl; Yang Li; Xiaomeng Hou; Jacinta Lucero; Xinxin Wang; Amir Motamedi; Andrew K. Shiau; Xinzhu Zhou; Fangming Xie; Eran A. Mukamel; Kai Zhang; Yanxiao Zhang; M. Margarita Behrens; Joseph R. Ecker; Bing Ren

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

SnapATAC pipeline workflow

RF Rongxin Fang

SP Sebastian Preissl

YL Yang Li

XH Xiaomeng Hou

JL Jacinta Lucero

XW Xinxin Wang

AM Amir Motamedi

AS Andrew K. Shiau

XZ Xinzhu Zhou

FX Fangming Xie

EM Eran A. Mukamel

KZ Kai Zhang

YZ Yanxiao Zhang

MB M. Margarita Behrens

JE Joseph R. Ecker

BR Bing Ren

This method is extracted from research article: Nat Commun, Feb 2021

Comprehensive analysis of single cell ATAC-seq data with SnapATAC

DOI: 10.1038/s41467-021-21583-9

Request a Protocol

Ask a question

Favorite

Using a custom python script, we first perform FASTQ file demulticomplexing by integrating the cell barcode into the read name in the following format: @+barcode+:+original_read_name.

Demultiplexed sequencing reads are aligned to the corresponding reference genome (i.e., mm10 or hg19) using bwa (0.7.13-r1126) in pair-end mode with default parameter settings. Aligned reads are then sorted based on the read name using samtools (v1.9) to group together reads originating from the same barcodes.

Pair-end reads are converted into fragments and only those that meet the following criteria are kept: (1) properly paired (according to SMA flag value); (2) uniquely mapped (MAPQ > 30); (3) insert distance within [50–1000 bp]. PCR duplicates (fragments sharing the same genomic coordinates) are removed for each cell separately. Given that Tn5 introduces a 9 bp staggered, reads mapping to the positive and negative strand were shifted by +4/−5 bp, respectively^⁴⁸.

We identify the high-quality cells based on two criteria: (1) total number of unique fragment count [>1000]; (2) fragments in promoter ratio—the percentage of fragments overlapping with annotated promoter regions [0.2–0.8]. The promoter regions used in this study are downloaded from 10× genomics for hg19 and mm10.

Using the remaining fragments, we next generate a snap-format (Single-Nucleus Accessibility Profiles) file using snaptools (https://github.com/r3fang/SnapTools). A snap file is a hierarchically structured hdf5 file that contains the following sections: header (HD), cell-by-bin matrix (BM), cell-by-peak matrix (PM), cell-by-gene matrix (GM), barcode (BD), and fragment (FM). HD session contains snap-file version, date, alignment and reference genome information. BD session contains all unique barcodes and corresponding meta data. BM session contains cell-by-bin matrices of different bin sizes. PM session contains cell-by-peak count matrix. GM session contains cell-by-gene count matrix. FM session contains all usable fragments for each cell. Fragments are indexed based on barcodes that enables fast retrieval of reads belonging to the same barcodes. Detailed information about snap file can be found here: https://github.com/r3fang/SnapTools/blob/master/docs/snap_format.docx.

Using the resulting snap file, we next create cell-by-bin count matrix. The genome is segmented into uniform-sized bins and single-cell ATAC-seq profiles are represented as cell-by-bin matrix with each element indicating number of sequencing fragments overlapping with a given bin in a certain cell.

To evaluate the effect of bin size to clustering performance, we apply SnapATAC to three datasets namely 5 K PBMC (10×), Mouse Brain (10×), and MOs-M1 (snATAC) (data source listed in Supplementary Table ¹). These datasets are generated by both plate and droplet platforms using either cell or nuclei with considerably different depth, allowing us to systematically evaluate the effect of bin size.

For each dataset, we define the “landmark” cell types in a supervised manner. First, we perform cis-Topic^¹⁶ for dimensionality reduction and identify cell clusters using graph-based algorithm Louvain^⁴⁹ with k = 15. Second, we manually define the major cell types in each dataset by examining the gene accessibility score at the canonical marker genes (see Supplementary Fig. ⁹ as an example for MOs-M1). Third, clusters sharing the same marker genes are manually merged and those failing to show unique signatures are discarded. In total, we define nine most convincing cell types in PBMC 5 K (10×), 14 types in Mouse Brain 5 K (10×) and 14 types in MOs-M1 (snATAC). Among these cell types, 14 cell populations that account for less than 2% of the total population are considered as rare cell populations (Supplementary Fig. ^2a).

We next evaluate the performance of each bin size selection using three metrics: (1) cluster connectivity index (CI), which estimate the degree of connectedness of the landmark cell types; a lower CI represents a better separation. The connectivity index is computed in the following manner. For each cell i, the K (K = 15) nearest neighbors are found and sorted from the closest to furthest. The algorithm checks if those neighbors are assigned to the same cluster with cell i. At the beginning, connectivity value equals 0 and increases with value 1/i when the ith nearest neighbors is not assigned to the same cluster with cell i. This procedure is repeated for all cells in the dataset. In general, the higher the connectivity index is, the less separated the defined landmark cell types are. The connectivity index is computed using “connectivity” function implemented in R package clv. (2) coverage bias, which estimates the read depth distribution in the two-dimensional embedding space; (3) sensitivity to identify rare populations. Through systematic benchmarking, we found that bin size in the range from 1 to 10 kb appeared to work well on the three benchmarks, we selected 5 kb as the default bin width for all the analysis in this work (Supplementary Fig. ² and Methods section).

We found that the vast majority of the elements in the cell-by-bin count matrix is “0”, indicating either closed chromatin or missing value. Among the non-zero elements, some has abnormally high coverage (>200) perhaps due to the alignment errors or other unknown reasons. These items usually account for less than 0.1% of total non-zero items in the matrix. Thus, we change the top 0.1% elements in the matrix to “0” to eliminate potential alignment errors. We next convert the remaining non-zero elements to “1”.

We next filter out any bins overlapping with the ENCODE blacklist downloaded from http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/ for corresponding reference genome. Second, we remove reads mapped to the X/Y chromosomes to eliminate sex effect. Third, we remove mitochondrial DNA to get rid of potential contamination. We next sort the bins based on the coverage and filter out the top 5% to remove the invariant features. Please note that we do not perform coverage-based bin filtering for a dataset that has low coverage (average fragment number less than 5000) where the ranking of bin may be fluctuated by the noise.

We next apply the following dimensionality reduction procedure to project the high-dimension data to a low-dimension manifold for clustering and visualization. Let $X \in R^{n \times m}$ be a dataset with n cells and m bins and $X = {0, 1}$ . The first step is to compute a similarity matrix between the m high-dimensional data points to construct the n-by-n pairwise similarity matrix using a kernel function kn that is an appropriate similarity metric. A popular choice is gaussian kernel:

where $d (x_{i}, x_{j})$ is the Euclidean distance between observations i and j.

Due the binarization nature of single-cell ATAC-seq dataset, in this case, we replace the Gaussian kernel with Jaccard coefficient, which estimates the similarity between cells simply based on ratio of overlap over the total union:

The Jaccard coefficient, which is symmetric and positivity preserving meets the requirement of being a kernel function.

Using jaccard as a kernel function, we next form a symmetric kernel matrix $J \in R^{n \times n}$ where each entry is obtained as $J_{i, j} = jaccard (x_{i}, x_{j})$ .

Theoretically, the similarity J_i,j would reflect the true similarity between cell x_i and x_j, but unfortunately, due to the high-dropout rate, this is not the case. If there is a high sequencing depth for cell x_i or x_j, then J_i,j tend to have higher values, regardless whether cell x_i and x_j is actually similar or not.

This can be proved theoretically. Given 2 cells x_i and x_j and corresponding coverage (number of “1”s) $C_{i} = \sum_{k}^{m} x_{i k}$ and $C_{j} = \sum_{k}^{m} x_{j k}$ , let $P_{i} = C_{i} / m$ and $P_{j} = C_{j} / m$ be the probability of observing a signal in cell x_i and x_j where m is the length of the vector. Assuming x_i and x_j are two “random” cells without any biological relevance, in another word, the “1”s in x_i and x_j are randomly distributed, then the ratio of expectation between cell x_i and x_j can be calculated as:

Although the ratio of expectations does not in general equal the expectation of the ratio, the two are approximately equal in this case because the coefficient of variation is much less than 1 for both the numerator and the denominator. The increase of either P_i or P_j will result in an increase of E_ij, which suggests the Jaccard similarity between cells is highly affected by the read depth. Such observation prompts us to develop an ad hoc normalization method to eliminate the read depth effect.

To learn the relationship between the E_ij and J_ij from the data, we next fit a curve to predict the observed Jaccard coefficient J_ij as a function of its expected value E_ij by fitting a polynomials regression of degree 2 using R function lm. Theoretically, E_ij should be linear with J_ij if cells are completely random, but in real dataset, we have observed a nonlinearity between E_ij and J_ij especially among the high-coverage cells. We suspect, to some extent, the degree of randomness of fragment distribution in a single cell is associated with the coverage. To better model the nonlinearity, we include a second order polynomial in our model:

This fitting provided estimators of parameters ${\hat{β_{0}}, \hat{β_{1}}, \hat{β_{2}}}$ . As such, we next use it to normalize the observed Jaccard coefficient by:

The fitting of the linear regression, however, can be time consuming with a large matrix. Here we test the possibility of performing this step on a random subset of y cells in lieu of the full matrix. When selecting a subset of y cells to speed up the first step, we do not select cells at random with a uniform sampling probability. Instead, we set the probability of selecting a cell i to

where d is the density estimate of all log10-transformed cell fragment count and C_i is the number of fragments in cell i and $C_{i} = \sum_{k}^{m} x_{i k}$ . Similar approach was first introduced in SCTranscform^⁵⁰ to speed up the normalization of single-cell RNA-seq.

We then proceed to normalize the full Jaccard coefficient matrix $J \in R^{n \times n}$ using the regression model learned from y cells and compared the results to the case where all cells are used in the initial estimation step as well. We use the correlation of normalized Jaccard coefficient to compare this partial analysis to the full analysis. We observe that using as few as 2000 cells in the estimation gave rise to virtually identical estimates. We therefore use 2000 cells in the initial model-fitting step. To remove outliers in the normalized similarity, we use the 0.99 quantile to cap the maximum value of the normalized matrix.

Next, using normalized Jaccard coefficient matrix N, we normalize the matrix by:

where $D \in R^{n \times n}$ is a diagonal matrix, which is composed as $D_{i, i} = \sum_{j} N_{i, j}$ . We next perform eigenvector decomposition against A.

The columns $φ_{i} \in R^{n}$ of $U \in R^{n \times n}$ are the eigenvectors. The diagonal matrix $Λ \in R^{n \times n}$ has the eigenvalues $λ_{1} \geq λ_{2} \geq \dots \geq 0$ in descending order as its entries. Finally, we report the first r eigenvectors as the final low-dimension manifold.

To assess the performance of normalization of SnapATAC we processed three datasets. As shown in Supplementary Fig. ³, before normalization, SnapATAC exhibits a strong gradient that is correlated with sequencing depth within the cluster (Supplementary Fig. ^3a). Although the sequencing depth effect is still observed in some of the small clusters, it is clear that the normalization has largely eliminated the read depth effect as compared to the unnormalized ones (Supplementary Fig. ^3b).

To better quantify the coverage bias, we next computed the Shannon entropy that estimates the “uniformness” of the distribution of cell coverage in the UMAP embedding space. In detail, we first chose the top 10% cells of the highest coverage as “high-coverage” cells. Second, in the 2D UMAP embedding space, we discretize “high-coverage” cells from a continuous random coordinate (umap1, umap2) into bins (n = 50) and returns the corresponding vector of counts. This is done using a function called “discretize2d” in the “entropy” R package. Third, we estimated the Shannon entropy of the random variable from the corresponding observed counts. This is done using function “entropy” in the “entropy” R package. A higher entropy indicates that the “high-coverage” cells are more uniformly distributed in the UMAP embedding space, overall suggesting a better normalization performance.

We next examine another eight possible sources of biases by projecting to the UMAP embedding space, some metrics show cluster specificity for all three methods perhaps due to biological relevance, but all three methods can reveal significant biological heterogeneity without exhibiting substantial intracluster bias for any metrics examined (Supplementary Fig. ⁴).

When the technical variability is at a larger scale than the biological variability, we apply batch effect corrector—Harmony^²⁴—to eliminate such confounding factor. Given two datasets generated using different technologies, we first calculate the joint low-dimension manifold as described above. We next apply Harmony to regress out the batch effect, resulting in a new harmonized coembedding. This is implemented as a function “runHarmony” in SnapATAC package.

We next determine how many eigenvectors to include for the downstream analysis. Here we use an ad hoc approach for choosing the optimal number of components. We look at the scatter plot between every two pairs of eigenvectors and choose the number of eigenvectors that start exhibiting “blob”-like structure in which no obvious biological structure is revealed.

The computational cost of the dimensionality reduction scales quadratically with the increase of number of cells. For instance, calculating and normalizing the pairwise kernel Matrix N becomes computationally infeasible for large-scale dataset. To overcome this limitation, here we apply Nyström method^²¹,⁵¹ to calculate the low-dimensional embedding for large-scale dataset.

A Nyström algorithm can be divided into three major steps: (i) sampling: sample a subset of K ( $K ≪ N$ ) cells from N total cells as “landmarks”. Instead of random sampling, here we adopt a density-based sampling approach^⁵⁰ to preserve the density distribution of the N original points; (ii) embedding: compute the low-dimension embedding for K landmarks; (iii) extension: project the remaining N–K cells onto the low-dimensional embedding as learned from the landmarks to create a joint embedding space for all cells.

This approach significantly reduces the computational complexity and memory usage given that K is considerably smaller than N. The out-of-sample extension (step iii) further enables projection of new single-cell ATAC-seq datasets to the existing reference single-cell atlas. This allows us to further develop a supervised approach to predict cell types of a new single-cell ATAC-seq dataset based on an existing reference atlas.

A key aspect of this method is the procedure according to which cells are sampled as landmark cells, because different sampled landmark cells give different approximations of the original embedding using full matrix. Here we employ the density-based sampling as described above, which preserves the density distribution of the original points.

Let $X \in R^{n \times m}$ be a dataset with n cells and m variables (bins) and $N \in R^{n \times n}$ be a symmetric kernel matrix calculated using normalized Jaccard coefficient. To avoid calculating the pairwise kernel matrix and performing eigen-decomposition against a big matrix $N \in R^{n \times n}$ , we first sample $k (k ≪ n)$ landmarks without replacement. This breaks down the original kernel matrix $N \in R^{n \times n}$ into four components.

in which $N^{k k} \in R^{k \times k}$ is the pairwise kernel matrix between k landmarks and $N^{v k} \in R^{(n - k) \times k}$ is the similarity matrix between $(n - k)$ cells and k landmarks. Using N^kk, we perform dimensionality reduction to obtain the r-rank manifold $U^{k k} \in R^{k \times r}$ as described above.

Using N^vk, which estimates the similarity between n–k cells and k landmark cells, we project the rest of n–k cells to the embedding previously obtained using k landmark:

where $D^{v v} \in R^{(n - k) \times (n - k)}$ is a diagonal matrix which is composed as $D_{i, i}^{v k} = \sum_{j} N_{i, j}^{v k}$ . The projected coordinates of the new points onto the r-dimensional intrinsic manifold defined by the landmarks are then given by

The resulting $U^{v k} \in R^{(n - k) \times r}$ is the approximate r-rank low-dimension representation of the rest n–k cells. Combing U^kk and U^vk creates a joint embedding space for all cells:

In the approximate joint r-rank embedding space $\tilde{U}$ , we next create a k-nearest neighbor (KNN) graph in which every cell is represented as a node and edges are drawn between cells within k-nearest neighbors defined using Euclidean distance. Finally, we apply community finding algorithm such as Louvain (implemented by igraph package in R) to identify the ‘communities’ in the resulting graph, which represents groups of cells sharing similar profiles, potentially originating from the same cell type.

To evaluate the effect of the number of landmarks, we apply our method to a complex dataset that contains over 80k cells from 13 different mouse tissues. We employ the following three metrics to evaluate the performance. First, using different number of landmarks (k) ranging from 1000 to 10,000, we compare the clustering outcome to the cell-type label defined in the original study. The goal of this is to identify the “elbow” point that performance drops abruptly. Second, for each sampling, we repeat for five times using different set of landmarks to evaluate stability between sampling. Third, we spiked-in 1% Patski cells to assess the sensitivity of identifying rare cell types. We choose Patski cells because these cells were profiled using the same protocol by the same group (Data source listed in Supplementary Table ¹) to minimize the batch effect.

We observe that using as few as 5000 landmarks can largely recapitulate the result obtained using 10,000 landmarks (Supplementary Fig. ^5a), and 10,000 landmarks can achieve highly robust embedding between sampling (Supplementary Fig. ^5b) and successfully recover spiked-in rare populations (Supplementary Fig. ^5c). To obtain a reliable low-dimensional embedding, we use 10,000 landmarks for all the analysis performed in this study.

Nyström is stochastic in its nature, different sampling will result in different embedding and clustering outcome. To improve the robustness of the clustering method, we next employ Ensemble Nyström Algorithm which combines a mixture of Nyström approximation to create an ensemble representation^⁵². Supported by theoretical analysis, this Ensemble approach has been demonstrated to guarantee a convergence and in a faster rate in comparison to standard Nyström method^⁵². Moreover, this ensemble algorithm naturally fits within distributed computing environments, where their computational costs are roughly the same as that of the standard Nyström single sampling method.

We treat each approximation generated by the Nyström method using k landmarks as an expert and combined p ≥ 1 such experts to derive an improved approximation, typically more accurate than any of the original experts^⁵².

The ensemble setup is defined as follows. Given a dataset $X \in R^{n \times m}$ of n cells. Each expert S_j receives k landmarks randomly selected from matrix X using density-based sampling approach without replacement. Each expert S_r, $r \in [1, p]$ is then used to define the low-dimension embedding ${\tilde{U}}_{j} \in R^{n \times r}$ as described above. For each low-dimension embedding ${\tilde{U}}_{j} \in R^{n \times r}$ , we create a KNN-graph as ${\tilde{G}}_{j}$ . Thus, the general form of the approximation, ${\tilde{G}}^{en}$ , generated by the ensemble Nyström method is

Here we choose to use the most straightforward method by assigning an equal weight to each of the KNN-graph obtained from different samplings, $μ^{j} = 1 / p, r \in [1, p]$ . While this choice ignores the relative quality of each Nyström approximation, it is computational efficient and already generates a solution superior to any one of the approximations used in the combination. Using the ensemble weighted KNN-graph ${\tilde{G}}^{en}$ , we next apply community finding algorithm to identify cell clusters. By testing on the mouse atlas dataset^⁸, we demonstrate that the clustering stability of the ensemble approach is significantly higher than the standard Nyström method (Supplementary Fig. ^5d).

We use the t-SNE implemented by FI-tsne, Rtsne or UMAP (umap_0.2.0.0) to visualize and explore the dataset.

To annotate the identified clusters, SnapATAC calculated the gene-body accessibility matrix G using “calGmatFromMat” function in SnapATAC packge where G_i,j is the number of fragments overlapping with jth genes in i-th cell. G_i,j is then normalized to CPM (count-per-million reads) as $\tilde{G}$ . The normalized accessibility score is then smoothed using Markov affinity-graph-based method:

where A is the adjacent matrix obtained from K-nearest neighbor graph and t is number of steps taken for Markov diffusion process. We set t = 3 in this study. Please note that the gene accessibility score is only used to guide the annotation of cell clusters identified using cell-by-bin matrix. The clusters are identified using cell-by-bin matrix in prior.

After annotation, cells from the same cluster are pooled to create aggregated signal for each of the identified cell types. This allows for identifying cis-elements from each cluster. MACS2 (version 2.1.2) is used for generating signal tracks and peak calling with the following parameters: “–nomodel–shift 100–ext 200–qval 1e-2 -B –SPMR”. This can be done by “runMACS” function in SnapATAC package.

SnapATAC incorporates chromVAR^¹⁵ to estimate the motif variability and Homer^²² for de novo motif discovery. This is implemented as function “runChromVAR” and “runHomer” in SnapATAC package.

For a given group of cells C_i, we first look for their neighboring cells C_j ( $∣ C_{i} ∣ = ∣ C_{j} ∣$ ) in the low-dimension manifold as “background” cells to compare to. If C_iaccounts for more than half of the total cells, we use the remaining cells as local background. Next, we aggregate C_i and C_j to create two raw-count vectors as V_ci and V_cj. We then perform differential analysis between V_ci and V_cj using exact test as implemented in R package edgeR (v3.18.1) with BCV = 0.1. P-value is then adjusted into False Discovery Rate (FDR) using Benjamini–Hochberg correction. Peaks with FDR less than 0.05 are selected as significant DARs.

SnapATAC incorporates GREAT analysis^²⁷ to infer the candidate biological pathway active in each cell populations. This is implemented as function “runGREAT” SnapATAC package.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol