SnapATAC pipeline workflow

RF Rongxin Fang
SP Sebastian Preissl
YL Yang Li
XH Xiaomeng Hou
JL Jacinta Lucero
XW Xinxin Wang
AM Amir Motamedi
AS Andrew K. Shiau
XZ Xinzhu Zhou
FX Fangming Xie
EM Eran A. Mukamel
KZ Kai Zhang
YZ Yanxiao Zhang
MB M. Margarita Behrens
JE Joseph R. Ecker
BR Bing Ren
request Request a Protocol
ask Ask a question
Favorite

Using a custom python script, we first perform FASTQ file demulticomplexing by integrating the cell barcode into the read name in the following format: @+barcode+:+original_read_name.

Demultiplexed sequencing reads are aligned to the corresponding reference genome (i.e., mm10 or hg19) using bwa (0.7.13-r1126) in pair-end mode with default parameter settings. Aligned reads are then sorted based on the read name using samtools (v1.9) to group together reads originating from the same barcodes.

Pair-end reads are converted into fragments and only those that meet the following criteria are kept: (1) properly paired (according to SMA flag value); (2) uniquely mapped (MAPQ > 30); (3) insert distance within [50–1000 bp]. PCR duplicates (fragments sharing the same genomic coordinates) are removed for each cell separately. Given that Tn5 introduces a 9 bp staggered, reads mapping to the positive and negative strand were shifted by +4/−5 bp, respectively48.

We identify the high-quality cells based on two criteria: (1) total number of unique fragment count [>1000]; (2) fragments in promoter ratio—the percentage of fragments overlapping with annotated promoter regions [0.2–0.8]. The promoter regions used in this study are downloaded from 10× genomics for hg19 and mm10.

Using the remaining fragments, we next generate a snap-format (Single-Nucleus Accessibility Profiles) file using snaptools (https://github.com/r3fang/SnapTools). A snap file is a hierarchically structured hdf5 file that contains the following sections: header (HD), cell-by-bin matrix (BM), cell-by-peak matrix (PM), cell-by-gene matrix (GM), barcode (BD), and fragment (FM). HD session contains snap-file version, date, alignment and reference genome information. BD session contains all unique barcodes and corresponding meta data. BM session contains cell-by-bin matrices of different bin sizes. PM session contains cell-by-peak count matrix. GM session contains cell-by-gene count matrix. FM session contains all usable fragments for each cell. Fragments are indexed based on barcodes that enables fast retrieval of reads belonging to the same barcodes. Detailed information about snap file can be found here: https://github.com/r3fang/SnapTools/blob/master/docs/snap_format.docx.

Using the resulting snap file, we next create cell-by-bin count matrix. The genome is segmented into uniform-sized bins and single-cell ATAC-seq profiles are represented as cell-by-bin matrix with each element indicating number of sequencing fragments overlapping with a given bin in a certain cell.

To evaluate the effect of bin size to clustering performance, we apply SnapATAC to three datasets namely 5 K PBMC (10×), Mouse Brain (10×), and MOs-M1 (snATAC) (data source listed in Supplementary Table 1). These datasets are generated by both plate and droplet platforms using either cell or nuclei with considerably different depth, allowing us to systematically evaluate the effect of bin size.

For each dataset, we define the “landmark” cell types in a supervised manner. First, we perform cis-Topic16 for dimensionality reduction and identify cell clusters using graph-based algorithm Louvain49 with k = 15. Second, we manually define the major cell types in each dataset by examining the gene accessibility score at the canonical marker genes (see Supplementary Fig. 9 as an example for MOs-M1). Third, clusters sharing the same marker genes are manually merged and those failing to show unique signatures are discarded. In total, we define nine most convincing cell types in PBMC 5 K (10×), 14 types in Mouse Brain 5 K (10×) and 14 types in MOs-M1 (snATAC). Among these cell types, 14 cell populations that account for less than 2% of the total population are considered as rare cell populations (Supplementary Fig. 2a).

We next evaluate the performance of each bin size selection using three metrics: (1) cluster connectivity index (CI), which estimate the degree of connectedness of the landmark cell types; a lower CI represents a better separation. The connectivity index is computed in the following manner. For each cell i, the K (K = 15) nearest neighbors are found and sorted from the closest to furthest. The algorithm checks if those neighbors are assigned to the same cluster with cell i. At the beginning, connectivity value equals 0 and increases with value 1/i when the ith nearest neighbors is not assigned to the same cluster with cell i. This procedure is repeated for all cells in the dataset. In general, the higher the connectivity index is, the less separated the defined landmark cell types are. The connectivity index is computed using “connectivity” function implemented in R package clv. (2) coverage bias, which estimates the read depth distribution in the two-dimensional embedding space; (3) sensitivity to identify rare populations. Through systematic benchmarking, we found that bin size in the range from 1 to 10 kb appeared to work well on the three benchmarks, we selected 5 kb as the default bin width for all the analysis in this work (Supplementary Fig. 2 and Methods section).

We found that the vast majority of the elements in the cell-by-bin count matrix is “0”, indicating either closed chromatin or missing value. Among the non-zero elements, some has abnormally high coverage (>200) perhaps due to the alignment errors or other unknown reasons. These items usually account for less than 0.1% of total non-zero items in the matrix. Thus, we change the top 0.1% elements in the matrix to “0” to eliminate potential alignment errors. We next convert the remaining non-zero elements to “1”.

We next filter out any bins overlapping with the ENCODE blacklist downloaded from http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/ for corresponding reference genome. Second, we remove reads mapped to the X/Y chromosomes to eliminate sex effect. Third, we remove mitochondrial DNA to get rid of potential contamination. We next sort the bins based on the coverage and filter out the top 5% to remove the invariant features. Please note that we do not perform coverage-based bin filtering for a dataset that has low coverage (average fragment number less than 5000) where the ranking of bin may be fluctuated by the noise.

We next apply the following dimensionality reduction procedure to project the high-dimension data to a low-dimension manifold for clustering and visualization. Let XRn×m be a dataset with n cells and m bins and X={0,1}. The first step is to compute a similarity matrix between the m high-dimensional data points to construct the n-by-n pairwise similarity matrix using a kernel function kn that is an appropriate similarity metric. A popular choice is gaussian kernel:

where d(xi,xj) is the Euclidean distance between observations i and j.

Due the binarization nature of single-cell ATAC-seq dataset, in this case, we replace the Gaussian kernel with Jaccard coefficient, which estimates the similarity between cells simply based on ratio of overlap over the total union:

The Jaccard coefficient, which is symmetric and positivity preserving meets the requirement of being a kernel function.

Using jaccard as a kernel function, we next form a symmetric kernel matrix JRn×n where each entry is obtained as Ji,j=jaccard(xi,xj).

Theoretically, the similarity Ji,j would reflect the true similarity between cell xi and xj, but unfortunately, due to the high-dropout rate, this is not the case. If there is a high sequencing depth for cell xi or xj, then Ji,j tend to have higher values, regardless whether cell xi and xj is actually similar or not.

This can be proved theoretically. Given 2 cells xi and xj and corresponding coverage (number of “1”s) Ci=kmxik and Cj=kmxjk, let Pi=Ci/m and Pj=Cj/m be the probability of observing a signal in cell xi and xj where m is the length of the vector. Assuming xi and xj are two “random” cells without any biological relevance, in another word, the “1”s in xi and xj are randomly distributed, then the ratio of expectation between cell xi and xj can be calculated as:

Although the ratio of expectations does not in general equal the expectation of the ratio, the two are approximately equal in this case because the coefficient of variation is much less than 1 for both the numerator and the denominator. The increase of either Pi or Pj will result in an increase of Eij, which suggests the Jaccard similarity between cells is highly affected by the read depth. Such observation prompts us to develop an ad hoc normalization method to eliminate the read depth effect.

To learn the relationship between the Eij and Jij from the data, we next fit a curve to predict the observed Jaccard coefficient Jij as a function of its expected value Eij by fitting a polynomials regression of degree 2 using R function lm. Theoretically, Eij should be linear with Jij if cells are completely random, but in real dataset, we have observed a nonlinearity between Eij and Jij especially among the high-coverage cells. We suspect, to some extent, the degree of randomness of fragment distribution in a single cell is associated with the coverage. To better model the nonlinearity, we include a second order polynomial in our model:

This fitting provided estimators of parameters {β0^,β1^,β2^}. As such, we next use it to normalize the observed Jaccard coefficient by:

The fitting of the linear regression, however, can be time consuming with a large matrix. Here we test the possibility of performing this step on a random subset of y cells in lieu of the full matrix. When selecting a subset of y cells to speed up the first step, we do not select cells at random with a uniform sampling probability. Instead, we set the probability of selecting a cell i to

where d is the density estimate of all log10-transformed cell fragment count and Ci is the number of fragments in cell i and Ci=kmxik. Similar approach was first introduced in SCTranscform50 to speed up the normalization of single-cell RNA-seq.

We then proceed to normalize the full Jaccard coefficient matrix JRn×n using the regression model learned from y cells and compared the results to the case where all cells are used in the initial estimation step as well. We use the correlation of normalized Jaccard coefficient to compare this partial analysis to the full analysis. We observe that using as few as 2000 cells in the estimation gave rise to virtually identical estimates. We therefore use 2000 cells in the initial model-fitting step. To remove outliers in the normalized similarity, we use the 0.99 quantile to cap the maximum value of the normalized matrix.

Next, using normalized Jaccard coefficient matrix N, we normalize the matrix by:

where DRn×n is a diagonal matrix, which is composed as Di,i=jNi,j. We next perform eigenvector decomposition against A.

The columns φiRn of URn×n are the eigenvectors. The diagonal matrix ΛRn×n has the eigenvalues λ1λ20 in descending order as its entries. Finally, we report the first r eigenvectors as the final low-dimension manifold.

To assess the performance of normalization of SnapATAC we processed three datasets. As shown in Supplementary Fig. 3, before normalization, SnapATAC exhibits a strong gradient that is correlated with sequencing depth within the cluster (Supplementary Fig. 3a). Although the sequencing depth effect is still observed in some of the small clusters, it is clear that the normalization has largely eliminated the read depth effect as compared to the unnormalized ones (Supplementary Fig. 3b).

To better quantify the coverage bias, we next computed the Shannon entropy that estimates the “uniformness” of the distribution of cell coverage in the UMAP embedding space. In detail, we first chose the top 10% cells of the highest coverage as “high-coverage” cells. Second, in the 2D UMAP embedding space, we discretize “high-coverage” cells from a continuous random coordinate (umap1, umap2) into bins (n = 50) and returns the corresponding vector of counts. This is done using a function called “discretize2d” in the “entropy” R package. Third, we estimated the Shannon entropy of the random variable from the corresponding observed counts. This is done using function “entropy” in the “entropy” R package. A higher entropy indicates that the “high-coverage” cells are more uniformly distributed in the UMAP embedding space, overall suggesting a better normalization performance.

We next examine another eight possible sources of biases by projecting to the UMAP embedding space, some metrics show cluster specificity for all three methods perhaps due to biological relevance, but all three methods can reveal significant biological heterogeneity without exhibiting substantial intracluster bias for any metrics examined (Supplementary Fig. 4).

When the technical variability is at a larger scale than the biological variability, we apply batch effect corrector—Harmony24—to eliminate such confounding factor. Given two datasets generated using different technologies, we first calculate the joint low-dimension manifold as described above. We next apply Harmony to regress out the batch effect, resulting in a new harmonized coembedding. This is implemented as a function “runHarmony” in SnapATAC package.

We next determine how many eigenvectors to include for the downstream analysis. Here we use an ad hoc approach for choosing the optimal number of components. We look at the scatter plot between every two pairs of eigenvectors and choose the number of eigenvectors that start exhibiting “blob”-like structure in which no obvious biological structure is revealed.

The computational cost of the dimensionality reduction scales quadratically with the increase of number of cells. For instance, calculating and normalizing the pairwise kernel Matrix N becomes computationally infeasible for large-scale dataset. To overcome this limitation, here we apply Nyström method21,51 to calculate the low-dimensional embedding for large-scale dataset.

A Nyström algorithm can be divided into three major steps: (i) sampling: sample a subset of K (KN) cells from N total cells as “landmarks”. Instead of random sampling, here we adopt a density-based sampling approach50 to preserve the density distribution of the N original points; (ii) embedding: compute the low-dimension embedding for K landmarks; (iii) extension: project the remaining NK cells onto the low-dimensional embedding as learned from the landmarks to create a joint embedding space for all cells.

This approach significantly reduces the computational complexity and memory usage given that K is considerably smaller than N. The out-of-sample extension (step iii) further enables projection of new single-cell ATAC-seq datasets to the existing reference single-cell atlas. This allows us to further develop a supervised approach to predict cell types of a new single-cell ATAC-seq dataset based on an existing reference atlas.

A key aspect of this method is the procedure according to which cells are sampled as landmark cells, because different sampled landmark cells give different approximations of the original embedding using full matrix. Here we employ the density-based sampling as described above, which preserves the density distribution of the original points.

Let XRn×m be a dataset with n cells and m variables (bins) and NRn×n be a symmetric kernel matrix calculated using normalized Jaccard coefficient. To avoid calculating the pairwise kernel matrix and performing eigen-decomposition against a big matrix NRn×n, we first sample k(kn) landmarks without replacement. This breaks down the original kernel matrix NRn×n into four components.

in which NkkRk×k is the pairwise kernel matrix between k landmarks and NvkR(nk)×k is the similarity matrix between (nk) cells and k landmarks. Using Nkk, we perform dimensionality reduction to obtain the r-rank manifold UkkRk×r as described above.

Using Nvk, which estimates the similarity between nk cells and k landmark cells, we project the rest of nk cells to the embedding previously obtained using k landmark:

where DvvR(nk)×(nk) is a diagonal matrix which is composed as Di,ivk=jNi,jvk. The projected coordinates of the new points onto the r-dimensional intrinsic manifold defined by the landmarks are then given by

The resulting UvkR(nk)×r is the approximate r-rank low-dimension representation of the rest nk cells. Combing Ukk and Uvk creates a joint embedding space for all cells:

In the approximate joint r-rank embedding space U~, we next create a k-nearest neighbor (KNN) graph in which every cell is represented as a node and edges are drawn between cells within k-nearest neighbors defined using Euclidean distance. Finally, we apply community finding algorithm such as Louvain (implemented by igraph package in R) to identify the ‘communities’ in the resulting graph, which represents groups of cells sharing similar profiles, potentially originating from the same cell type.

To evaluate the effect of the number of landmarks, we apply our method to a complex dataset that contains over 80k cells from 13 different mouse tissues. We employ the following three metrics to evaluate the performance. First, using different number of landmarks (k) ranging from 1000 to 10,000, we compare the clustering outcome to the cell-type label defined in the original study. The goal of this is to identify the “elbow” point that performance drops abruptly. Second, for each sampling, we repeat for five times using different set of landmarks to evaluate stability between sampling. Third, we spiked-in 1% Patski cells to assess the sensitivity of identifying rare cell types. We choose Patski cells because these cells were profiled using the same protocol by the same group (Data source listed in Supplementary Table 1) to minimize the batch effect.

We observe that using as few as 5000 landmarks can largely recapitulate the result obtained using 10,000 landmarks (Supplementary Fig. 5a), and 10,000 landmarks can achieve highly robust embedding between sampling (Supplementary Fig. 5b) and successfully recover spiked-in rare populations (Supplementary Fig. 5c). To obtain a reliable low-dimensional embedding, we use 10,000 landmarks for all the analysis performed in this study.

Nyström is stochastic in its nature, different sampling will result in different embedding and clustering outcome. To improve the robustness of the clustering method, we next employ Ensemble Nyström Algorithm which combines a mixture of Nyström approximation to create an ensemble representation52. Supported by theoretical analysis, this Ensemble approach has been demonstrated to guarantee a convergence and in a faster rate in comparison to standard Nyström method52. Moreover, this ensemble algorithm naturally fits within distributed computing environments, where their computational costs are roughly the same as that of the standard Nyström single sampling method.

We treat each approximation generated by the Nyström method using k landmarks as an expert and combined p ≥ 1 such experts to derive an improved approximation, typically more accurate than any of the original experts52.

The ensemble setup is defined as follows. Given a dataset XRn×m of n cells. Each expert Sj receives k landmarks randomly selected from matrix X using density-based sampling approach without replacement. Each expert Sr, r[1,p] is then used to define the low-dimension embedding U~jRn×r as described above. For each low-dimension embedding U~jRn×r, we create a KNN-graph as G~j. Thus, the general form of the approximation, G~en, generated by the ensemble Nyström method is

Here we choose to use the most straightforward method by assigning an equal weight to each of the KNN-graph obtained from different samplings, μj=1/p,r[1,p]. While this choice ignores the relative quality of each Nyström approximation, it is computational efficient and already generates a solution superior to any one of the approximations used in the combination. Using the ensemble weighted KNN-graph G~en, we next apply community finding algorithm to identify cell clusters. By testing on the mouse atlas dataset8, we demonstrate that the clustering stability of the ensemble approach is significantly higher than the standard Nyström method (Supplementary Fig. 5d).

We use the t-SNE implemented by FI-tsne, Rtsne or UMAP (umap_0.2.0.0) to visualize and explore the dataset.

To annotate the identified clusters, SnapATAC calculated the gene-body accessibility matrix G using “calGmatFromMat” function in SnapATAC packge where Gi,j is the number of fragments overlapping with jth genes in i-th cell. Gi,j is then normalized to CPM (count-per-million reads) as G~. The normalized accessibility score is then smoothed using Markov affinity-graph-based method:

where A is the adjacent matrix obtained from K-nearest neighbor graph and t is number of steps taken for Markov diffusion process. We set t = 3 in this study. Please note that the gene accessibility score is only used to guide the annotation of cell clusters identified using cell-by-bin matrix. The clusters are identified using cell-by-bin matrix in prior.

After annotation, cells from the same cluster are pooled to create aggregated signal for each of the identified cell types. This allows for identifying cis-elements from each cluster. MACS2 (version 2.1.2) is used for generating signal tracks and peak calling with the following parameters: “–nomodel–shift 100–ext 200–qval 1e-2 -B –SPMR”. This can be done by “runMACS” function in SnapATAC package.

SnapATAC incorporates chromVAR15 to estimate the motif variability and Homer22 for de novo motif discovery. This is implemented as function “runChromVAR” and “runHomer” in SnapATAC package.

For a given group of cells Ci, we first look for their neighboring cells Cj (Ci=Cj) in the low-dimension manifold as “background” cells to compare to. If Ciaccounts for more than half of the total cells, we use the remaining cells as local background. Next, we aggregate Ci and Cj to create two raw-count vectors as Vci and Vcj. We then perform differential analysis between Vci and Vcj using exact test as implemented in R package edgeR (v3.18.1) with BCV = 0.1. P-value is then adjusted into False Discovery Rate (FDR) using Benjamini–Hochberg correction. Peaks with FDR less than 0.05 are selected as significant DARs.

SnapATAC incorporates GREAT analysis27 to infer the candidate biological pathway active in each cell populations. This is implemented as function “runGREAT” SnapATAC package.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A