Preprocessing of data, weighted nearest neighbors (WNNs) and smoothing

Chen Li; Maria C. Virgilio; Kathleen L. Collins; Joshua D. Welch

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

Preprocessing of data, weighted nearest neighbors (WNNs) and smoothing

CL Chen Li

MV Maria C. Virgilio

KC Kathleen L. Collins

JW Joshua D. Welch

This method is extracted from research article: Nat Biotechnol, Oct 2022

Multi-omic single-cell velocity models epigenome–transcriptome interactions and improves cell fate prediction

DOI: 10.1038/s41587-022-01476-y

Request a Protocol

Ask a question

Favorite

Filtered expression matrix for ATAC-seq, feature linkage file, as well as position-sorted RNA alignment (BAM) file of E18 mouse embryonic brain data of around 5,000 cells were downloaded from 10x Genomics website (CellRanger ARC 1.0.0). Total, unspliced and spliced RNA reads were separately quantified using the Velocyto run10x command. The resulting loom file was read into python as an AnnData object and preprocessed with scanpy and scVelo to perform filtering, normalization and nearest neighbor assignment. Next, clusters were computed using the Leiden⁵⁵ algorithm. Cell types were manually annotated based on expression of known marker genes^56–59. We then excluded interneurons, Cajal-Retzius and microglia cell populations for our downstream analyses, because these cell types are not actively differentiating. We then reprocessed the raw counts of subset clusters, which consists of more than 3,000 remaining cells, with scVelo. The unspliced and spliced reads were neighborhood smoothed (averaged) by scVelo’s pp.moments method with 30 principal components among 50 neighbors. The downloaded feature linkage file contains correlation information for gene-peak pairs of genomic features across cells. We first collected all distal putative enhancer peaks (not in promoter or gene body regions) with ≥0.5 correlation with either promoter accessibility or gene expression that were annotated to the same gene or within 10 kilobase pairs of that gene. We then aggregated these enhancer peaks with 10x annotated promoter peaks for the corresponding genes, as a single chromatin accessibility modality to boost chromatin signal. These aggregated accessibility values were then normalized using the term frequency-inverse document frequency (TF-IDF) method²⁸. (Note that during fitting, chromatin values are normalized to [0, 1], so using other total-count based normalization will produce identical results.)

Due to the increased sparsity of ATAC-seq data, the neighborhood graph and clustering results based solely on peaks is often noisy and unreliable. Seurat group recently developed a method to compute neighborhood assignments for simultaneously measured multi-modality data in the Seurat V4 toolkit, which they called weighted nearest neighbor, or WNN⁶⁰. The WNN method learns weights of each cell in either modality based on its predictive power by neighboring cells in each of the modalities, so that both RNA and ATAC information can be incorporated when assigning neighbors. We used 50 WNNs obtained from Seurat for each cell to smooth the aggregated and normalized chromatin peak values. Our WNN analysis followed the recommended steps in Seurat V4 vignette for 10x RNA + ATAC. We thus obtained three matrices containing chromatin accessibility, unspliced and spliced counts. Shared cell barcodes and genes were filtered among matrices and resulted in 3,365 cells and 936 highly variable genes, and these matrices were then used for dynamical modeling.

The quantified ATAC-seq expression matrix, raw ATAC-seq fragments file and cell annotations of SHARE-seq mouse skin dataset⁹ were downloaded from the Gene Expression Omnibus (GEO). The RNA alignment BAM file as well as UMAP coordinates for TAC, IRS, medulla and hair shaft cuticle/cortex cell populations used in the SHARE-seq manuscript were obtained directly from the authors. We run Velocyto to quantify unspliced and spliced counts, and the RNA AnnData object was further preprocessed with scanpy/scVelo for the four cell types of interest. In R, the chromatin fragment file was used to construct a gene activity matrix by aggregating peaks onto gene coordinates using the GeneActivity function in Signac⁶¹. Domain of regulatory chromatin (DORC) is defined as chromatin regions that contain clusters of peaks that are highly correlated with gene expressions in SHARE-seq’s analysis. A list of computed DORCs coordinates was downloaded from its supplementary material section. These coordinates were output to the bed format, and we extracted fragments together with their corresponding cell barcodes that overlap with these DORCs regions. A peak expression matrix for DORCs was constructed with LIGER’s⁶² makeFeatureMatrix method. The gene activity and DORCs counts were then merged in python to form a single chromatin modality. Similar to brain data, this matrix underwent TF-IDF normalization and WNN smoothing. A total of 6,436 cells and 962 genes participated in the downstream analyses. When computing the Spearman correlation with pseudotime based methods, only genes with likelihood higher than 0.07 were kept, resulting in 140 velocity genes. This filtering step ensures a fair comparison with scVelo by using a small set of high-quality genes.

Purified human CD34⁺ cells were purchased from the Fred Hutch Hematology Core B. Freshly thawed cells from a single donor were either immediately prepared for single-cell processing or maintained at 37°C with 5% CO₂ in Stemspan II medium supplemented with 100 ng ml⁻¹ stem cell factor, 100 ng ml⁻¹ thrombopoietin, 100 ng ml⁻¹ Flt3 ligand (all from Stemcell Technologies) and 100 ng ml⁻¹ insulin-like growth factor binding protein 2 (R&D Systems) for seven days. HSPCs were prepared according to the manufacturer’s ‘10x Genomics Nuclei Isolation Single Cell multi-ome ATAC + Gene Expression Sequencing’ demonstrated protocol. Briefly, cells were washed in PBS supplemented with 0.04% BSA and filtered with a Flowmi Cell Strainer (Bel-Art) (day 0) or sorted using the Sony SH800 cell sorter (Sony Biotechnologies) (day 7). Nuclei were isolated following the ‘Low Cell Input Nuclei Isolation’ sub-protocol and immediately processed using the Chromium Next GEM Single Cell Multiome + Gene Expression kit.

10x filtered expression matrices, Velocyto computed unspliced and spliced counts and feature linkage and peak annotation files from CellRanger ARC 2.0.0 were read into python to construct RNA and ATAC AnnData objects. Filtering, normalization and variable-gene selection were performed following scVelo’s online tutorial. Because HSPCs are rapidly proliferating, we noticed systematic differences in cell cycle stage across the set of cells. The cell cycle scores for both G2M and S phases, computed using scVelo’s tl.score_genes_cell_cycle function were then regressed out of the RNA expression matrices with scanpy’s pp.regress_out function (Extended Data Fig. 2b). Note that the regression did not change unspliced and spliced counts. Then gene expression scaling was performed. ATAC peaks were aggregated and normalized using the same procedure as described for the 10x mouse brain. Joint filtering between RNA and ATAC resulted in 11,605 cells and 1,000 genes. RNA expression was smoothed by scVelo’s pp.moments with 30 principal components and 50 neighbors. Leiden found 11 clusters. Cell types were assigned based on canonical HSPC markers^63–67. The chromatin accessibility matrix was WNN smoothed with 50 neighbors computed using Seurat. Then the RNA and ATAC objects were input to our dynamical function with default parameters. We relaxed the likelihood threshold for velocity genes (used for computing the velocity graph) to 0.02 compared to the default of 0.05 due to noisiness of this dataset.

To find complete genes in each of the lineages from HSC toward GMPs (myeloid), erythrocytes and platelets, we subset cells of each specific lineage and select known complete genes as those genes that have higher unspliced and spliced expressions in the progenitor populations leading to each of the terminal cell types. We then ran the model predetermination algorithm based on peak chromatin accessibility as described in the previous section. The genes predicted as model 1 and model 2 for each lineage are then merged with duplicates removed, and we performed gene ontology enrichment analysis (GOrilla⁶⁸) using all sequenced genes as the background set.

Preprocessed bulk chromatin immunoprecipitation with sequencing (ChIP-seq) peaks of H3K4me3, H3K4me1 and H3K27ac for CD34⁺ HSPCs were downloaded from GEO:GSE70677⁶⁹. Peaks were mapped to genes with Homer⁷⁰. Known complete genes in the myeloid and erythroid lineages were grouped together, and predicted model 1 and model 2 genes were extracted. Scores of peaks associated with the same genes were aggregated. Wilcoxon rank-sum test was used to compute significance.

The day 0 Multiome HSPC sample was integrated with the day 7 sample (Fig. 5) using the Seurat⁴² anchors workflow (FindIntegrationAnchors and IntegrateData) to remove technical differences due to batch effects. Each of the full RNA expression matrices, unspliced and spliced matrices, as well as gene-aggregated ATAC-seq matrix, were integrated and imputed independently between the two samples. A total of 2,000 RNA genes and 5,000 ATAC genes were used for integration with the day 7 sample as the reference dataset. The two raw ATAC-seq peak matrices were integrated using the IntegrateEmbeddings method in Signac⁶¹, and together with the integrated full RNA matrix, the WNNs were found. These outputs were then read into python and processed using the standard MultiVelo procedure. The UMAP resulting from the Seurat anchors RNA integration was used for plotting (Extended Data Fig. 2d). Seurat cluster results were used for cell types.

We obtained the multi-omic RNA, unspliced, spliced and ATAC-seq peak files from the authors. The ATAC peak matrix contains consensus peaks of nonoverlapping uniform 500 bp length. After initial clustering, we observed a severe batch effect in one of the three samples. We thus decided to remove this third sample and perform all downstream analyses with the two remaining samples (dc2r2_r1 and dc2r2_r2). We renamed the clusters from the original paper as follows based on marker gene expression: RG → RG/Astro, nIPC/GluN1 → nIPC/ExN, GluN3 → ExM, GluN2 → ExUp, GluN4 and GluN5 → ExDp⁵⁷. Peaks were annotated to genes with Homer⁷⁰. We considered peaks within 10,000 bp of transcription start sites as promoter peaks. A list of peak-gene links and correlations were downloaded from the supplementary material and aggregated to promoter peaks if the correlation exceeded 0.4. After filtering the RNA and ATAC matrices, 4,693 cells and 919 genes were left and input to model fitting.

TF motif profiles were computed with chromVAR⁴⁴ on the JASPAR2020 database⁷¹ using all consensus peaks. The background-corrected deviation z-scores were used as normalized motif accessibilities, and the values were smoothed with WNN. Then TF genes appearing in the variable-gene list (after internal filtering by the dynamical function) were extracted for time-lag analysis, which resulted in 30 known motifs.

All mental or behavioral disorder-associated SNPs (EFO_0000677) were downloaded from the Ensembl GWAS Catalog. The list contains 6,968 SNPs, and filtering for overlap with consensus peaks linked to the top genes resulted in 757 SNPs. Each SNP’s accessibility was quantified as the count of all ATAC fragments that overlap a 400-bp bin centered on the SNP location. The accessibility matrix was normalized by library size and smoothed by WNN neighbors.

Downstream target genes for EGR1, EOMES, FOXP2 and PBX3 were found through a literature search. A number of these targets were confirmed by ChIP-seq experiments^72–75. The time delay between expression of a TF and expression of its downstream targets was quantified with DTW.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol