Human integrated periodontitis atlas and mouse keratinocyte atlas generation and analysis

QE Quinn T. Easter
BM Bruno Fernandes Matuck
GS Germán Beldorati Stark
CW Catherine L. Worth
AP Alexander V. Predeus
BF Brayon Fremin
KH Khoa Huynh
VR Vaishnavi Ranganathan
ZR Zhi Ren
DP Diana Pereira
BR Brittany T. Rupp
TW Theresa Weaver
KM Kathryn Miller
PP Paola Perez
AH Akira Hasuike
ZC Zhaoxu Chen
MB Mandy Bush
XQ Xufeng Qu
JL Janice Lee
SR Scott H. Randell
SW Shannon M. Wallet
IS Inês Sequeira
HK Hyun Koo
KT Katarzyna M. Tyc
JL Jinze Liu
KK Kang I. Ko
ST Sarah A. Teichmann
KB Kevin M. Byrd
request Request a Protocol
ask Ask a question
Favorite

Raw fastq files for the previously published single-cell RNA sequencing projects were downloaded and processed using scripts available here: https://github.com/cellgeni/reprocess_public_10x. Briefly, series metadata was collected using the GEO soft family file. Following this, ENA web API was used to obtain information about the format in which raw data is available for every run (SRR/ERR), as well as to infer the sample-to-run relationships. Raw read files were then downloaded in one of the three formats: 1) SRA read archive; 2) submitter-provided 10X BAM file; 3) gzipped paired-end fastq files. SRA archives were converted to fastq using fastq-dump utility from NCBI SRA tools v2.11.0 using “-F --split-files” options. BAM files were converted to fastq using 10X bamtofastq utility v1.3.2. Following this, raw reads were mapped and quantified using the STARsolo algorithm. STAR version 2.7.10a_alpha_220818 compiled from source files with the “-msse4.2” flag was used for all samples. Wrapper scripts documented in https://github.com/cellgeni/STARsolo/ were used to auto-detect 10x kit versions, appropriate whitelists, and other relevant sample characteristics. The human reference genome and annotation exactly matching Cell Ranger 2020-A was prepared as described by 10x Genomics: https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/build#header. For 10x samples, the STARsolo command was optimized to generate the results maximally like Cell Ranger v6. Namely, “--soloUMIdedup 1MM_CR --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR --clipAdapterType CellRanger4 --outFilterScoreMin 30” were used to specify UMI collapsing, barcode collapsing, and read clipping algorithms. For paired-end 5’ 10x samples, options “--soloBarcodeMate 1 --clip5pNbases 39 0” were used to clip the adapter and perform paired-end alignment. For cell filtering, the EmptyDrops algorithm employed in Cell Ranger v4 and above was invoked using “--soloCellFilter EmptyDrops_CR” options. Options “--soloFeatures Gene GeneFull Velocyto” were used to generate both exon-only and full-length (pre-mRNA) gene counts, as well as RNA velocity output matrices.

The single-cell RNA-seq dataset was processed, analyzed and visualized using the Cellenics® community instance (https://scp.biomage.net/) hosted by Biomage (https://biomage.net/), accessed between May 2022 and February 2024. The team of Cellenics included Alex Pickering, Iva Babukova, Pol Alvarez Vecino, Martin Fosco, Anugerah Erlaut, Germán Beldorati Stark, Sara Castellano, Stefan Babukov, Vicky Morrison, Adam Kurkiewicz, Dana Vuzman, and Peter Kharchenko. The tool itself is Cellenics®, an open-source single-cell analysis toolkit from Harvard Medical School: https://github.com/hms-dbmi-cellenics. Pre-filtered count matrices were uploaded to Cellenics®. Barcodes were then filtered in a series of four sequentially applied steps. Barcodes with less than 500 UMIs were filtered out. Dead and dying cells were removed by filtering out barcodes with a percentage of mitochondrial reads above 15%. To filter outliers, a robust linear model was fitted to the relationship between the number of genes with at least one count and the number of UMIs of each barcode using the MASS package (v. 7.3-56)71. The expected number of genes was predicted for each barcode using the fitted model with a tolerance level of 1 − α, where α is 1 divided by the number of droplets in each sample. Droplets outside the upper and lower boundaries of the prediction interval were filtered out. Finally, the probability of droplets containing more than one cell was calculated using the scDblFinder R package v. 1.11.372. Barcodes with a doublet score greater than 0.5 were filtered out. After filtering, each sample contained between 300 and 8000 high-quality barcodes and was input into the integration pipeline. In the first integration step, data were log-normalized, and the top 2000 highly variable genes were selected based on the variance stabilizing transformation (VST) method. Principal-component analysis (PCA) was performed, and the top 40 principal components, explaining 95.65% of the total variance, were used for batch correction with the Harmony R package73. Clustering was performed using Seurat’s implementation of the Louvain method. To visualize results, a Uniform Manifold Approximation and Projection (UMAP) embedding was calculated, using Seurat’s wrapper around the UMAP package74. To identify cluster-specific marker genes, cells of each cluster were compared to all other cells using the presto package implementation of the Wilcoxon rank-sum test73. Keratinocytes were subset from the full experiment by extracting manually annotated barcodes and filtering the Seurat object. The subset samples were subsequently input into the Biomage-hosted instance of Cellenics®. Filtering steps were disabled since the data was already filtered. The data was subjected to the same integration pipeline as the full experiment. All cells were manually annotated using available literature and CellTypist75.

Annotated cell-level data were downloaded from Cellenics in the form of an .rds file containing a Seurat object. The data was converted by exporting count matrices and metadata from R and loading them using Scanpy version 1.9.3 (https://scanpy.readthedocs.io/). Additional metadata (e.g., age, sex, self-reported ethnicity) from the original datasets were matched to the closest entries in the respective ontology, per CELLxGENE contribution guidelines (https://cellxgene.cziscience.com/docs/032__Contribute%20and%20Publish%20Data). The final CELLxGENE dataset can be found at https://cellxgene.cziscience.com/collections/71f4bccf-53d4-4c12-9e80-e73bfb89e398.

Cells (all, keratinocytes, fibroblasts, vascular clusters; Supplementary Fig. 1) were grouped using lasso tools to allow for the pseudobulk RNA sequencing analyses using these Tier 1 annotations. Differentially expressed gene (DEG) lists and volcano plots (volcano plot statistical significance measured as a p value [i.e., ANOVA] log2 fold change) were generated in Cellenics® and exported as .csv files and uploaded to the g:Profiler website (https://biit.cs.ut.ee/gprofiler/gost). g:Profiler is part of the ELIXIR Recommended Interoperability Resources that support FAIR principles. A complete list of those resources can be found: https://elixir-europe.org/platforms/interoperability/rirs. g:Profiler assesses Gene Ontology and pathways from KEGG Reactome and WikiPathways. DEGs were uploaded to the query section and were first analyzed using g:GOSt multi-query Manhattan plots. These data were further analyzed for the results tab (GO:MF, GO:CC, GO:BP, KEGG, REAC, TF, MIRNA, HPA, CORUM, HP, WP). Data from Supplementary Fig. 2 are an incomplete display of all the g:Profiler data. DEGs are included in Supplementary Data 1 for further analysis.

The total number of ligand–receptor interactions between Tier 3 cell types was calculated for healthy and periodontal disease using CellPhoneDB (version 3.1.0). The Tier 3 annotated AnnData object was subsetted and separate AnnData objects were saved for healthy and periodontal disease, respectively. Metadata tables containing the cell barcodes as indices were also exported. CellPhoneDB was then run as follows: cellphonedb method statistical_analysis metadata.tsv AnnData.h5ad --iterations = 10 --counts-data hgnc_symbol --threads = 2. The CellPhoneDB results were filtered by removing those interactions with a P value > 0.05. Results were visualized using a modified form of CellPhoneDB’s plot_cpdb_heatmap function to allow for re-ordering of cell types. Cell-cell interactions between receptors in Tier 4 keratinocyte subtypes and ligands in innate and adaptive immune subtypes were further explored using the R package CellChat (version 1.6.1 using the cell-cell interaction database). The Tier 4 annotated AnnData object was subsetted and separate expression matrices exported for healthy and periodontal disease, together with their respective metadata tables. These were used to create Seurat objects for health and periodontal disease, which served as inputs to CellChat. Analyses were performed using the log-transformed normalized gene counts with default parameters and using the human CellChatDB. Cell type composition differences were accounted for when calculating communication probabilities. Data from health and disease were compared to identify significant changes.

The keratinocyte subset count matrices were imported into Scanpy version 1.9.3 to conduct quality control, normalization, and log transformation of the data to control for variability in sequencing depth across cells. To minimize the potential batch effects across the four datasets, a batch correction technique was applied using the Python package HarmonyPy version 0.0.973, with the ‘sample ID’ serving as the batch key. PAGA graphs were constructed using Scanpy’s implementation. These graphs were used to explore the relationships between different clusters of cells and to understand the potential developmental trajectories. The coordinates for UMAP80 were then calculated with the PAGA graph as the initial position, allowing for a visualization that is coherent with the topology of the PAGA graph. To better understand the developmental progression of cells along these trajectories, pseudotimes were estimated by diffusion pseudotime (DPT) analysis81 over the PAGA graphs. The DPT is a measure of the transcriptional progression of cells along a trajectory, starting from root cells that were manually selected. Heatmaps were created to visualize gene expression changes along the trajectories, with manually selected start and endpoints, using both Scanpy and Seaborn82. To smooth the plots and reduce noise, a moving average of the expression values was used, with a window size of 50 data points along pseudotime. The clustering of the genes in the heatmaps was performed using Ward’s method.

All the necessary animal procedures were followed according to the UK law, Animals Scientific Procedures Act 1986. The experiments were covered by the necessary project licenses under the Home Office and Queen Mary University of London’s institution’s Animal Welfare & Ethical Review Body (AWERB). The mouse tissues were obtained at Queen Mary University of London, Barts & The London School of Medicine and Dentistry. Mice from both genders were maintained on the C57BL/6N genetic background and were housed under a 12-h light/12-h dark cycle, at temperatures of 20–24 °C with 45–65% humidity. Single-cell suspensions of gingival tissue were obtained from P28 mice, sacrificed by cervical dislocation. Three biological replicates were pooled together to give one single sample for sequencing. Both males and females were used. Fresh gingival tissues were processed immediately after dissection, cut into smaller pieces in a sterile petri dish with RPMI medium (#11875093, Sigma) and digested for 30 min at 37 °C under agitation using the Miltenyi Mouse-Tumor Dissociation kit (#130-096-730). The resulting cell suspension was consecutively filtered through 100 µm and 70 µm cell strainers and cells were collected by centrifugation. The viability of the cell suspension was determined using a Luna-FL automated cell counter (Logos Biosystems). Single-cell cDNA library was prepared using the 10x Genomics Chromium Single-cell 3’ kit (v3.1 Chemistry Dual Index). The prepared libraries were sequenced on Illumina® NovaSeq™6000 (2 × 150 bp) with a targeted sequencing depth of ~30,000 reads/cell. The cell ranger-6.0.1 pipeline was used for processing the scRNAseq data files before analysis according to the instructions provided by 10x Genomics. Briefly, base call files obtained from each of the HiSeq2500 flow cells used were demultiplexed by calling the “cellranger mkfastq”. The resulting FASTQ files were aligned to the mouse reference genome (GRCm38/mm10), filtered, and had barcodes and unique molecular identifiers counted and count files generated for each sample. The raw count matrix output from CellRanger was then processed by the ambient RNA removal tool CellBender83, giving an output-filtered count matrix file. This was used for subsequent preprocessing and data analysis using Python package 3.8.13 with the Scanpy pipeline84. For basic filtering of our data, we filtered out cells expressing less than 200 genes and less than 100 counts. We filtered out genes expressed in less than three cells and with less than 10 counts. Cells were filtered out by applying the following thresholds: 1) more than 20% mitochondrial reads; 2) ribosomal reads lower or higher than the 5th and 95th percentile; 3) more than 1% of hemoglobin reads and 4) total reads lower than 700 and higher than 50,000. Scrublet, a doublet removal tool was applied to further remove predicted doublets85. To ensure that the data was comparable among cells, we normalized the number of counts per cell to 10,000 reads per cell. Data were then log-transformed for downstream analysis and visualization. The cell cycle stage was predicted using the sc.tl.score_genes_cell_cycle tool86. We regressed out cell-to-cell variations driven by mitochondrial, ribosomal, and cell-cycle gene expression and the total number of detected molecules. We then scaled the data to unit variance. The neighborhood graph of cells was computed using PCA presentation (n PCs = 40, n neighbors = 10). The graph was embedded in two dimensions using UMAP as suggested by Scanpy developers. Clusters of cell types were defined by the Leiden method for community detection on the generated UMAP graph at a resolution of 0.1. Epithelial clusters were used for the second-level clustering. The respective cell types were identified upon annotation of clusters from first-level clustering. The cluster-specific barcodes were retrieved as a list, which was used to select the cells of interest from the filtered count matrix on a separate Jupyter notebook and were re-analyzed separately. Epithelial cells were filtered and analyzed as previously described, and clustered at resolution 0.5 using Louvain.

The standard Kraken2 database (version 2.1.3) was downloaded. To avoid overlooking potential oral microbes, genomes from the Human Oral Microbiome Database (HOMD) (https://homd.org) not present in the standard database (n = 1,502 taxIDs) were also downloaded, and this custom database was built using kraken2-build88 with default parameters. Reads from were taxonomically classified using Kraken 2, with “–use-names” and “–report-minimizer-data” (Kraken2Uniq) but otherwise default parameters. True positives from Kraken2 results were identified using barcode level denoising from the SAHMI pipeline and rRNA enrichment. First, barcode denoising was performed. True taxa were identified by performing Spearman correlations between the number of unique and total k-mers across barcodes in each sample. Taxa found to significantly correlate (p value < 0.05) in at least one sample were retained. For all retained taxa, genomic contigs belonging to the taxa were extracted from the Kraken2 database and the reads that were classified to that specific taxa were then mapped to those genomic contigs using bowtie2 (version 2.2.5)89 with default parameters. Additionally, rRNAs were annotated along the extracted genomic contigs using barrnap (version 0.8) with default parameters. BEDTools (version 2.30.0)90 coverage was used to count the number of aligned reads overlapping annotated rRNAs. We then calculated the fold enrichment of reads across rRNAs relative to the entire genome, normalized by rRNA and genome length, respectively. Taxa found to contain at least a fivefold enrichment in rRNA sequences relative to the whole genome, which is expected for bacterial transcriptomics that is not rRNA depleted, were retained. From the human host reads, we previously identified which barcodes corresponded to which cell types. Because the reads that are classified to microbial taxa also contain these same barcodes, we then assigned cell types to the microbial reads. To calculate the relative abundance of taxa in each cell type, we divided the total number of reads classified to those taxa with a barcode assigned to that cell type by the total number of reads in the sample assigned to that cell type.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

post Post a Question
0 Q&A