Computational Methods

AJ Andrew L. Ji
AR Adam J. Rubin
KT Kim Thrane
SJ Sizun Jiang
DR David L. Reynolds
RM Robin M. Meyers
MG Margaret G. Guo
BG Benson M. George
AM Annelie Mollbrink
JB Joseph Bergenstråhle
LL Ludvig Larsson
YB Yunhao Bai
BZ Bokai Zhu
AB Aparna Bhaduri
JM Jordan M. Meyers
XR Xavier Rovira-Clavé
SH S. Tyler Hollmig
SA Sumaira Z. Aasi
GN Garry P. Nolan
JL Joakim Lundeberg
PK Paul A. Khavari
request Request a Protocol
ask Ask a question
Favorite

Gene-barcode counts matrices were analyzed with the Seurat R package (version 2.3.0). Cells with < 200 genes detected and > 10% mitochondrial gene mapped reads were filtered from downstream analyses. All samples were merged with the AddSamples function into one Seurat object. The merged Seurat object was normalized and scaled by regressing out UMI count and percentage of mitochondrial genes. For dimensionality reduction, the most variable genes were determined using the FindVariableGenes function with parameters 0.0125 < mean of non-zero values < 3 and standard deviation > 0.5. Dimensionality reduction was then performed using PCA and UMAP plots were generated by the RunUMAP Seurat function (Seurat version 3.0.0) with the first 13-15 PCs as input, determined by visualizing the drop off in PC variance explained using the ElbowPlot function in Seurat.

In order to determine cell types, we combined unsupervised clustering and differential expression to compare top differentially expressed genes with cell type specific expression known from literature. Through this approach, we confidently identified broad categories among all cells, and further delineated cellular subtypes by isolating subsets (through in silico “gating”) of broadly defined cell types and re-analyzing with the same approach. For broad cell type annotation shown in Figure 1B, low-resolution clustering was performed using the FindClusters function with resolution 0.1 with the first 15 PCs to generate 12 clusters. Differential expression was performed using the FindAllMarkers function in Seurat with default parameters. Four of these clusters (clusters 0, 2, 3, and 5) highly expressed epithelial-associated genes such as KRT5, KRT14, KRT1, and KRT10 and were therefore merged and inferred to be epithelial cells. Myeloid cells made up two clusters (clusters 1 and 4) and highly expressed LYZ and MHC class II genes (e.g., HLA-DRB1, HLA-DRA, and HLA-DQB2). Other clusters highly expressed markers specific for T cells (CD3D, CD2, and CD7; later determined to also include NK cells after re-clustering), fibroblasts (COL1A1, COL1A2, LUM), melanocytes (MLANA, DCT, PMEL), endothelial cells (TFF3, CLDN5, VWF), and B/plasma cells (IGLL5, IGJ, MS4A1, CD79A).

Myeloid cell subset annotations were determined by isolating their respective clusters, repeating dimensionality reduction and unsupervised clustering on these smaller groups of cells. For myeloid cell types, clustering at resolution 0.6 followed by differential expression demonstrated clusters high in the monocyte marker CD14, which were further re-clustered into macrophages (CD163, CD68) and MDSCs (S100A8, S100A9, TREM1). The remaining subsets were re-clustered into CD1C DCs (CD1C, CLEC10A), CLEC9A cells (CLEC9A, CADM1, XCR1), Langerhans cells (CD207, CD1A, S100B), AS DCs (AXL, IGFBP5, PPP1R14A), plasmacytoid DCs (CLEC4C, IL3RA), and migrating DCs (CCR7, CCL19) (Bronte et al., 2016, Kiss et al., 2018, Villani et al., 2017). Migrating DCs were further isolated, re-clustered into three groups, which were inferred to be CD1C DCs, CLEC9A DCs, and Langerhans cells based on similar expression to aforementioned markers.

T cell subset annotations were determined by further regressing patient identity using the ScaleData function in Seurat, and re-clustering the T cell cluster at higher resolution (2.0) to obtain 11 clusters. These clusters corresponded to the identities in Figure 1D and gene expression heatmap in Figure S4B, inferred from previous studies (Guo et al., 2018, Savas et al., 2018, Tirosh et al., 2016a, Zhang et al., 2018, Zheng et al., 2017). During all aforementioned re-clustering for mononuclear and T cell subsets, we also observed small clusters of cells co-expressing keratinocyte markers such as KRT5 or KRT1 (for both mononuclear and T cells), or mononuclear markers such as LYZ (for T cells), which we inferred to be doublets and removed from further analysis.

Cell cycle regression was performed using the CellCycleScoring function in Seurat to obtain S and G2/M phase scores for single cells, followed by the ScaleData function to regress out S and G2/M scores, and re-clustering to obtain the appropriate number of clusters. Proportion of cycling cells were obtained by calculating the number of previously identified cycling cells (pre-regression) that clustered with new clusters post-regression divided by the total number of cells in the post-regression clusters.

Keratinocytes were isolated from the epithelial clusters in the broad cell type UMAP (Figure 1B) and repeating dimensionality reduction with PCA and clustering at resolution 0.6. We identified putative pilosebaceous clusters based on expression of known hair follicle markers or transcription factors (KRT15, LHX2, SOX9) and prior mouse and human studies in addition to manual cross-referencing of the Human Protein Atlas (Joost et al., 2016, Purba et al., 2015, Uhlén et al., 2015) and removed these from further analysis. We separated normal keratinocytes from tumor keratinocytes based on sample origin and additionally removed a small number of cells coming from tumor samples that clustered with normal cells and vice versa. Normal keratinocytes were re-scaled with regression of patient identity and re-clustered into three clusters with representation from all patients (Figures 2A, 2B, and andS2S2D).

Among the eight patients with overlapping clusters of tumor keratinocytes, we employed canonical correlation analysis (CCA) to align the patient data using the Seurat package (version 2.3.0). We recovered too few cells from Patient 3 to complete CCA, and these cells were removed from further analysis. For the cells from the remaining seven patients, we performed clustering (resolution 0.25) and UMAP with the first 13 CCA components in Seurat (version 3.1.3). Cell cycle proportion analysis was performed as described above for other subsets of cells. Upon cell cycle regression of normal keratinocytes, a small cluster of inferred cycling pilosebaceous keratinocytes emerged and was removed from additional analysis. Differential expression was performed using the FindAllMarkers function in Seurat with default parameters. Genes filtered for adjusted p value < 0.05 (Wilcoxon Rank Sum test) were included in gene marker overlap analysis (Figure 2D; Table S2). Gene ontology (GO) analysis was performed using DAVID v6.8 (Huang et al., 2009) on the top 200 differentially expressed genes (adjusted p value < 0.05 by Wilcoxon Rank Sum test). GO terms shown in Figure 2E are enrichment at false-discovery rate (FDR) < 0.15 in ≥ 1 subpopulation. Hallmark EMT signature scoring was performed using the AddModuleScore function in Seurat with default parameters. SCENIC analysis was performed on tumor keratinocytes with recommend parameters from tutorials available on the SCENIC Github page (https://github.com/aertslab/SCENIC) (Aibar et al., 2017).

Xenograft single-cell scoring was performed using the AddModuleScore function with default parameters in Seurat with the top 85-100 significantly differentially expressed genes in tumor keratinocyte subpopulations for basal, cycling, differentiating, or TSK tumor keratinocyte subpopulations (Table S2) after pre-processing data in a similar manner as patient data. During pre-processing, a small cluster of SCC-13 xenograft tumor cells expressing eccrine-associated genes could not be ruled out as possible murine cell contamination, thus were removed from further analysis. Unbiased clustering was performed in Seurat to generate four clusters from SCC-13 and CAL27 SCC cell line datasets (at resolution 0.15). Differential expression was then performed across these clusters and an overlap correlation matrix was generated from a binarized matrix of all DEGs with logFC > 0.25 and adjusted p value < 0.05 using the FindAllMarkers function in Seurat. For murine cells, datasets from SCC-13, CAL27, and A-431 tumors were merged and unbiased clustering was performed to generate DEG lists of markers, which were used to annotate cell types based on literature as described above for patient data cell annotation. These DEGs were converted to human genes and used to generate an overlap correlation matrix with analogous patient TME cell type DEGs (Figure 7E). Select shared DEGs from both patient and murine data are shown in Figures S7A and S7B. Because too few A-431 SCC cells were recovered from sequencing to perform similar analysis as the other two cell lines, we omitted murine cells derived from A-431 tumors in plots of xenograft TME data.

The gene-spot matrices generated after ST data processing from ST and Visium samples were analyzed with the Seurat package (versions 3.0.0/3.1.3) in R in addition to custom scripts. For each patient data, spots were filtered for minimum detected gene count of 200 genes while genes with fewer than 10 read counts or expressed in fewer than 2 spots were removed. Normalization across spots was performed with the SCTransform function with regression of replicate and number of genes per spot. Dimensionality reduction and clustering was performed with independent component analysis (ICA) at resolution 0.8 with the first 20 ICs. For merged data shown in Figures S3B and S3C, all replicate samples across the four patients were merged with the “merge” Seurat function and re-normalized with SCTransform (regressing replicate and number of genes per spot) prior to ICA and UMAP on the first 20 ICs.

The spatial cluster gene signature overlap correlation matrix was generated by first taking all genes differentially expressed (average logFC > 0.25 and adjusted p value < 0.05 by Wilcoxon Rank Sum test; Table S4) across all ST clusters for each patient and generating a binarized matrix of genes and ST clusters (whether a gene belonged to the cluster or not). Pearson correlation was calculated across this matrix and hierarchical clustering of ST clusters was performed with the heatmap.2 function in the gplots package in R. Signature scoring derived from scRNA-seq or ST signatures was performed with the AddModuleScore function with default parameters in Seurat.

Nearest neighbor analysis was performed by counting the cluster identities of the four nearest neighbors on replicate sections for each ST section in the same patient. A null distribution of neighboring spots was determined by shuffling the cluster identities and counting nearest neighbor identities across randomized data for a total of 1,000 permutations while preserving number of spots and cluster identities per tissue section. The P value was the number of randomized permutations exceeding the observed data.

Spatial feature expression plots were generated with the SpatialFeaturePlot function in Seurat (version 3.1.3) and the STUtility R package (version 1.0.0).

We reasoned that genes expressed in adjacent spots in ST were potentially meaningful and that a simple correlation of genes across spots could overlook this adjacency structure within the data. Thus, we calculated average normalized gene expression across a “sliding window” of spot groups consisting of a central spot surrounding by its N nearest neighbors, where N = 4 in the original ST data and N = 6 in Visium samples (illustrated in Figures S3A and S3J) for each spot in the tissue, generating a matrix of genes by average spot group expression across all spots. This matrix can be correlated with any “anchoring” gene of interest (FOXP3 in our case) by calculating pairwise Pearson correlations of the FOXP3 expression vector across all spots and the gene average group expression vectors across spots. These values reflect if the expression of a gene in the area surrounding the anchoring gene is correlated with the expression of the anchoring gene and termed “spatial gene correlation” with FOXP3 as shown in Figures S5F and S5G.

Raw reads were aligned to hg19 using BWA with default parameters and duplicate reads were removed, resulting in a mean sequencing depth of 158x for on-target regions. The resulting aligned reads were processed using GATK 4.1.0.0 and Mutect2 variant calling for tumor-normal matched pairs, mostly following parameters from the GATK 4 documentation (https://gatk.broadinstitute.org). Mutect2 was run using a panel of normals derived from our cohort and the HapMap 3.3 allele frequencies as a germline resource. The resulting VCF files were filtered using FilterMutectCalls and FilterByOrientationBias and finally annotated using Funcotator, all within GATK tools.

Mass-spectrometry pixel data was extracted into a multi-dimensional TIFF as previously described (Keren et al., 2018), using custom MATLAB (Mathworks, Inc., Natick, MA) scripts publicly available (https://github.com/lkeren/MIBIAnalysis). Time of flight (TOF) data was calibrated using sodium (Na, mass 22.99) and gold (Au, mass 196.97) peaks.

Background subtraction of noise using bare regions of the slide with no tissues (high in Au counts) was performed as previously described (Keren et al., 2018).

To filter noise by signal density a k-nearest-neighbor approach was used (Keren et al., 2018). In short, for each count, the average distance to the 25 nearest positive counts was counted. Pixels with counts larger than one were treated as several counts with distance 0 from each other. Removal thresholds were determined as the crossing points in the bimodal distributions and low-confidence counts were removed.

MIBI images were converted to the MIBITiff format using custom scripts and uploaded onto the MIBItracker (V1.1.6.3, Ionpath Inc) for viewing. These images will be publicly available upon publication.

Nuclear segmentation was performed with DeepCell (Van Valen et al., 2016) (https://github.com/vanvalenlab/deepcell-tf) as previously described (Keren et al., 2018), using in-house training and validation MIBI imaging datasets from a variety of human clinical tissue types to ensure reproducibility.

Feature counts per cell from the nuclear segmentation were extracted and normalized by cell size, before an arcsinh transformation with cofactor 1. A number of factors can potentially affect the staining and ion extraction efficiency on the MIBI-TOF, such as tissue sample fixation and preparation conditions, the Z-height of the samples, and local antibody concentrations. To adjust for these variations, we assumed that the median counts for the nucleus marker, dsDNA, should be fairly consistent across FOVs and patient tissue samples. All marker counts were normalized by the median dsDNA signal per FOV, with a cutoff of dsDNA > 0.1 to remove cells with minimal staining of dsDNA.

Marker expression was then scaled between the 10th to 90th percentiles, and then normalized on a 0 to 1 scale. Cell types were identified first using FlowSOM (Van Gassen et al., 2015) followed by Marker Enrichment Modeling (MEM) (Diggins et al., 2017), before visualizing using Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018).

For Figure 5B, the subset of markers consistent between MIBI and scRNA-seq for expression across cell types was presented (scRNA-seq data in Figure S5B). Hierarchical clustering in Figures 5C and 5D were performed using the R function heatmap.2 from the gplots package. To assess co-localization of pairs of non-tumor cell types, each FOV was analyzed to calculate the mean distance between each cell of cell type A and all cells of cell type B. For example, in Figure 5F, for each CD8 T cell, the mean distance to Tregs was calculated. To determine the significance of the distribution, the distance was calculated between each cell of cell type A and a random selection of non-tumor (and not A or B type) cells. This random selection was repeated 1,000 times, and a false discovery rate of observing a median distribution as extreme as the real observed distribution was calculated.

The identification of non-tumor cell location categories as tumor-infiltrated, tumor-stroma border (leading edge), or stromal was based on determining the identity of the 30 nearest neighbors to each non-tumor cell. The proportion of nearest neighbors flagged as tumor cells determined which category each cell fell: a cell having zero tumor cell neighbors flagged it as “stromal,” having between 5 and 13 tumor cell neighbors flagged it as “leading edge,” and having greater than 19 tumor cell neighbors flagged it as “infiltrated.”

Ligand-receptor interactions were inferred using a similar approach as previously described (Vento-Tormo et al., 2018). We first calculated average expression of ligand and receptor pairs across cell type pairs in normalized scRNA-seq data from an aggregate of the seven patient tumor samples containing TSK cells. We only considered genes with more than 10% of cells demonstrating expression within each cell type considered. We calculated a null distribution for average ligand-receptor by shuffling cell identities in the aggregated data and re-calculating ligand-receptor average pair expression across 1,000 permutations of randomized cell identities. The P value was the number of randomized pairs exceeding the observed data. For bar plots shown in Figures 6B and 6C, in addition to including only ligand-receptor pairs with p < 0.001, we further thresholded individual ligand or receptor expression with a cutoff of average expression > 0.2 (in log space). The 0.2 cutoff was determined by calculating the average log gene expression distribution for all genes across each cell type, and genes expressed at or above this cutoff corresponded with the top 12% or higher of expressed genes for each cell type.

For NicheNet (Browaeys et al., 2019) analysis, we derived TME cell type signatures by taking the top 100 differentially expressed genes in cells isolated from tumors or normal skin, including B cells, endothelial cells, fibroblasts, Langerhans cells, plasmacytoid DCs, CD1C DCs, CLEC9A DCs, T cells, NK cells, macrophages, and MDSCs. We input these signatures into NicheNet to derive a union set of predicted ligands modulating tumor-specific TME cell type signatures. For ligands predicting TSK modulation, we input the top 100 TSK-differentially expressed genes (Table S3). The top 15% of predicted ligands by regulatory potential that also demonstrated significance in our scRNA-seq ligand-receptor interaction analysis (described above) in each case are shown in Figures 6D and 6F. For differential expression testing of ligands and receptors in heatmaps from 6D and 6F, we used the FindAllMarkers function in Seurat to generate average logFC values per cell type compared to other cell types from the scRNA-seq data.

For ligand-receptor spatial transcriptomic proximity analysis, the average value of all ligand-receptor pairs across the leading edge from the eight sections from patients 2, 4, and 10 were calculated first by averaging the ligand and receptor expression among each leading edge spot and its 4-6 nearest neighbors (depending on ST technology), and then taking the average values of all of these groups of five or seven spots across the leading edge. This calculation for each ligand-receptor pair was then performed on 1,000 randomized permutations of spot identities while preserving total number of spots per replicate section to generate a null distribution per patient. P value was calculated by number of randomized permutation calculations that exceeded the true average.

Subpopulation-enriched and SCENIC-nominated genes were selected based on analysis from patient scRNA-seq data (Figure S2F; Table S2). Merging patient and xenograft scRNA-seq helped select highly expressed genes across both datasets and exclude lowly expressed xenograft genes. Genes were determined to be subpopulation-enriched in patient scRNA-seq data if either one of two criteria were met: 1) recommended Seurat thresholds of average logFC > 0.25 and adjusted p value < 0.05 (Wilcoxon Rank Sum test) when compared to other subpopulations, or 2) average expression Z-score difference > 0.25 over any other subpopulation and adjusted p value < 0.05 (Wilcoxon Rank Sum test) when compared to other subpopulations. Genes not meeting these criteria were considered broadly expressed. Library guides were designed with Graphical User Interface for DNA Editing Screens (GUIDES) (guides.sanjanalab.org). The final library contained 334 genes at a coverage of eight guides per gene as well as 136 non-targeting control guides (Table S6).

Sequencing reads were first clipped using fastx clipper (from the FASTX-Toolkit package) for the sequence TCTTGTGGAAAGGACGAAACACCG. PCR duplicates were collapsed using the nine degenerate nucleotides added in PCR 1. Reads were then trimmed to the spacer sequence using fastx trimmer and mapped to the guide library using bowtie2. A read counts table was generated and used for downstream analysis.

The resulting matrix of read counts for each sgRNA (rows) and each sample (column) was then depth normalized so that each sample summed to ten million reads. These counts were log2-normalized, and the values for the time point 0 for each sgRNA were subtracted from time point 1 (three weeks of culture for in vitro and 8-10 weeks of xenograft growth for in vivo). Then for each sample, the median value of non-targeting sgRNAs was subtracted from all gene-targeting sgRNAs. These values are presented as “log2FC” in Figure 7 and Figure S7. We used the STARS algorithm to determine the false discovery rate for each gene (Doench et al., 2016). The log2FC tables were averaged across samples for each sgRNA and the resulting table was used as input to STARS, which was run with default parameters.

The significance of the difference of log2FC observed in in vitro and in vivo screens was determined using a permutation false discovery test (Figure S7D). On a gene by gene basis, the log2FC values for each in vivo tumor and in vitro biological replicate were permuted one thousand times. Then the number of instances of the permuted data exhibiting a more extreme difference between in vivo and in vitro was calculated, and this proportion was reported as the FDR.

Co-essential genes were determined from Wainberg et al., 2019, which scored co-essentiality of genes across genome-wide CRISPR screens in 485 cancer cell lines (Meyers et al., 2017) using a generalized least-squares regression. To build our network, we included all co-essential gene pairs at their previously determined FDR < 0.10 that were non-syntenic. We seeded the network using any hit that was depleted in at least one of our screens at FDR < 0.10 as determined by the STARS algorithm. We included any other gene in the network that was annotated as co-essential with at least two of the seed genes. For network visualization, genes were represented by nodes, while edges represent a co-essential relationship at FDR < 0.10.

To determine the average proportion of samples of each tumor type exhibiting copy number variation of every gene, we first downloaded the Gistic2 gene-level thresholded copy number tables from the UCSC Xena database (https://xenabrowser.net/datapages/). We analyzed a collection of 31 solid tumors including ACC, BLCA, BRCA, CESC, CHOL, COAD, ESCA, GBM, HNSC, KICH, KIRC, KIRP, LGG, LIHC, LUAD, LUSC, MESO, OV. PAAD, PCPG, PRAD, READ, SARC, SKCM, STAD, TGCT, THCA, THYM, UCEC, UCS, UVM. For each tumor type, we calculated the proportion of tumors for each gene with a score less than or greater than 0 indicating copy number loss or gain, respectively. We then averaged these proportions across all tumor types to generate the values plotted in Figure S7G.

We analyzed correlations between inferred TSK abundance and individual TME cell type abundances first by identifying a set of genes for each cell type using the average expression of genes for that cell type in our scRNA data. Genes were considered specifically expressed if they exhibited an expression value greater than 0.8 (Seurat normalized values) in the cell type of interest and less than 0.2 in all other cell types (Figure S6D). These gene sets were then used to calculate a cell type module score, which was computed finding a set of 100 background genes in the same expression bin (the range of expression of all genes divided by 20) for each gene in the signature, and subtracting mean expression of this background gene set from the mean expression of the cell type gene set in each TCGA bulk tumor. We then calculated the Pearson correlation between the module score of a curated specific (Seurat p < 1e-10) TSK gene set (MMP10, PTHLH, LAMC2, SLITRK6) and the module scores of each TME cell type in each tumor type to produce the values presented in Figure S6E.

We interrogated coupled CNV and expression data to identify genes that exhibited genetic evidence of altered CAF abundance by first identifying tumor samples with copy number loss (Gistic2 thresholded value less zero, as in the CNV analysis above) versus samples with stable copy number status for each gene (Gistic2 value of zero). We also calculated the inferred CAF abundance by computing the module score for the CAF signature for each tumor, as described above. For each gene, we then subtracted the average CAF module score of CNV-loss tumors from CNV-stable tumors to determine the effect of CNV loss on CAF abundance. We also calculated the correlation between the expression of the same gene and the CAF module score, and these values (for each gene across all 31 tumor types) are presented in the plot in Figure 6H. To identify genes with statistically significant values of CNV-loss scores or correlations, we permuted the tumor sample IDs 100 times and calculated the range of CNV-loss scores and correlations in each permutation. We flagged genes as significant with values more extreme than 95% of the permuted values to establish a false discovery rate of 5%. Genes highlighted in the plot in Figure 6H are the subset of statistically significant genes that were predicted as TSK-specific ligands interacting with CAFs by the NicheNet analysis.

Kaplan-Meier plots shown in Figure S7H were generated using the Survival package in R. Tumors from TCGA types shown were scored with the average expression of the TSK gene set (MMP10, PTHLH, LAMC2, SLITRK6). Patients in the top and bottom 20% of scores were designated as TSK high and low, respectively. P value was calculated using a chi-square test.

Gene expression and clinical data were downloaded from the supplemental data of Prat et al., 2017. Samples corresponding to head and neck SCC (HNSC) and squamous non-small cell lung cancer (LUSC) were analyzed for the mean expression of the two TSK-biased genes found in the select genes probed in this study: ITGB1 and PLAU. Samples were stratified into high or low expression groups of this gene set (top and bottom 50% of samples, respectively, across all HNSC and LUSC samples). The progression-free survival after anti-PD-1 treatment was then analyzed using the Survival package in R to generate the Kaplan-Meier plot shown in Figure 6I and the p value was calculated using a chi-square test.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A