Methods and Protocols

Alok Jaiswal; Prson Gautam; Elina A Pietilä; Sanna Timonen; Nora Nordström; Yevhen Akimov; Nina Sipari; Ziaurrehman Tanoli; Thomas Fleischer; Kaisa Lehti; Krister Wennerberg; Tero Aittokallio

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

Methods and Protocols

AJ Alok Jaiswal

PG Prson Gautam

EP Elina A Pietilä

ST Sanna Timonen

NN Nora Nordström

YA Yevhen Akimov

NS Nina Sipari

ZT Ziaurrehman Tanoli

TF Thomas Fleischer

KL Kaisa Lehti

KW Krister Wennerberg

TA Tero Aittokallio

This method is extracted from research article: Mol Syst Biol, Mar 2021

Multi‐modal meta‐analysis of cancer cell line omics profiles identifies ECHDC1 as a novel breast tumor suppressor

DOI: 10.15252/msb.20209526

Request a Protocol

Ask a question

Favorite

The Broad institute is carrying out a number of large‐scale cell line profiling projects such as the Cancer Cell Line Encyclopedia (CCLE) (Barretina et al, 2012; Ghandi et al, 2019), Cancer Dependency Map (DepMap) (Tsherniak et al, 2017; Meyers et al, 2017), and Cancer Therapeutic Response Portal (CTRP) (Basu et al, 2013; Seashore‐Ludlow et al, 2015). Specifically, we reused the point mutation profiles of coding genes among 1,570 cancer cell lines from the DepMap project (DepMap Broad, 2019). We included point mutations that were either categorized as pathogenic by the authors or had FATHMM (Shihab et al, 2013) score ≤ −0.75 and binarized the genes for presence or absence of a mutation with a functional consequence. The processed copy number profiles of 1,080 cancer cell lines, generated using the Affymetrix SNP 6.0 arrays, were obtained from CCLE (DepMap Broad, 2019). Gene‐level copy number gain and losses were called using a stringent threshold of ≥ 1 and ≤ −1, respectively. Genome‐wide transcriptomic profiles for protein‐coding genes generated with RNA sequencing for 1,156 cancer cell lines were obtained from the DepMap resource (DepMap Broad, 2019). Likewise, protein phosphorylation levels of 217 proteins in 899 cancer cell lines profiled using reverse phase protein arrays (RPPA) were obtained from the CCLE resource (Ghandi et al, 2019) and averaged for each protein. Quantitative proteomic profiles for 375 cell lines were generated using TMT‐labeled multiplexed protocol for sample preparation (Nusinow et al, 2020). We re‐analyzed both the bridge‐normalized and non‐bridge‐normalized proteome profiles. For methylation profiles, we used averaged gene‐level methylation profiles of promoters situated 1 kb upstream of transcription start sites for all coding genes in 843 cancer cell lines generated using reduced representation bisulfite sequencing (RRBS) method), as provided by the authors of the study. Since the genome‐wide CpG level data were not available, we used the original criteria for defining the promoter (Ghandi et al, 2019). For functional profiles, we used loss‐of‐function data from the Achilles Project that was generated with a pooled genome‐wide shRNA screening of 501 cancer cell lines (Tsherniak et al, 2017; DepMap Broad, 2019). Gene dependency scores of each coding gene were estimated using the DEMETER2 algorithm (Tsherniak et al, 2017). We also re‐analyzed gene dependency scores based on the genome‐wide CRISPR‐Cas9 knockout screens performed using various pooled sgRNA libraries from the DepMap portal. The Avana library was screened in 485 cancer cell lines (Meyers et al, 2017), the 120K sgRNA GeCKO v2 library was screened in 33 cancer cell lines (Aguirre et al, 2016), and the Sabatini library was screened in 15 acute myeloid leukemia (AML) cell lines (Wang et al, 2017). All the raw data for the knockout screens were processed by the Ceres algorithm (Aguirre et al, 2016) and downloaded from the DepMap or Achilles data portal. Drug response profiles of cancer cell lines for CCLE and CTRP v2 were obtained from the PharmacoDB database where the cell line identifiers were pre‐harmonized (Smirnov et al, 2018), and the drug‐induced viability response was estimated using the Drug Sensitivity Score (DSS₂), previously developed in the group (Yadav et al, 2014). While CCLE screened 24 compounds against 504 cell lines, the CTRP v2 dataset was generated by screening 544 compounds against 887 cell lines. Both of the drug sensitivity screens were based on CellTiter‐Glo (CTG) assay to measure cell viabilities.

The Sanger Institute has also carried out several studies for molecular characterization of cancer cell lines, performed under the Genomics of Drug Sensitivity in Cancer (GDSC) project (Garnett et al, 2012; Yang et al, 2012; Iorio et al, 2016; Van Der Meer et al, 2018). We re‐analyzed mutational profiles of 1,000 cancer cell lines for coding genes from the COSMIC Cell Lines project generated by whole genome sequencing (Bamford et al, 2004). We selected mutations that were categorized as pathogenic or had FATHMM score ≤ −0.75 and binarized the genes for presence or absence of a point mutation with a functional consequence. Copy number profiles for 991 cancer cell lines generated using the Affymetrix SNP 6.0 arrays and processed with the PICNIC (Greenman et al, 2010) algorithm were obtained from the GDSC portal (Yang et al, 2012). The resulting total copy number calls were normalized by the ploidy level as follows:

Gene copy number gains and losses were called by setting a threshold of ≥ 0.5 and ≤ −0.5, respectively, for the normalized copy number values. The RMA normalized gene expression profiles generated by Affymetrix Human Genome U219 array for 1,156 cancer cell lines were available from the GDSC portal (Yang et al, 2012). For methylation profiles (Yang et al, 2012), raw intensities generated by Illumina HumanMethylation450 BeadChip were processed using the Illumina Methylation Analyzer (IMA) R package (Wang et al, 2012). Quality control was performed by removing CpG sites with missing rate > 5% and detection P > 0.05. For the Illumina array‐based methylation data, pre‐annotated indices of CpG sites within the 1,500 kb (as supplied by the manufacturer) were used as the closest approximations for defining gene promoter methylation. Gene‐level methylation intensities were obtained by averaging the methylation levels of CpG sites located in the annotated promoter site of each gene within a range of 200–1,500 base pairs (bp) upstream of the transcription start site. Ultimately, methylation levels for ~ 19,000 coding genes in 1,026 cancer cell lines were available for further analyses. Drug response profiles of 250 compounds tested in 1,075 cell lines were quantified using the DSS₂ method (Yadav et al, 2014) for GDSC1000 dataset, as obtained from the PharmacoDB database (Smirnov et al, 2018). Drug sensitivity screens were based on fluorescent nucleic acid stain probe Syto 60 assay to measure cell viabilities.

Additionally, proteomic and phospho‐proteomic profiles of 50 colorectal cancer cell lines generated by multiplexed quantitative mass spectrometry (MS)‐based proteomics technology were made available at the Sanger Institute (Roumeliotis et al, 2017). Multiplexing was performed by the isobaric labeling technology with tandem mass tag (TMT) reagents. Normalization step involved row‐mean scaling of column‐normalized calculation of intensities. Since we observed a minimal overlap between the phospho‐peptides of proteins that were identified using MS in this study, when comparing to the protein residues profiled using targeted RPPA in other studies, we averaged multiple phospho‐peptides corresponding to the same protein to generate protein‐level phosphorylation estimates. Gene dependency profiles were generated at the Sanger Institute as a part of Project SCORE for 324 cell lines with a pooled genome‐wide CRISPR‐Cas9 knockout screens performed using two pooled sgRNA libraries, the Human CRISPR Library v.1.0 and v.1.1 (Behan et al, 2019). We used the copy number bias‐corrected count fold changes as a measure of gene dependency in our analyses.

We reused gene expression profiles and mutation calls for 675 cell lines generated by RNA sequencing (Klijn et al, 2015). For the point mutation data, we included the mutations whose annotations were already provided by the study authors, categorized as deleterious using variant function annotator methods, such as SIFT (Kumar et al, 2009), Condel (González‐Pérez & López‐Bigas, 2011), and PolyPhen (Ramensky, 2002). To be consistent, for the subset of unannotated mutations, we further annotated the variants using FATHMM (Shihab et al, 2013) and selected mutations with score ≤ −0.75, and binarized the genes for presence or absence of a mutation with a functional consequence. Gene copy number profiles for 668 cancer cell lines were generated using the Illumina HumanOmni2.5 4v1 arrays and processed with the PICNIC algorithm. We used ploidy‐corrected copy number calls to categorize the amplifications and deletions. Copy number gains and losses were called by setting a threshold of ≥ 0.5 and ≤ −0.5, respectively. Drug response estimates for 16 compounds tested in 409 cell lines and quantified using the DSS₂ method (Yadav et al, 2014) were available from the PharmacoDB database (Smirnov et al, 2018). Drug sensitivity screens were based on CellTiter‐Glo (CTG) assay to measure cell viabilities.

The NCI‐60 cancer cell line profiling data were extracted through the CellMiner data portal (Shankavaram et al, 2009), followed by further processing for the meta‐analyses. Mutational profiles were generated by exome sequencing. We included the mutations that were categorized as deleterious by the SIFT (Kumar et al, 2009) and MA (Reva et al, 2011) variant function annotators, which were provided by the CellMiner data portal. We further annotated the variants using FATHMM (Shihab et al, 2013), selected those variants with score ≤ −0.75, and binarized the genes for presence or absence of a mutation with a functional consequence. We used summarized log‐scale intensities representing copy number profiles generated by combining probe intensities from four platforms (Agilent Human Genome CGH Microarray 44A, Nimblegen HG19 CGH 385K WG Tiling v2.0, Affymetrix GeneChip Human Mapping 500k Array Set, and Illumina Human1Mv1_C BeadChip). A threshold of ≥ 0.4 and ≤ −0.4 was used to call copy number gains and losses, respectively. Similarly, processed GCRMA normalized gene expression profiles generated with Affymetrix Human Genome U133 plus 2.0 array was used. For methylation data, raw intensities generated by Illumina HumanMethylation450 BeadChip were processed. Gene‐level methylation intensities were obtained by averaging the methylation levels of CpG sites located in the annotated promoter site of each gene within a range of 200 to 1,500 bp upstream of the transcription start site to calculate gene promoter level methylation as described in the earlier section. Log intensities of protein phosphorylation site levels on 94 proteins generated using 162 antibodies by RPPA were averaged for each protein. For proteomic profiles of the NCI60 panel cell lines, we used the label‐free iBAQ‐based quantitative estimates of protein levels (Gholami et al, 2013).

We downloaded the processed datasets from the Breast Functional Genomics data portal (Marcotte et al, 2016). Log ratios representing copy number profiles generated using the Human Omni‐Quad BeadChip array and processed using the Circular Binary Segmentation (CBS) algorithm (Olshen et al, 2004) were available for 79 breast cancer cell lines. A stringent threshold of ≥ 0.4 and ≤ −0.4 was used to call copy number gains and losses, respectively. Transcriptomic profiles were generated for 82 breast cancer cell lines. Log intensities of protein phosphorylation levels of 193 proteins were generated using 245 antibodies by RPPA and averaged for each protein. For gene dependency profiles, we used data from pooled genome‐wide shRNA screen performed on 120 cancer cell lines from breast, pancreatic, and ovarian tissue types (Koh et al, 2012; Marcotte et al, 2016). Gene dependency scores of each coding gene were estimated using the DEMTER2 algorithm (Tsherniak et al, 2017).

All the processed datasets for breast cancer cell lines were downloaded from the Synapse portal (Daemen et al, 2013; Costello et al, 2014). Transcriptomic and genomic profiles were produced by RNA sequencing and exome sequencing, respectively. Point mutations were annotated as described in earlier sections using FATHMM (Shihab et al, 2013). Log ratios representing copy number profiles generated using the Affymetrix Genome‐Wide Human SNP Array 6.0 and processed with the Circular Binary Segmentation (CBS) algorithm (Olshen et al, 2004) were available for 77 breast cancer cell lines. A threshold of ≥ 0.5 and ≤ −0.5 was used to call gene copy number gains and losses, respectively. For methylation data, raw intensities generated with Illumina Infinium Human Methylation27 BeadChip Kit were used for the genome‐wide detection of 27,578 CpG loci, spanning a total of 14,495 genes. Probe intensities were processed to derive gene‐level methylation intensities by averaging the methylation levels of CpG sites located in the annotated promoter site of each gene within a range of 200 to 1,500 bp upstream of the transcription start site. Log intensities of protein phosphorylation levels of 146 proteins generated by reverse phase protein lysate arrays were available. We utilized drug response profiles of 89 compounds tested in 71 cell lines and quantified using the DSS₂ method (Yadav et al, 2014), extracted from the PharmacoDB database (Smirnov et al, 2018). Drug sensitivity screens were based on CellTiter‐Glo (CTG) assay to measure cell viabilities.

We re‐analyzed log intensities of protein phosphorylation levels of 382 proteins generated using 452 antibodies by RPPA for 650 cancer cell lines were (Li et al, 2017). Intensities for multiple phospho‐sites from each protein were averaged.

Quantitative proteomic profiles of 41 breast cancer cell lines were generated using multiplexed quantitative mass spectrometry (MS)‐based proteomics technology (Lapek et al, 2017). Multiplexing was performed using the isobaric labeling technology with ten‐plex tandem mass tag (TMT) reagent, and the bridge‐normalized intensities were used in the analyses.

We processed the label‐free iBAQ‐based quantitative proteomic profiles of 20 breast cancer cell lines generated using non‐multiplexed label‐free quantitative mass spectrometry (MS)‐based proteomics technology (Lawrence et al, 2015).

We made use of gene dependency profiles for 8,195 genes in 398 cancer cell lines for which raw data were generated by pooled genome‐wide shRNA libraries (McDonald et al, 2017). shRNA level scores were collapsed to gene dependency scores of each coding gene using the DEMTER2 algorithm (Tsherniak et al, 2017).

Drug response estimates for 52 compounds tested in 50 cell lines and quantified using the DSS₂ method were obtained from the PharmacoDB database (Mpindi et al, 2016; Smirnov et al, 2018; Gautam et al, 2019). Drug sensitivity screens were based on CellTiter‐Glo (CTG) assay to measure cell viabilities.

We re‐analyzed label‐free quantitative (LFQ) estimates of proteomic profiles of 30 ovarian cancer cell lines generated with a label‐free quantitative mass spectrometry‐based proteomics technology (Coscia et al, 2016).

Since drug response profiles exist in the compound space, we projected them into gene space to create an additional functional data modality. To do this, we used our previously described pipeline, target addiction scoring (TAS), which transforms the drug response profiles into target addiction signatures (Jaiswal et al, 2019). The TAS pipeline makes use of drug poly‐pharmacology to integrate the drug sensitivity and target selectivity profiles through systems‐wide interconnection networks between drugs and their targets, including both primary protein targets as well as secondary off‐targets. The TAS approach is individualized in the sense that it uses the drug sensitivity profile of each cancer cell line screened separately against a library of bioactive compounds and then transforms the observed phenotypic profile into a cell line‐specific target addiction profile, hence enabling ranking of potential therapeutic targets based on their functional importance in the particular cell line.

We applied the TAS pipeline separately to each cell line drug response profile considered in the study. First, we obtained the set of potent protein targets for each drug from various drug‐target databases as described previously (Jaiswal et al, 2019; Dataset EV1). For instance, we retrieved at least one potent target for 349 of the 495 compounds profiled in the CTRP dataset. The rest of the compounds are either non‐targeted drug treatments or compounds with unknown target profiles. Likewise, protein targets were identified for 201/250 compounds in GDSC; 44/52 compounds in FIMM; 33/89 compounds in OHSU; 13/16 compounds in gCSI; and 19/25 compounds in CCLE dataset. For each individual target t, TAS_t was calculated by averaging the observed drug response (here, DSS₂) over all those n_t compounds that target the protein t. Eventually, we were able to derive the functional TAS profiles for a median of 222 targets in each dataset.

Spearman’s correlation analysis was conducted to evaluate the reproducibility of continuous molecular data types between any two studies. The reproducibility analyses were performed on the identical set of cell lines and on the common set of genes between any two datasets. Matthews correlation coefficient (MCC) was calculated to assess the consistency between binarized data types, such as gene‐level mutational profiles, or copy number gain and loss profiles. Only the correlation values with P < 0.05 calculated based on the molecular profiles with ≥ 10 genes or proteins were considered for further analysis.

Binarized datasets of gene copy number gains and copy number losses were generated from their continuous CNV profiles using the study‐specific thresholds, as specified above. Protein phosphorylation intensities from multiple residues mapping to the same protein were averaged to generate protein‐level phosphorylation estimates. Since the GDSC phospho‐proteome study in colorectal cancer cell lines used global MS technique for protein phosphorylation profiles (Roumeliotis et al, 2017), only those proteins that were also profiled in other research sites using targeted RPPA were considered for the meta‐analysis.

In the cell line‐specific gene Identification Pipeline (CLIP), we first define the notion of a Cancer cell line‐specific (CCS) gene, and then quantify the strength of evidence for each gene across data modalities and datasets from different laboratories. In this study, we only considered data modalities in the gene/protein space, and exclude drug response‐based profiles; however, TAS profiles were derived from the drug response phenotypes.

For continuous data modalities, i.e., gene expression, protein expression, gene methylation, protein phosphorylation, drug sensitivity, and gene dependency profiles, the strength of CCS evidence was estimated using one‐class rank product analysis performed separately for each gene and cell line combination from datasets across multiple laboratories using RankProduct package (Del Carratore et al, 2017). For methylation, gene expression, protein expression, and protein phosphorylation data modalities, genes with pfp < 0.10 were considered significant. As we observed lower consistency for functional gene dependency profiles, we used a less stringent threshold of pfp < 0.25 to increase the coverage of identifying robust CCS genes. These choices were informed by the reproducibility analyses of the multi‐modal profiles.

For binarized data modalities, i.e., copy number gain and loss and gene point mutation profiles, any alteration that was observed in ≤ 10% cell lines in any single dataset was considered as a CCS gene. For cell lines that were profiled in only one study, the rank product analysis could not be performed, and therefore, we selected all the top‐ranked genes: top‐100 genes (0.5% of all genes) for the continuous variable datasets, except for functional gene dependency modality (FUNC), where the top‐200 genes were considered (1% of all genes) to account for the higher noise in functional gene dependency profiles. For the protein phosphorylation datasets, we selected the top‐10 genes (0.5% of all proteins), considering that 200 proteins were assayed by reverse phase protein arrays on average.

Genes that were identified as CCS genes in two or more data modality types are considered in our analyses as robust CCS (rCCS) genes. Fisher's exact test was performed to identify the genes that are enriched in different breast cancer subtypes or pre‐defined subgroups of cancer cell lines. Overall, we applied the CLIP approach to 1,047 cancer cell lines for which data was available for ≥ 6 modalities (Dataset EV2).

The categorized CNV modalities (CNV_AMP and CNV_DEL) were used as inputs in the CLIP pipeline, instead of the continuous CNV data. Notably, while datasets from BROAD, OHSU, UHN, and NCI60 have been originally processed using the CBS algorithm, data from GDSC and gCSI have been processed using the PICNIC algorithm. To make the integrative approach more systematic and unbiased, we inspected the copy number distributions of two well‐known CNV alterations in breast tumors; ERBB2 which is known to be frequent amplification, and PTEN which is known to be a frequent deletion in breast cancer patients (Koboldt et al, 2012). For ERBB2, we looked at the distribution of its copy number values across all cell lines in each dataset (Appendix Fig S1A). We then identified the threshold for calling an amplification in each dataset at the value where we observed a sharp deflection. For BROAD, this analysis led to the threshold of ≥ 1 for calling amplifications. For, GDSC, gCSI, and OHSU, values ≥ 0.5; and for UHN and NCI60, value ≥ 0.4 were considered as amplifications. Similarly, for calling deletions; BROAD < 1; GDSC, gCSI, and OHSU < 0.5; and UHN and NCI60 < 0.4 were selected for the CLIP meta‐analysis based on the analysis of PTEN deletions (Appendix Fig S1B). We also executed CLIP at various detection thresholds and did not observe significant differences in its ability to identify breast cancer subtype‐specific driver genes that were used for benchmarking the performance of CLIP (Appendix Fig S1C).

As the first reference approach, we assessed the relative performance of the CLIP framework in comparison to each independent data modality alone for their ability to identify breast cancer subtype‐specific driver genes. For this benchmark, we first defined the set of “true” cancer driver genes as identified in the recent study (Bailey et al, 2018), separately for pan‐cancer and breast cancer tumor types, totaling to a set of 201 genes (Dataset EV3). To evaluate the performance of each data modality, we assumed the gene or protein expression up‐regulation, hypermethylation, and copy number gains as molecular phenotypes that are likely to correspond to subtype‐specific driver genes. Using the empirical Bayes framework of limma R package (Ritchie et al, 2015), we performed differential analysis between the two groups of breast cancer cell lines, “subtype+” and “subtype−” groups for each data modality and each study site. Next, for each data modality, we performed rank product analysis to integrate the evidence for the odds of differential levels, measured by B‐statistic (Ritchie et al, 2015), separately for datasets from every research site to identify robust subtype‐specific driver genes. Then, we characterized the subtype‐specific genes as up‐regulated vs. downregulated, hypermethylated vs. hypomethylated, gain vs. loss, based on the direction of average fold changes. Finally, we compared the fraction of true cancer drivers that were identified by CLIP and each individual data modality for each breast cancer subtype.

To assess how much the different modalities contributed to the performance of CLIP, in addition to the base setting of CLIP in which all the modalities were included, we also re‐ran the CLIP by removing each data modality one at a time, and measured the fraction of true positive driver genes identified after the removal of the input datasets (Appendix Fig S1D).

In addition, we also benchmarked our approach against the Multi‐Omics Factor Analysis (MOFA+) method, which models latent factors for integration of multi‐omics datasets (Argelaguet et al, 2020). MOFA+ is an unsupervised framework for discovering hidden factors that capture biological variability across multiple data modalities as well as within an individual modality. While MOFA+ has been primarily applied for integrating multi‐omics datasets from a single study site, in principle it can be applied also to multi‐omic datasets from multiple sites. However, due to the differences in distribution of the omics measurements, resulting from the various technological platforms that were used at the various sites for generating the datasets, we chose to apply MOFA+ to breast cancer cell lines from BROAD, GDSC, OHSU, and UHN only. Furthermore, we reasoned that the multi‐group framework in MOFA+ could be used to identify the latent factors that are specific to each breast cancer subtype, even though the authors of MOFA+ acknowledge that the multi‐group framework is not designed for capturing differential changes between the groups, but rather to identify the principal sources of variability that drive each group.

Specifically, we pursued the following strategy for the omics datasets from each site: (i) For BROAD, we considered CNV, MUT, METH, GEXP, PEXP, FUNC_CRISPR, and FUNC_RNAI data modalities. Each dataset was quantile‐normalized and subset to the set of breast cancer cell lines, before running the multi‐group MOFA+ model using its default settings. (ii) The proportion of variance explained for each group by each data modality/view for 15 latent factors were inspected (Appendix Fig S2), and the factors and views with the highest variance were then selected further. (iii) Top‐20 (top‐10 positive weights and top‐10 negative weights) view‐specific features loading on to each factor were extracted from the MOFA+ model and then compared against the established gold standard driver genes from a recent study (Bailey et al, 2018, Dataset EV3).

To further challenge the performance of CLIP as multi‐modal and multi‐site data integration approach, we assessed whether ECHDC1 could be identified by alternative methods that integrate multi‐omics dataset. First, we applied the MOFA+ method (Argelaguet et al, 2020), which combines datasets from a single research site. Because ECHDC1 was supported by GEXP and METH modality from CLIP, we applied MOFA+ method independently to all the datasets that profiled for methylation in breast cancer cell lines, namely BROAD, OHSU and GDSC. GEXP modality data from BROAD and OHSU were log‐transformed before input into MOFA+. Otherwise, the MOFA+ method was run using its default options without defining any groups. We performed the same steps (i)–(iii) as detailed in the previous section for the BROAD (CNV, GEXP, METH, MUT, FUNC_AVANA, FUNC_ACHILLES), GDSC (CNV, GEXP, METH, MUT, FUNC), and OHSU (CNV, GEXP, METH, MUT) datasets. The other modalities, such as PHOS and TAS, were not considered because of very low number of features compared to the other modalities (Appendix Fig S3).

Alternatively, we also devised a simplistic approach to investigate whether ECHDC1 could be identified by analyzing the METH modality only. As for CLIP, we performed z‐scaling for each gene on quantile‐normalized gene‐averaged methylation levels on breast cancer cell lines from each dataset. Next, to identify the outlier genes (i.e., the CCS genes in CLIP), specific to each cell line, we selected all the genes that had z‐scores above or below 1.66 standard deviations from the mean z‐score in each dataset. Following this, we checked the frequency of ECHDC1 being identified as an outlier gene in all the breast cell lines that were profiled in each dataset. We implemented this approach on methylation datasets from BROAD and GDSC (Appendix Fig S3). The method could not be implemented in the OHSU dataset because ECHDC1 methylation levels were not measured.

We further assessed the ability of CLIP to identify known oncogenic addictions by utilizing the set of cancer drivers, as defined in Bailey et al (2018), as true positives. After sub‐setting the list of driver genes to only those that were defined as oncogenes for pan‐cancer and epithelial cancer types, we had a total of 227 oncogenic driver genes. Next, to benchmark CLIP, we counted the frequency of each rCCS gene across all the epithelial cancer cell lines. We reasoned that the true oncogenic addictions will be identified more frequently as a rCCS gene, when compared to the non‐oncogenes. We ran the CLIP pipeline on the subset of epithelial cell lines, in the following three settings to identify rCCS genes: (i) including datasets of all modalities (number of epithelial cell lines = 737); (ii) including datasets of all modalities, but constraining rCCS gene selection criteria by compulsory identification as a rCCS gene in the FUNC modality (number of epithelial cell lines = 679); and (iii) including datasets of all modalities, except FUNC modality (number of epithelial cell lines = 736). Next, we compared the difference in rCCS frequency between oncogenes and non‐oncogenes groups in each setting using the Wilcoxon test.

To statistically identify candidate genetic interactions (Gis) between genes, Fisher’s exact test was performed to evaluate the difference in the proportion of rCCS genes between two groups of cancer cell lines, mutated and wild type, for all the well‐known cancer driver genes. For defining the subset of potential synthetic lethal (SL) interactions, we considered genetic interaction partners of only those rCCS genes that were also identified as essential genes, based on the evidence from the gene dependency modality (Dataset EV4). Out of the 2,018 cell lines that were included in the meta‐analyses, we considered 1,047 cell lines for which molecular data were available in ≥ 6 of the eight data modalities in gene space. Furthermore, we removed cell lines derived from bone, skin, nervous, and hematopoietic systems, restricting our interaction analyses to epithelial cancer cell lines (n = 679). For selecting the mutated driver genes with relevance in patient tumors, we considered the highly frequent driver genes in patient tumors from a recent pan‐cancer study of mutational landscape in The Cancer Genome Atlas (TCGA) dataset (Kandoth et al, 2013) and used a subset of the driver genes that were mutated in at least five cell lines.

We assessed the ability of CLIP to identify Gis between paralogous genes in comparison to using only the FUNC modality dataset. First, we obtained the list of paralogous genes in the human genome from the Duplicated Genes Database (DGD) (Ouedraogo et al, 2012). In total, we had a list 3,543 paralogs in the human genome that are constituted by 945 unique paralog groups. For instance, ARID1A and ARID1B are paralogs that belong to a unique group. Briefly, we conducted a Fisher’s exact test to assess the difference in proportions for rCCS genes between mutant and WT cell lines for each paralog group. For example, if ARID1A is mutated, then we tested whether ARID1B is enriched in proportion as a rCCS gene in mutant vs. WT cell line comparison. Or, vice versa, if ARID1B is mutated, then we evaluated whether ARID1A is enriched as an rCCS gene in those cell lines. We performed the genetic interaction analyses on paralogs that are mutated in at least 10 epithelial cell lines to have sufficient power to detect robust associations. If any gene from a paralog group was detected as statistically significant, we record that as a paralog GI association. Finally, we plot the proportion of number of unique paralog groups that have a paralog SL/GI association in relation to the total number of unique paralog groups that were tested. Moreover, we ran this analysis in different settings to evaluate the contribution of FUNC modality to the performance of CLIP; (i) FUNC Compulsory: GI analysis on rCCS genes from CLIP, but rCCS gene should be compulsorily identified as a rCCS gene in the FUNC modality (number of epithelial cell lines = 679). This is the default setting to identify SL relationships, since a gene has to be essential and rCCS. (ii) All modalities: GI analysis on rCCS gene from CLIP (number of epithelial cell lines = 737). Since essentiality is not mandatory criteria for rCCS calling, these associations can be generally categorized as genetic interactions. (iii) FUNC excluded: GI analysis on rCCS gene from CLIP by including all modalities, except for FUNC modality (number of epithelial cell lines = 736). Since, gene essentiality is not the included criteria for rCCS calling, these associations can be generally categorized as genetic Interactions.

To benchmark the performance of CLIP, we performed GI analysis in each individual FUNC dataset. Using the limma package (Ritchie et al, 2015), we performed a linear model differential expression analysis between mutant vs. WT cell lines for difference in gene essentiality scores. As described above, we subset to paralogs that are mutated in at least 10 epithelial cell lines. Associations with P < 0.05 were considered statistically significant. To compare the differences in proportions of paralogous GI associations using multiple strategies, we performed two proportions z‐test for difference in proportions of every comparison relative to CLIP setting (FUNC compulsory).

To perform survival analysis in breast cancer patients with high statistical power, we combined two patient cohorts with 20‐year follow‐up data and molecular profiling: the Metabric cohort (Curtis et al, 2012), which contains around 2,000 breast cancer patients with gene expression data, and the Oslo cohort (Fleischer et al, 2014, 2017), which contains 334 breast cancer patients with DNA methylation data. For the patients with DNA methylation data, an average DNA methylation level for all pre‐annotated indices of CpG sites within the range of 200–1,500 bp upstream of the TSS was calculated for each patient. We grouped the patients into two groups based on gene expression or promoter methylation of ECHDC1. To capture the effect of loss of tumor suppressor function, we identified the patients with either low expression or high methylation by selecting the patient tertile (one‐third of the patients) with the lowest expression or highest methylation (low GEXP group), which was compared to the rest of the patients (non‐low GEXP group). Subsequently, we performed survival analysis on the two groups (log‐rank test), where age was included as continuous covariate in a multivariate Cox model. The sex and race were not considered as relevant factors in Nordic breast cancer patients. Race information was not available for Metabric patients. The analyses were performed in R using the packages; survival and rms.

Immortalized breast epithelial cell line MCF10A and breast carcinoma cell line BT‐474 (both American Type Culture Collection; ATCC) were cultured according to manufacturer’s instructions in a humidified incubator with 5% CO₂ at 37°C and routinely checked and tested negative for mycoplasma contamination using MycoAlertPlus™ Mycoplasma Detection Kit (Lonza) according to manufacturer’s instructions.

Oligonucleotides (Merck) encoding single guide RNAs (sgRNA) against ECHDC1 (see Dataset EV5 for sequences) were cloned into LentiCRISPRv2 plasmid (#52961, Addgene) as described previously (Sanjana et al, 2014). Lentivirus particles were generated by seeding HEK293T cells at a density of 105 cells per cm². After 16 h, cells were transfected with transfer plasmid, packaging plasmids pCMV‐VSV‐G (Stewart et al, 2003; #8454, Addgene), and pCMV‐dR8.2 (Stewart et al, 2003; #8455, Addgene) using Lipofectamine 2000 (Life Technologies). Supernatants were collected 48 h after transfection. Cells were infected in 24‐well plates with lentiviral particles (MOI ~ 5) for 24 h in the presence of 8 µg/ml polybrene. The culture media were replaced for puromycin (Lentiguide; 1 µg/ml) containing media, and cells were selected for 3 days respectively.

To confirm the gene knockout, the ECHDC1 target regions were amplified with corresponding primers (Dataset EV6) using OneTaq^® DNA Polymerase (New England Biolabs; Cat: M0480). After the initial denaturation at 94°C for 9 min, 30 polymerase chain reaction (PCR) cycles were performed as follows 94°C for 40 s, 65°C for 30 s, and 72°C for 40 s. PCR‐amplified products were purified using a PCR purification kit (Macherey‐Nagel) and sequenced using Sanger sequencing (Applied Biosystems ABI3730XL DNA Analyzer).

Control and ECHDC1 knockout, MCF10A and BT‐474 cells were embedded as 5 × 10³ single cells (MCF10A) or 5 × 10² cell‐spheroids (BT‐474) in 3D collagen drops. BT‐474 cells were allowed to form tumor‐representing cell‐spheroids for 24 h as 5 µl hanging drops in complete media. Rat tail collagen type I (Sigma‐Aldrich) was dissolved in 0.25% acetic acid and diluted 1:1 with 2× MEM (Gibco) to final concentration of 2.25 mg/ml. Cells and pre‐formed spheroids were embedded in 30 µl collagen and incubated at 37°C up to 5 days. Cell growth was followed daily by phase‐contrast imaging using inverted epifluorescence microscope (Axiovert 200; Carl Zeiss). Collagen 3D drops were fixed at 0, 24, 48, 72, 96, and at 120 h using 4% PFA for 1 h at room temperature (RT).

Fixed 3D collagen drops were post‐fixed with ice‐cold 1:1 acetone‐methanol for 45 s and blocked with 15 % FBS—0.3% Triton‐X in PBS for 2 h at RT. Drops were incubated with Alexa Fluor 568 Phalloidin (1:40; ThermoFisher Scientific) for 4 h at RT, washed with 0.45 % Triton‐X in PBS followed by Hoechst 33342 staining (10 µg/ml; ThermoFisher Scientific) for 30 min at RT. After PBS, wash drops were mounted with Vectashield Antifade Mounting Medium (Vector laboratories).

Light micrographs were taken by using Zeiss AxioImager.Z1 upright epifluorescence microscope with Apotome combined with a computer‐controlled Hamamatsu Orca R2 1.3‐megapixel monochrome CCD camera and ZEN software. 20× Plan Apochromat 0.8 NA objective was used. Immunofluorescence quantifications were done by using ImageJ (Abramoff et al, 2004), and post‐acquisition image processing was performed using CorelDRAW 2020 software (Corel). Statistical analysis was carried out using one‐way ANOVA with Tukey’s multiple comparison test.

A total of 14 breast cell lines (see Dataset EV7), 7 for ECHDC1 rCCS positive (rCCS+) and 7 for rCCS negative (rCCS−), were selected for subsequent metabolite assays based on their ECHDC rCCS status. In total, 42 samples (14 lines with 3 replicates) were run with UPLC‐MS (MRM) method, each sample containing ca. 4 × 10⁶ cells. Culture conditions for each cell line are detailed in Dataset EV7.

Quenching protocol for adherent cells involved the following steps: Cells were washed with 2× volume of cold phosphate‐buffered saline (PBS) and incubated with trypsin (TrypLE™ Express Enzyme (1×), #12605010, Invitrogen) at 37°C until the cells detached. Trypsin was inactivated by adding an equal volume of cold fetal bovine serum (FBS, #10270‐106, Invitrogen). The cells were counted and centrifuged at 400 × g for 5 min at 4°C. The cells were washed with 2× volume of cold PBS, each time centrifuging at 400 × g for 5 min at 4°C. For each sample, 4 × 10⁶ cells were resuspended in 500 μl of cold PBS and centrifuged at 400 × g for 5 min at 4°C in 1.5 ml microcentrifuge tube. The cells were quickly washed with 2× volume of deionized water, not disturbing the pellet and not exposing the cells to water for more than 4–5 s. All water was aspirated from the tube. The cells were frozen in liquid nitrogen and stored at −80°C.

The protocol for cell disruption and extraction was adapted from a previously described method (Dettmer et al, 2011). Briefly, cells disrupted with a combination of freeze‐thaw cycle (−80/+4°C) and sonication. 1,000 µl of 80% MeOH (Honeywell, Riedel‐de‐Haën™, Seelze. Germany) and 10 µl of internal standard (ISTD) (conc. 10 µg/ml) was added to the purified cells. Then samples were ultra‐sonicated in ice bath for 10 min, put to liquid nitrogen back and forth three times. After cell disruption, samples were vortexed for an hour, centrifuged at 21,500 g for 5 min at +4°C. The second extraction was performed with 600 µl of 100% MeOH for 30 min. The supernatant (1,600 µl) was dried under vacuum (MiVac Duo concentrator, GeneVac Ltd, Ipswich, UK), reconstituted to 50 µl of 0.1% formic acid in acetonitrile (ACN)/H₂O (Honeywell, Riedel‐de‐Haën™, Seelze. Germany), and run with UPLC‐QTRAP/MS with ESI (+) and (−) switching (ExionLC UPLC, ABSciex; 6500+ QTRAP‐MS, ABSciex).

Ten microliter (μl) of extract was injected into the LC column with the mobile phase flow of 0.4 ml/min at +35°C. The LC separation was carried out on a reversed‐phase UPLC‐column (Waters Acquity BEH C18, 150 × 2.1 mm, Ø 1.7 µm). A gradient elution of the analytes was achieved using a program with mobile phases A (aqueous 0.1% formic acid) and B (0.1% formic acid in ACN). The linear gradient started at 99% A and 1% B was held for 2 min and proceeded from 1% B to 10% in 2 min, 10% B to 90% in 2 min, held at 90% for 1 min, then switched back to 1% B and left to stabilize for 2 min. Total run time of 9 min.

In total, 12 metabolites were analyzed from the cells: aspartate (Asp), glutamate (Glu), glutamine (Gln), α‐ketoglutarate (α‐KG), fumarate, succinate, malate, citrate, isocitrate, 2‐hydroxy‐3‐methylbutyrate (2OH‐3MBA), malonate, and methylmalonate (MMA). D₃‐methylmalonate (D₃‐MMA) was used as an ISTD. Multiple Reaction Monitoring (MRM) transitions for 13 analytes were monitored: m/z 134/74 for Asp, 148/56 for Glu, and 147/84 for Gln in positive mode; and m/z 145/83 for α‐KG, 114.9/71 for fumarate, 117.1/73 for succinate, 132.8/71 for malate, 190.8/111 for citrate, 190.8/73 for isocitrate, 117.1/71 for 2OH‐3MBA, 103/59 for malonate, 117/73 for MMA and 120/76 for D₃‐MMA (ISTD) in negative mode. Calibration curves with five standard mixes (conc. 100, 10, 1, 0.1, and 0.01 μg/ml) were used for quantification for all 12 metabolites. The samples were not normalized to cell count, but only for ISTD, hence unit μg/ml.

We used the Gene‐Module Association Determination (G‐MAD) algorithm, implemented in the GeneBridge toolkit (Li et al, 2019), to predict the gene function of ECHDC1. G‐MAD considers transcriptome data sets from six species (human, mouse, rat, fly, worm, and yeast), and performs a competitive gene set testing method—Correlation Adjusted mEan RAnk gene set test (CAMERA) (Wu & Smyth, 2012), which adjusts for inter‐gene correlations to compute the enrichment between gene‐of‐interest and biological modules. Gene‐module connections with enrichment P‐values that survive multiple testing corrections of the gene or module numbers are scored and normalized (range: −1 to 1). We restricted our analysis to breast tissue datasets (n = 153), compiled in the GeneBridge expression database to enrich for breast tissue‐specific associations (Dataset EV8).

This is an open access article under the terms of the http://creativecommons.org/licenses/by/4.0/ License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol