The gene expression data used in this study were gathered from the CMap L1000 Assay Platform [15]. The platform provides more than one million gene expression profiles from a wide range of cell lines treated with different drugs at different doses and treatment durations. Based on the assumption that gene expression is highly correlated, the platform measures a subset of approximately 1000 landmark genes, and the resulting profiles are used to infer the expression of the remaining genes. We used CMap L1000 level 5 data, which contain z-score values corresponding to the normalized differential expression between drug treatment and control across different conditions.
We manually curated a list of phenotypes closely related to DILI and identified the genes associated with these phenotypes using the DisGeNET database v6.0 [17] (Table 2). We restricted disease-gene associations to expert-curated repositories: UniProt [18], the Comparative Toxicogenomics Database (CTD) [19], ORPHANET [20], the Clinical Genome Resource (CLINGEN) [21], the Genomics England PanelApp [22] and the Cancer Genome Interpreter (CGI) [23]. We kept only the phenotypes with at least 10 curated gene associations. The full list of associations between DILI phenotypes and genes can be found in Supplementary Table 1.
List of manually selected phenotypes related to DILI. The selected phenotypes were required to have at least 10 gene associations. Genetically redundant phenotypes have been merged into a single term. Empty cells correspond to phenotypes for which the expansion through the network using GUILDify was not functionally coherent.
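The phenotype selection rule described above (keep only associations from curated sources, then keep phenotypes with at least 10 distinct genes) can be sketched as follows. This is a minimal illustration and not the authors' code: the source labels, the tuple layout of the input, and the toy phenotype and gene names are all assumptions, and the labels used inside a real DisGeNET export may differ.

```python
from collections import defaultdict

# Curated sources named in the text. The exact label strings are assumptions;
# a real DisGeNET export may spell them differently.
CURATED_SOURCES = {"UNIPROT", "CTD", "ORPHANET", "CLINGEN", "GENOMICS_ENGLAND", "CGI"}
MIN_GENES = 10  # threshold stated in the text


def select_phenotypes(associations, sources=CURATED_SOURCES, min_genes=MIN_GENES):
    """Return {phenotype: set of genes} for phenotypes with enough curated genes.

    `associations` is an iterable of (phenotype, gene, source) tuples
    (a hypothetical layout chosen for this sketch).
    """
    genes_by_phenotype = defaultdict(set)
    for phenotype, gene, source in associations:
        if source in sources:  # discard non-curated evidence
            genes_by_phenotype[phenotype].add(gene)
    return {
        phenotype: genes
        for phenotype, genes in genes_by_phenotype.items()
        if len(genes) >= min_genes
    }


# Toy input with hypothetical names: PHENOTYPE_A has 10 curated genes,
# PHENOTYPE_B has only 4 curated genes plus 16 from a non-curated source.
toy = [("PHENOTYPE_A", f"GENE{i}", "CTD") for i in range(10)]
toy += [("PHENOTYPE_B", f"GENE{i}", "UNIPROT") for i in range(4)]
toy += [("PHENOTYPE_B", f"GENE{i}", "TEXT_MINED") for i in range(4, 20)]

selected = select_phenotypes(toy)
```

In this toy case only `PHENOTYPE_A` is retained, because the non-curated associations of `PHENOTYPE_B` do not count toward the threshold.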
The chemical structures of the drugs considered in the study were provided by the CAMDA challenge in the form of Simplified Molecular-Input Line-Entry System (SMILES) strings. To use this type of data, we calculated the similarity between all pairs of compounds, creating a matrix of chemical similarity. Specifically, we used the R package RxnSim [24] to calculate the similarity matrix using the Tanimoto coefficient [25]. We used the function ms.compute.sim.matrix (default parameters), which computes the fingerprints of the SMILES and the fingerprint similarity between pairs of SMILES. The full list of SMILES is provided in Supplementary Table 2, and the matrix of Tanimoto similarity between SMILES in Supplementary Table 3.
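To make the similarity measure concrete, the Tanimoto coefficient between two fingerprints $A$ and $B$, viewed as sets of "on" bits, is $|A \cap B| / |A \cup B|$. The sketch below computes a pairwise similarity matrix from such bit sets. It is not RxnSim and does not reproduce its fingerprinting step: the fingerprints and drug names are hypothetical toy values, whereas in practice they would be derived from the SMILES strings by a cheminformatics toolkit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


def similarity_matrix(fingerprints):
    """Symmetric matrix of pairwise Tanimoto coefficients, as a nested dict."""
    names = list(fingerprints)
    matrix = {name: {} for name in names}
    for i, a in enumerate(names):
        for b in names[i:]:  # upper triangle, including the diagonal
            score = tanimoto(fingerprints[a], fingerprints[b])
            matrix[a][b] = score
            matrix[b][a] = score  # mirror to keep the matrix symmetric
    return matrix


# Hypothetical fingerprints (sets of bit positions), for illustration only.
toy_fingerprints = {
    "drug_A": {1, 2, 3, 4},
    "drug_B": {3, 4, 5, 6},
    "drug_C": {7, 8},
}

sim = similarity_matrix(toy_fingerprints)
```

Here `drug_A` and `drug_B` share 2 of 6 distinct bits, giving a similarity of 1/3, while `drug_C` shares no bits with either and scores 0.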
The targets of the compounds considered in the study were retrieved from three different databases: DGIdb [26], HitPick [27] and SEA [28]. DGIdb gathers validated drug targets, whereas HitPick and SEA additionally provide predicted targets based on chemical similarity. We used the names of the drugs to retrieve the drug-protein associations from DGIdb, whereas the SMILES strings were used for the HitPick and SEA web servers. Any drug-protein pair that was either provided by the database or predicted to interact by the web servers was included among the drug-target associations. As a consequence, we did not distinguish between validated and predicted targets. However, this allowed us to increase the number of input drugs and extend the potential recall of our method. After collecting all targets, we created a matrix with all the drugs in rows and all the target proteins in columns. The cells of the matrix had the value 1 if the drug targeted the protein and 0 otherwise. Three drugs from the DILIrank dataset (alaproclate, fluvastatin and tenofovir) and two drugs from the independent hold-out test dataset (entecavir and vinorelbine) had no targets in these databases. These drugs were used neither for training nor for testing when drug targets were used as features. The full list of drug-target associations is provided in Supplementary Table 4.
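The construction of the binary drug-target matrix described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the function name, the dictionary inputs standing in for the DGIdb, HitPick and SEA results, and the drug and protein identifiers are all hypothetical.

```python
def build_target_matrix(drugs, *sources):
    """Merge drug-target associations and build a binary matrix.

    `drugs` is the ordered list of drugs of interest; each source is a dict
    mapping a drug to an iterable of target proteins. Validated and predicted
    targets are pooled without distinction, as in the text.
    Returns (kept_drugs, proteins, matrix, dropped_drugs).
    """
    merged = {drug: set() for drug in drugs}
    for source in sources:
        for drug, targets in source.items():
            if drug in merged:
                merged[drug] |= set(targets)  # union across sources

    # Drugs without any target are excluded from the target-based features.
    dropped = [drug for drug in drugs if not merged[drug]]
    kept = [drug for drug in drugs if merged[drug]]

    proteins = sorted(set().union(*(merged[drug] for drug in kept)))
    matrix = [[1 if p in merged[drug] else 0 for p in proteins] for drug in kept]
    return kept, proteins, matrix, dropped


# Hypothetical inputs standing in for the three resources.
dgidb = {"drug_A": ["P1"]}
hitpick = {"drug_A": ["P2"], "drug_B": ["P2"]}
sea = {"drug_B": ["P3"]}

kept, proteins, matrix, dropped = build_target_matrix(
    ["drug_A", "drug_B", "drug_C"], dgidb, hitpick, sea
)
```

In this toy case `drug_C` has no target in any source and is dropped, mirroring how the five drugs without targets were left out of the target-based models.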