The DILI classification is provided for nine independent data sets supplied by the CAMDA 2020 CMap Drug Safety Challenge. The collection contains six gene expression data sets from human cell lines, chemical descriptors of the drugs, cell-based screening of pathway perturbations caused by the drugs, and information on reported DILI incidents from the FDA FAERS database. The structure of the entire data set is shown in Figure 1.
Figure 1. Structure of drug-induced liver injury (DILI) data sets. Each vertical bar corresponds to a compound that is present in a given set. Only the MOLD and FAERS data sets contain information on all compounds.
The gene expression data for the study were generated using the L1000 Platform (Subramanian et al., 2017), developed for the Connectivity Map (Lamb, 2007) at the Broad Institute. The Connectivity Map (also known as CMap) is a collection of genome-wide transcriptional expression data from cultured human cells treated with bioactive small molecules.
L1000 is a gene-expression profiling assay based on the direct measurement of a reduced representation of the transcriptome and computational inference of the portion of the transcriptome not explicitly measured. The abundance of ~1,000 landmark transcripts is measured directly. Eighty additional invariant transcripts are also explicitly measured to enable quality control, scaling, and normalization. Measurements of transcript abundance are made with a combination of a coupled ligase detection and polymerase chain reaction, optically addressed microspheres, and a flow-cytometric detection system. The expression of the remaining genes is inferred computationally from that of the measured ones.
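The inference step can be illustrated with a minimal sketch. It assumes a linear model, in which the expression of a non-measured gene is estimated as a weighted sum of the landmark measurements. The function name, the weights, and the expression values below are hypothetical and serve only as an illustration; they are not taken from the L1000 pipeline.

```python
# Sketch only: assumes a linear inference model; all numbers are made up.

def infer_gene(landmark_expr, weights, intercept=0.0):
    """Estimate one non-measured gene as a weighted sum of landmark genes."""
    if len(landmark_expr) != len(weights):
        raise ValueError("one weight per landmark gene is required")
    return intercept + sum(w * x for w, x in zip(weights, landmark_expr))


# Three hypothetical landmark measurements (a real profile has ~1,000).
landmarks = [2.0, 0.5, 1.0]
# Hypothetical weights, as would be fitted beforehand on fully measured profiles.
weights = [0.5, -1.0, 2.0]

inferred = infer_gene(landmarks, weights, intercept=0.25)
```

In such a scheme one set of weights would be needed for every inferred gene, so the quality of the inferred values depends on the data used to fit them.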
The following human cell lines were used in the current study:
A375: human melanoma—347 observations,
HA1E: human embryonic kidney—347 observations,
HEPG2: human liver cancer—235 observations,
MCF7: human breast cancer—415 observations,
PC3: human prostate cancer—415 observations,
PHH: primary human hepatocytes (currently considered to be the gold standard for hepatic in vitro culture models)—171 observations.
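The sizes of the six gene expression sets listed above can be collected in a small lookup table, which makes the imbalance between cell lines explicit. This is only a convenience sketch; the variable names are our own, and the counts are those given in the list.

```python
# Number of observations per cell line, as listed in the text.
observations = {
    "A375": 347,
    "HA1E": 347,
    "HEPG2": 235,
    "MCF7": 415,
    "PC3": 415,
    "PHH": 171,
}

# Total number of expression profiles across all six cell lines.
total = sum(observations.values())

# Cell line with the fewest observations.
smallest = min(observations, key=observations.get)

# Size of each set relative to the largest one.
largest_count = max(observations.values())
relative_size = {name: n / largest_count for name, n in observations.items()}
```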
Chemical descriptors of the drugs were computed with the Mold2 program (Hong et al., 2008). Mold2 computes a large and diverse set of molecular descriptors encoding two-dimensional chemical structure information. The Tox21 database (Huang et al., 2016) contains cell-based screening of pathway perturbations caused by the drugs. The FDA Adverse Event Reporting System (FAERS) (Kumar, 2019) is a database that contains information on adverse event and medication error reports submitted to the FDA. The database is designed to support the FDA's post-marketing safety surveillance program for drug and therapeutic biologic products. Because it collects reports on marketed products only, FAERS is not useful for predicting the effects of new compounds.
The challenge organizers provided several alternative classifications of DILI based on two different classification schemes: the DILI severity score and the commercial status of the drug (Chen et al., 2016; Li et al., 2020). Two further DILI decisions were also provided. They were later revealed to be controls for overfitting and for the predictive potential of the approaches used by participants. One was simply a random decision not connected to any descriptors whatsoever; the other was a decision based on one of the molecular descriptors generated by Mordred. Altogether, six different DILI scales were provided to participants:
DILI severity score (decision DILI2 in the challenge) (see Table 1);
Table 1. Number of objects in DILI2 classes.
binary DILI severity score ≤ 6 (decision DILI1 in the challenge);
decision based on the commercial status of the drug (decision DILI4 in the challenge) with the following classes: “withdrawn,” “box warning,” “warning and precaution,” “adverse event,” and “no match” (see Table 2);
Table 2. Number of objects in DILI4 classes.
binary decision based on the commercial status of the drug (decision DILI3 in the challenge), contrasting “withdrawn,” “box warning,” and “warning and precaution” with “adverse event” and “no match”;
the artificial DILI class (decision DILI5 in the challenge) that was discovered to be a non-informative random decision (negative control);
the artificial DILI class (decision DILI6 in the challenge) that was constructed using the molecular weight of the compound as the decision (positive control).
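The relation between the five-class and the binary commercial-status decisions, and the idea behind the negative control, can be sketched as follows. The coding of the first group as 1 and the second as 0 is our assumption, as are the function names and the use of a seeded random generator; the organizers' actual procedure for generating the random decision is not described here.

```python
import random

# Grouping of the five DILI4 classes into the two DILI3 classes.
# Coding the first group as 1 is an assumption made for this sketch.
DILI4_POSITIVE = {"withdrawn", "box warning", "warning and precaution"}
DILI4_NEGATIVE = {"adverse event", "no match"}


def dili4_to_dili3(label):
    """Collapse a five-class commercial-status label into a binary one."""
    if label in DILI4_POSITIVE:
        return 1
    if label in DILI4_NEGATIVE:
        return 0
    raise ValueError(f"unknown DILI4 class: {label!r}")


def random_control(n_compounds, seed=0):
    """Hypothetical negative control: labels unrelated to any descriptor."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(n_compounds)]


dili4 = ["withdrawn", "no match", "box warning", "adverse event"]
dili3 = [dili4_to_dili3(label) for label in dili4]
control = random_control(len(dili4))
```

A model that appears to predict such a random decision better than chance on held-out data indicates information leakage or overfitting in the validation procedure.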