Benchmark Dataset and Sample Formulation

Xiangeng Wang
Yanjing Wang
Zhenyu Xu
Yi Xiong
Dong-Qing Wei

We used the same dataset as the previous study (Cheng et al., 2017b) to facilitate model comparison. The dataset consists of 3,883 drugs, each labeled with one or more of the 14 main ATC classes. It is a tidy dataset with no missing values or contradictory records. The UpSet visualization technique (Lex et al., 2014) was used for quantitative analysis of the interactions among label sets.
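The label-set interactions that an UpSet plot summarizes come down to counts of how many drugs share each exact combination of classes, alongside the size of each individual class. A minimal sketch of that counting step follows. The drug names and class assignments are invented toy values, not records from the benchmark dataset, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy labels: drug -> set of ATC main-class indices (1..14).
# These are illustrative values only, not taken from the benchmark dataset.
labels = {
    "drug_a": {1},
    "drug_b": {1, 3},
    "drug_c": {3},
    "drug_d": {1, 3},
    "drug_e": {14},
}


def label_set_counts(labels):
    """Count drugs per exact class combination (the intersection bars of an UpSet plot)."""
    return Counter(frozenset(classes) for classes in labels.values())


def class_sizes(labels):
    """Count drugs per individual class (the set-size bars of an UpSet plot)."""
    return Counter(c for classes in labels.values() for c in classes)


combo_counts = label_set_counts(labels)
sizes = class_sizes(labels)

for combo, n in combo_counts.most_common():
    print(sorted(combo), n)
```

Because a drug may belong to several classes, the class sizes sum to more than the number of drugs, while the combination counts always sum to exactly the number of drugs.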

We then adopted the method of Cheng et al. (2017b) to represent the drug samples. The dataset can be formulated in set notation as the union of the drugs in each class, $S = S_1 \cup S_2 \cup \cdots \cup S_{14}$ (1), and a sample D can be represented by concatenating the following three types of features.

A 14-dimensional vector, $D_{\mathrm{Int}} = [\Phi_1\ \Phi_2\ \Phi_3\ \cdots\ \Phi_{14}]^{T}$ (2), which represents its maximum interaction score $\Phi_i$ (Kotera et al., 2012) with the drugs in each of the 14 subsets $S_i$.

A 14-dimensional vector, $D_{\mathrm{StrSim}} = [\Psi_1\ \Psi_2\ \Psi_3\ \cdots\ \Psi_{14}]^{T}$ (3), which represents its maximum structural similarity score $\Psi_i$ (Kotera et al., 2012) with the drugs in each of the 14 subsets $S_i$.

A 14-dimensional vector, $D_{\mathrm{FigSim}} = [T_1\ T_2\ T_3\ \cdots\ T_{14}]^{T}$ (4), which represents its molecular fingerprint similarity score $T_i$ (Xiao et al., 2013) with the drugs in each of the 14 subsets $S_i$.
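Each of these vectors has the same shape: for every class, take the best score between the query drug and the members of that class. A minimal sketch of that step is given below. The data layout (a dictionary of pairwise scores and a dictionary of class members) and the function name are hypothetical. Skipping the query drug itself and scoring missing pairs or empty classes as 0.0 are assumptions of this sketch, not details confirmed by the text; see Cheng et al. (2017b) for the actual procedure.

```python
def max_score_per_class(query, scores, classes, n_classes=14):
    """Return a list whose i-th entry is the maximum score between `query`
    and the drugs in class i+1.

    scores:  dict mapping (drug, drug) pairs to a pairwise score (hypothetical layout).
    classes: dict mapping a class index (1..n_classes) to the set of its member drugs.

    Assumptions of this sketch: the query drug is not compared with itself,
    and missing pairs or empty classes contribute a score of 0.0.
    """
    vector = []
    for i in range(1, n_classes + 1):
        members = classes.get(i, set()) - {query}
        best = max((scores.get((query, m), 0.0) for m in members), default=0.0)
        vector.append(best)
    return vector


# Toy example with two populated classes; all values are invented.
toy_classes = {1: {"a", "b"}, 2: {"b", "c"}}
toy_scores = {("q", "a"): 0.2, ("q", "b"): 0.7, ("q", "c"): 0.4}

d_strsim = max_score_per_class("q", toy_scores, toy_classes)
print(d_strsim[:3])
```

The same function would serve for any of the three score types, given the corresponding pairwise scores.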

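Therefore, a given drug D is formulated by:

$D = D_{\mathrm{Int}} \oplus D_{\mathrm{StrSim}} \oplus D_{\mathrm{FigSim}}$

where ⊕ denotes the orthogonal sum, so that D is the 42-dimensional vector whose first 14 components are the $\Phi_i$, whose next 14 are the $\Psi_i$, and whose last 14 are the $T_i$.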

For more details, refer to Cheng et al. (2017b).
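As an illustration of the final step, a sample is described above as the concatenation of three 14-dimensional feature vectors, which yields 42 features per drug. The sketch below shows that assembly under the assumption that the three blocks are already available as plain Python sequences. The function name and the constant fill values are hypothetical.

```python
def formulate_drug(d_int, d_strsim, d_figsim, n_classes=14):
    """Concatenate the three per-class feature blocks into one sample vector.

    Block order follows the text: interaction scores, structural similarity
    scores, then fingerprint similarity scores.
    """
    blocks = {"d_int": d_int, "d_strsim": d_strsim, "d_figsim": d_figsim}
    for name, block in blocks.items():
        if len(block) != n_classes:
            raise ValueError(f"{name} has length {len(block)}, expected {n_classes}")
    return list(d_int) + list(d_strsim) + list(d_figsim)


# Invented constant blocks, used only to make the layout visible.
sample = formulate_drug([0.1] * 14, [0.2] * 14, [0.3] * 14)
print(len(sample))
```

With this layout, indices 0 to 13 hold the interaction scores, 14 to 27 the structural similarity scores, and 28 to 41 the fingerprint similarity scores.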
