We use ten datasets in this study: six main datasets and four independent test sets (Table 7). The main datasets consist of four classification sets (DUD-E, Human, C. elegans and KIBA) and two regression sets (PDBbind and Davis). The independent sets comprise two regression sets, CASF-2013 and Astex Diverse, and two classification sets, MUV and BindingDB.

An overview of relevant datasets

An asterisk (*) indicates a structure dataset

Best values are highlighted in bold

The Directory of Useful Decoys Enhanced (DUD-E), a benchmark dataset for evaluating structure-based virtual screening methods, is used for classification [26]. Target clustering is applied to avoid redundancy between the training and testing sets. Following Ragoza et al., we evaluate our model with 3-fold cross-validation, and proteins are clustered by global sequence alignment so that targets with greater than 80% sequence identity are placed in the same fold. The negative-to-positive ratio is set to 3:1 to limit class imbalance. A total of 91,220 samples from the DUD-E dataset are used in this study. Additionally, the maximum unbiased validation (MUV) dataset, which consists of 17 targets, each with 30 actives and 15,000 decoys, is used as an independent test set. Only the 9 of the 17 targets that have structural information are used, to allow comparison with structure-based methods. Similarly, the Human dataset, with 3,369 positive and 10,107 negative samples, and the C. elegans dataset, with 4,000 positive and 12,000 negative samples, are used for classification [16]. The kinase inhibitor bioactivity (KIBA) dataset is also used for classification [27]. As in a previous study [28], the KIBA threshold of 3.0 becomes 12.1 after transformation, and protein–ligand interactions with values greater than 12.1 are regarded as positive samples. All 118,253 samples from KIBA are used only for classification in this study because of the different bioactivity values within the dataset. A dataset containing 2,706 positive and 2,802 negative samples, carefully curated from the BindingDB database, is used as an independent test set [24].
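The cluster-aware fold assignment used for DUD-E can be illustrated with a minimal Python sketch. It assumes pairwise sequence identities have already been computed by global alignment and are passed in as a dictionary. The function names, the single-linkage merging rule and the greedy size balancing are illustrative assumptions, not details taken from this study or from Ragoza et al.

```python
from itertools import combinations


def cluster_targets(targets, identity, threshold=0.8):
    """Group targets so that any pair above `threshold` identity shares a cluster.

    `identity` maps frozenset({a, b}) -> fractional sequence identity
    (hypothetical input format; missing pairs are treated as 0.0).
    Single-linkage merging is an assumption of this sketch.
    """
    parent = {t: t for t in targets}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(targets, 2):
        if identity.get(frozenset((a, b)), 0.0) > threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for t in targets:
        clusters.setdefault(find(t), []).append(t)
    return list(clusters.values())


def assign_folds(clusters, n_folds=3):
    """Place each cluster, largest first, into the currently smallest fold."""
    folds = [[] for _ in range(n_folds)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds


# Toy example with made-up target names and identities.
targets = ["A", "B", "C", "D", "E"]
identity = {
    frozenset(("A", "B")): 0.92,
    frozenset(("C", "D")): 0.85,
    frozenset(("A", "C")): 0.40,
}
folds = assign_folds(cluster_targets(targets, identity), n_folds=3)
```

Because whole clusters are moved as a unit, no two targets above the identity threshold can end up on opposite sides of a train/test split, at the cost of folds that may differ somewhat in size.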

The PDBbind v.2016 database, which provides structural complexes with the corresponding binding affinity data (Kd, Ki), is used for regression [29]. To evaluate the generalizability of our model, the CASF-2013 benchmark with 195 complexes and the Astex Diverse set with 73 complexes are used as additional independent test sets [21, 30]; samples without binding affinity data and those present in PDBbind (1YVF in the general set) are excluded. We split the PDBbind set in exactly the same way as Pafnucy to facilitate a fair comparison: (i) the whole core set (290 complexes) is used as an external test set; (ii) a total of 1,000 complexes (same as Pafnucy) from the refined set are used for validation; and (iii) the remaining complexes from the refined and general sets are used for training. Thus, 13,196 complexes from PDBbind are used for regression. Similarly, the kinase dataset Davis, consisting of 30,056 interactions with the corresponding binding affinity (Kd), is used for regression [31]. Note that most samples in Davis have a binding value of 5, which would skew the distribution of our overall dataset; these samples are therefore removed, leaving 9,125 Davis samples. In total, 271,816 interactions are used in this study, including 22,589 for regression and 249,227 for classification.
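The three-step PDBbind split described above can be sketched as follows. This is a minimal illustration, not the Pafnucy code: the function name `split_pdbbind`, the seeded random choice of validation complexes and the explicit removal of core entries from the refined pool are all assumptions made so that the three subsets are guaranteed to be disjoint.

```python
import random


def split_pdbbind(general_ids, refined_ids, core_ids, n_valid=1000, seed=0):
    """Return (train, valid, test) lists of complex IDs.

    Assumption: validation complexes are sampled at random with a fixed
    seed; the source only states that 1,000 refined-set complexes are used.
    """
    core = set(core_ids)

    # (i) The whole core set is the external test set.
    test = sorted(core)

    # (ii) Validation complexes come from the refined set, excluding core.
    refined_pool = sorted(set(refined_ids) - core)
    valid = random.Random(seed).sample(refined_pool, n_valid)

    # (iii) Everything left in the refined and general sets is training data.
    held_out = core | set(valid)
    train = sorted((set(general_ids) | set(refined_ids)) - held_out)
    return train, valid, test
```

Called with the full ID lists of the general, refined and core sets, the function returns three disjoint subsets whose sizes sum to the number of unique complexes supplied.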

