The three datasets (refined sets) are downloaded from PDBbind (www.pdbbind.org.cn). Each refined set is split into a training set and a test set: the core set serves as the test set, and the training set is the refined set excluding the core set. Detailed information on the datasets is given in Table 2.
In our ES-IDM–based PerSpect simplicial complex models, the distance is used as the filtration parameter. The filtration value runs from 0.00 to 25.00 Å, and Laplacian matrices are generated with a step of 0.10 Å, so a total of 250 Laplacian matrices are obtained from each filtration process. In total, there are 4*9 = 36 types of ES-IDMs, formed between 4 types of protein atoms (C, N, O, and S) and 9 types of ligand atoms (C, N, O, S, P, F, Cl, Br, and I).
In our ES-IEM–based PerSpect simplicial complex models, the interaction strength is used as the filtration parameter, and its value runs from 0.00 to 1.00. In our calculation, the constant c is set to 100. Laplacian matrices are generated with a step of 0.01, giving a total of 100 Laplacian matrices for each filtration process. There are 5*10 = 50 types of ES-IEMs, formed between 5 types of protein atoms (H, C, N, O, and S) and 10 types of ligand atoms (H, C, N, O, S, P, F, Cl, Br, and I).
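As a reading aid, the following Python sketch shows one way the Dim(0) persistent Laplacians over the distance filtration could be assembled. It assumes a bipartite element-specific construction (edges only between protein and ligand atoms of the selected element pair) and uses the fact that the Dim(0) combinatorial Laplacian equals the graph Laplacian of the complex's 1-skeleton. The function name, array layout, and use of NumPy/SciPy are our own assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dim0_persistent_laplacians(protein_xyz, ligand_xyz, f_max=25.0, step=0.10):
    """Hypothetical sketch: Dim(0) Laplacians of an element-specific bipartite
    complex, one per filtration value (0.10, 0.20, ..., 25.00 Angstrom).

    At filtration value f, a protein atom and a ligand atom are connected
    if their distance is <= f; intra-protein and intra-ligand pairs are
    never connected in this element-specific interactive construction.
    """
    n_p, n_l = len(protein_xyz), len(ligand_xyz)
    n = n_p + n_l
    cross = cdist(protein_xyz, ligand_xyz)        # protein-ligand distances only

    laplacians = []
    for f in np.arange(step, f_max + 0.5 * step, step):   # 250 filtration values
        adj = np.zeros((n, n))
        mask = cross <= f
        adj[:n_p, n_p:] = mask
        adj[n_p:, :n_p] = mask.T
        deg = np.diag(adj.sum(axis=1))
        laplacians.append(deg - adj)              # graph Laplacian = Dim(0) Laplacian
    return laplacians
```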
Furthermore, we consider the following 11 PerSpect features: Dim(0) persistent multiplicity (of the zero eigenvalue), Dim(1) persistent multiplicity (of the zero eigenvalue), persistent maximum, persistent minimum, persistent mean, persistent standard deviation (SD), persistent Laplacian graph energy, persistent generalized mean graph energy, PerSpect moment (second order), persistent quasi-Wiener index, and persistent spanning tree number.
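The sketch below illustrates how such statistics could be computed from the eigenvalue spectrum of a single Laplacian, using common spectral graph theory definitions (e.g., Kirchhoff's theorem for the spanning tree number). The exact definitions used by the authors may differ; the zero tolerance, the treatment of the persistent minimum as the smallest nonzero eigenvalue, and the function names are our assumptions.

```python
import numpy as np

def spectral_attributes(L, tol=1e-8):
    """Hypothetical sketch of per-Laplacian spectral statistics.
    Definitions follow standard spectral graph theory and may differ
    in detail from the authors' implementation."""
    n = L.shape[0]
    eig = np.linalg.eigvalsh(L)                  # real symmetric Laplacian
    nonzero = eig[eig > tol]
    mean = eig.mean()                            # equals the average vertex degree

    return {
        "multiplicity": int(np.sum(eig <= tol)),          # persistent multiplicity of zero
        "max": float(eig.max()),
        "min_nonzero": float(nonzero.min()) if nonzero.size else 0.0,
        "mean": float(mean),
        "std": float(eig.std()),
        # Laplacian graph energy: sum of |lambda_i - average degree|
        "graph_energy": float(np.abs(eig - mean).sum()),
        # generalized mean graph energy (normalized by n); assumed form
        "gen_mean_energy": float(np.abs(eig - mean).sum() / n),
        # second-order spectral moment
        "moment_2": float(np.mean(eig ** 2)),
        # quasi-Wiener index: n * sum of reciprocals of nonzero eigenvalues
        "quasi_wiener": float(n * np.sum(1.0 / nonzero)) if nonzero.size else 0.0,
        # spanning tree count via Kirchhoff's theorem (log form to avoid overflow)
        "log_spanning_trees": float(np.log(nonzero).sum() - np.log(n)) if nonzero.size else 0.0,
    }
```

In this picture, the statistics would be evaluated for each of the 250 (ES-IDM) or 100 (ES-IEM) Laplacians of a filtration and concatenated across all atom-pair types to form the feature vector.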
Note that, other than the persistent multiplicity, all PerSpect attributes are calculated from Dim(0) Laplacians. To sum up, in our ES-IDMs there are 36 types of atom combinations as stated above, and the total number of features is 36 [types] * 250 [persistence] * 11 [eigen features] = 99,000. Similarly, there are 50 types of ES-IEMs, and the number of features is 50 [types] * 100 [persistence] * 11 [eigen features] = 55,000. Because the feature vector is large, decision tree–based models are used to mitigate overfitting. In particular, gradient boosting tree (GBT) models have delivered better results in protein-ligand binding affinity prediction. The parameters of the GBT model are listed in Table 3. Note that 10 independent regressions are conducted, and the medians of the 10 Pearson correlation coefficients (PCCs) and root mean square errors (RMSEs) are taken as the performance measurement of our PerSpect ML model.
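Since Table 3 is not reproduced here, the following sketch uses placeholder GBT hyperparameters with scikit-learn's GradientBoostingRegressor to illustrate the 10-run evaluation protocol (median PCC and RMSE); it should not be read as the authors' exact settings.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

def evaluate_perspect_gbt(X_train, y_train, X_test, y_test, n_runs=10):
    """Hypothetical sketch: 10 independent GBT regressions with different
    random seeds; the hyperparameters below are placeholders, not the
    values from Table 3."""
    pccs, rmses = [], []
    for seed in range(n_runs):
        model = GradientBoostingRegressor(
            n_estimators=10000,      # placeholder values; see Table 3
            learning_rate=0.01,
            max_depth=7,
            subsample=0.7,
            max_features="sqrt",
            random_state=seed,
        )
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        pccs.append(pearsonr(y_test, pred)[0])
        rmses.append(mean_squared_error(y_test, pred) ** 0.5)
    return float(np.median(pccs)), float(np.median(rmses))
```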