Set N: We constructed a non-redundant database, set N, consisting of 6,635 ligand-bound structures obtained from the Binding MOAD released in 2014 27 to derive the MF-RA. Binding MOAD groups structures at the 90% sequence identity level, and all of the structures in Binding MOAD having more than 70% sequence identity with any protein in set T or set S were excluded from set N.
Set T and set S: Set T and set S are two benchmark protein structure databases that were used to evaluate the performances of the different ligand-binding site prediction methods. The two test databases have been widely used in previous studies of ligand-binding site prediction methods15,19,20,28. Set T consists of 210 ligand-bound protein structures, whereas set S includes 96 structures that can be grouped into two classes, specifically 48 ligand-unbound protein structures and their corresponding ligand-bound forms. To obtain the actual ligand-binding sites of the 48 ligand-unbound structures, the ligand-bound structures were aligned with their ligand-unbound forms using the PyMOL align function, and the ligands’ coordinates and connectivity information were obtained from the ligand-bound structures29.
Set L: The average molecular weight of the ligands in set S is as high as 269 dalton, whereas that of some ligands, such as “NAD” and “HEM”, exceeds 500 dalton. Additionally, large ligand-binding sites can be easily identified by geometry-based methods or even by eye based on the three-dimensional (3D) protein structures. Set L was constructed to evaluate the performances of different methods for small-volume ligand-binding site prediction and consists of 169 ligand-bound structure chains downloaded from the Protein Data Bank (PDB)30. In the PBD, each structure chain has only one ligand, and the molecular weight of the ligand should be less than 150 dalton. Inorganic molecules and metals in protein structures are ignored, the ligand should not be completely exposed to the solvent, and the ligand-binding site is formed by the only chain in the structure. In set L, no two structures have more than 70% sequence identity. Detailed information for set L is provided in Table S6.
Set P: Set P was derived from a database of dimeric protein complexes that consists of 1,611 structures obtained in previous studies31,32. No two chains from different protein complexes share a sequence identity of 35% or higher. In addition, all ligands in proteins should be located at protein-protein interfaces: the distance between a ligand and both protein chains, defined as the shortest distance between any ligand atom and any residue atom that belongs to the protein chains, is less than 5 Å. After refinement, set P includes 149 protein structures and was constructed to evaluate the accuracy of the prediction of ligand-binding sites on protein-protein interaction region. Detailed information for set P is provided in Table S7.
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.
Tips for asking effective questions
+ Description
Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.