Graph-Based Signatures to Represent Small Molecules

Graph modeling is a well-established mathematical approach for representing chemical entities. It relies on structural fingerprints of molecular descriptors to relate molecular structures to their biological activities. Graph-based signatures have proven to be a general and powerful tool for modeling the physicochemical properties of small molecules [14] and other biological entities [16,44,45]. We previously proposed graph-based signatures that represent protein structure geometry, and the molecular interactions of proteins with their binding partners, as graphs [19,46-49]. These signatures have been successfully used and adapted to train and test a range of machine learning models, for example to predict and optimize pharmacokinetic and toxicity properties with the pkCSM tool [15]. Here, we adapted these distance-based signatures to model small-molecule chemistry, enabling the prediction of anticancer properties.

The graph-based signatures have two key components: (i) compound physicochemical properties obtained with the RDKit cheminformatics library [50] and (ii) distance-based signatures, described as a cumulative distribution function of the distances between atoms, which are classified according to their physicochemical properties (pharmacophores) (Table S1). The distance-based patterns are encoded in a small-molecule graph-based signature adapted from the Cutoff Scanning Matrix method [51]. In this approach, each dimension of the molecular signature expresses the number of atoms (characterized by pharmacophore class) within a particular distance in the graph. The distance between any two nodes in the molecular graph is the cost of the shortest path between them, calculated with Johnson’s algorithm. This cost is the total weight of the edges on the path and, because every edge is assigned unit weight (Figure S25), it equals the number of edges in the path [15]. Using the graph-based signature approach, a total of 264 features were obtained and used to train and test the predictive models.
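
The distance-based component can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the one-class-per-atom labeling, the class names, and the toy three-atom chain are all assumptions made for illustration, and the sketch counts atom pairs per pharmacophore-class pair at each cutoff, which is one plausible reading of the cutoff scanning idea. Because every edge has unit weight, a breadth-first search is used here in place of Johnson's algorithm; for unit weights the two give the same shortest-path lengths.

```python
from collections import deque
from itertools import combinations_with_replacement


def shortest_path_lengths(adjacency):
    """All-pairs shortest path lengths, measured in edges.

    Runs a breadth-first search from every node. With unit edge
    weights this matches what Johnson's algorithm would return.
    """
    dist = {}
    for source in adjacency:
        seen = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen[neighbor] = seen[node] + 1
                    queue.append(neighbor)
        dist[source] = seen
    return dist


def distance_signature(adjacency, labels, classes, max_dist):
    """Cumulative counts of atom pairs per class pair and cutoff.

    adjacency: dict mapping atom index -> list of bonded atom indices
    labels:    dict mapping atom index -> one class name (an assumption;
               a real atom may belong to several pharmacophore classes)
    classes:   iterable of all class names
    max_dist:  largest cutoff, in number of bonds
    """
    dist = shortest_path_lengths(adjacency)
    class_pairs = list(combinations_with_replacement(sorted(classes), 2))
    nodes = sorted(adjacency)
    signature = {}
    for cutoff in range(1, max_dist + 1):
        counts = dict.fromkeys(class_pairs, 0)
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                d = dist[a].get(b)  # None if a and b are disconnected
                if d is not None and d <= cutoff:
                    key = tuple(sorted((labels[a], labels[b])))
                    counts[key] += 1
        for pair, n in counts.items():
            signature[(pair, cutoff)] = n
    return signature


# Toy chain of three atoms, 0 - 1 - 2, with made-up class labels.
toy_adjacency = {0: [1], 1: [0, 2], 2: [1]}
toy_labels = {0: "hydrophobe", 1: "hydrophobe", 2: "donor"}
toy_classes = {"hydrophobe", "donor"}

toy_signature = distance_signature(toy_adjacency, toy_labels, toy_classes, 2)
for (pair, cutoff), n in sorted(toy_signature.items()):
    print(pair, cutoff, n)
```

Each entry of the resulting dictionary corresponds to one dimension of the signature. The counts are cumulative, so the value for a class pair can only grow or stay the same as the cutoff increases, which mirrors the cumulative distribution described above.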

It is worth noting that there are other ways to represent small molecules when building machine learning models for molecular property prediction. For instance, one current approach is based on deep feature generation through graph neural networks (GNNs) [52]. Although this approach has been successful, a major drawback of GNN-generated features is their lack of inherent interpretability, which is a natural property of graph-based signatures. Accordingly, graph-based signature features were chosen first for pdCSM-cancer in preference to other feature types. In future work, other types of features (e.g., deep GNN features) will be incorporated into the pdCSM-cancer models after their predictive benefits have been analyzed.
