Graph canonicalization In knowledge graphs, several nodes may co-exist, while representing the same entity. For example, a unique drug in PGxLOD can be represented by two nodes: one from PharmGKB and another from DrugBank. Each of these two nodes have their own connections to other nodes in the knowledge graph. Therefore, the union of their edges constitutes all the available knowledge about the drug they both represent. As they represent the same drug, they are connected through an owl:sameAs edge. In order to avoid traversing such edges, we canonicalize the knowledge graph, that is to say, nodes linked by owl:sameAs edges are grouped into a unique node as a pre-processing step, before building graph-based features. Such a procedure corresponds, in graph theory, to the contraction of owl:sameAs edges.
Paths, path patterns, and neighbors as drug features
From an arbitrary set of drugs D (such as DILI DILI) and a canonicalized knowledge graph (such as PGxLOD), a set of graph-based features is built on D. We distinguish three kinds of features: paths, path patterns and neighbor nodes. First, we build paths rooted by drugs from D. The neighborhood of each drug is explored in the graph with a max distance of k, generating sequences of predicates and nodes of max length k. Accordingly, if a path is found in the knowledge graph (with ), the drug d is associated with the feature in the output matrix.
Second, we build path patterns that generalize paths by considering ontology classes instantiated by nodes. The aim of path patterns is to offer more general descriptions, which have more chances to be shared by several entities. For example, if instantiates and instantiates , we add the following path patterns: and . It is noteworthy that, to allow a high level of generalization, we always consider the top-level class in the generalization procedure. To leverage hierarchies organizing ontology classes, a node n is only generalized by ontology classes at a distance of at most t from itself, following instantiation and subsumption edges.
Third, we list neighboring nodes, i.e., any node that can be reached from within a distance of max k. We do not keep track of the distance between d and its neighbors: if a node n can be reached from a drug after 2 hops and from a drug after 3 hops, these two drugs will be associated with the neighbor n in the output matrix. Interestingly, a neighbor n can be represented by the general path pattern .
Features can potentially be noisy and numerous, leading to a combinatorial explosion of their number. That is why, we add several constraints to only keep the most interesting features. First, we only keep the most specific paths and path patterns, considering that a node is more specific than the classes it instantiates and a class is more specific than its superclasses. A path or path pattern is more specific than another path or path pattern if each node/class in is more specific than the node/class at the same position in . Thus, if is associated with the exact same drugs as , then the path pattern is removed and only the path constitutes a feature. Similarly, if is associated with the exact same drugs as , then we only keep the first one. Second, the exploration is stopped at nodes whose degree is greater than a parameter deg. Such nodes are usually called hubs [27], and expanding a path ending at a hub may generate a large number of new paths associated with the same drugs. Third, we consider minimal and maximal supports of features (denoted and ) as additional filters. The support of a feature consists in the number of drugs associated with a feature, i.e., neighbors of a same entity or rooting a path or a path pattern. Only features associated with more than drugs and less than drugs are output in the final matrix. Fourth, two blacklists avoid traversing noisy or unwanted edges. Edges are not traversed if their predicate is blacklisted (in ) or if the adjacent node instantiates (directly or indirectly) a blacklisted class (in ). For example, we blacklist predicates of PROV-O that are used to describe provenance metadata. By blacklisting rdf:type in , we make sure the exploration is performed through entities and not classes of PGxLOD. Relations between classes are only considered when generalizing paths. Aside from noisy features and combinatorial explosion, we use these blacklists to prevent considering features that “obviously” carry the signal we are trying to predict. For example, we blacklist all “side-effects” links from SIDER, which may directly link drugs in D with the side effect we aim at predicting. We also blacklist all classes from MeSH related to SCAR or DILI to avoid taking into account nodes instantiating them. A third blacklist () avoids generalizing nodes in paths by blacklisted classes. This is particularly useful to withdraw general classes (e.g., Drug) that increase the number of generated path patterns while not adding useful information.
Besides previous topological constraints, we also perform a domain-driven filtering configured by parameter m. Indeed, in our objective of explaining ADRs, experts highlighted that features that may be explanatory to them mention pathways, genes, GO, and MeSH terms. For this reason, we propose three post-processing atomic filters, only keeping neighbors, paths or path patterns containing at least a pathway (), a gene or a GO class (), or a MeSH class (). Such atomic filters can be combined to form disjunctive filters. For example, the filter keeps neighbors, paths or path patterns containing at least a pathway or a gene or a GO class.
Table 4 summarizes the parameters that limit the number of features in the output matrix. Additional details about our method of feature construction are provided in [29].
Parameters used to limit the number of features
Each parameter is associated with a domain for its value
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.
Tips for asking effective questions
+ Description
Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.