Graph-based feature construction

Emmanuel Bresso; Pierre Monnin; Cédric Bousquet; François-Elie Calvier; Ndeye-Coumba Ndiaye; Nadine Petitpain; Malika Smaïl-Tabbone; Adrien Coulet

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

Graph-based feature construction

EB Emmanuel Bresso

PM Pierre Monnin

CB Cédric Bousquet

FC François-Elie Calvier

NN Ndeye-Coumba Ndiaye

NP Nadine Petitpain

MS Malika Smaïl-Tabbone

AC Adrien Coulet

This method is extracted from research article: BMC Med Inform Decis Mak, May 2021

Investigating ADR mechanisms with Explainable AI: a feasibility study with knowledge graph mining

DOI: 10.1186/s12911-021-01518-6

Request a Protocol

Ask a question

Favorite

Graph canonicalization In knowledge graphs, several nodes may co-exist, while representing the same entity. For example, a unique drug in PGxLOD can be represented by two nodes: one from PharmGKB and another from DrugBank. Each of these two nodes have their own connections to other nodes in the knowledge graph. Therefore, the union of their edges constitutes all the available knowledge about the drug they both represent. As they represent the same drug, they are connected through an owl:sameAs edge. In order to avoid traversing such edges, we canonicalize the knowledge graph, that is to say, nodes linked by owl:sameAs edges are grouped into a unique node as a pre-processing step, before building graph-based features. Such a procedure corresponds, in graph theory, to the contraction of owl:sameAs edges.

Paths, path patterns, and neighbors as drug features

From an arbitrary set of drugs D (such as DILI $\oplus$ $\cup$ DILI $⊖$ ) and a canonicalized knowledge graph (such as PGxLOD), a set of graph-based features is built on D. We distinguish three kinds of features: paths, path patterns and neighbor nodes. First, we build paths rooted by drugs from D. The neighborhood of each drug $d \in D$ is explored in the graph with a max distance of k, generating sequences of predicates and nodes of max length k. Accordingly, if a path $d \overset{p_{1}}{\to} n_{1} \overset{p_{2}}{\to} n_{2}$ is found in the knowledge graph (with $k = 2$ ), the drug d is associated with the feature $\overset{p_{1}}{\to} n_{1} \overset{p_{2}}{\to} n_{2}$ in the output matrix.

Second, we build path patterns that generalize paths by considering ontology classes instantiated by nodes. The aim of path patterns is to offer more general descriptions, which have more chances to be shared by several entities. For example, if $n_{1}$ instantiates $C_{1}$ and $n_{2}$ instantiates $C_{2}$ , we add the following path patterns: $\overset{p_{1}}{\to} C_{1} \overset{p_{2}}{\to} n_{2}, \overset{p_{1}}{\to} n_{1} \overset{p_{2}}{\to} C_{2}, \overset{p_{1}}{\to} C_{1} \overset{p_{2}}{\to} C_{2}, \overset{p_{1}}{\to} ⊤ \overset{p_{2}}{\to} n_{2}, \overset{p_{1}}{\to} n_{1} \overset{p_{2}}{\to} ⊤,$ and $\overset{p_{1}}{\to} C_{1} \overset{p_{2}}{\to} ⊤, \overset{p_{1}}{\to} ⊤ \overset{p_{2}}{\to} C_{2}, and \overset{p_{1}}{\to} ⊤ \overset{p_{2}}{\to} ⊤$ . It is noteworthy that, to allow a high level of generalization, we always consider the top-level class $⊤$ in the generalization procedure. To leverage hierarchies organizing ontology classes, a node n is only generalized by ontology classes at a distance of at most t from itself, following instantiation and subsumption edges.

Third, we list neighboring nodes, i.e., any node that can be reached from $d \in D$ within a distance of max k. We do not keep track of the distance between d and its neighbors: if a node n can be reached from a drug $d_{1}$ after 2 hops and from a drug $d_{2}$ after 3 hops, these two drugs will be associated with the neighbor n in the output matrix. Interestingly, a neighbor n can be represented by the general path pattern ${(\overset{*}{\to} *)}^{{0, k - 1}} \overset{*}{\to} n$ .

Features can potentially be noisy and numerous, leading to a combinatorial explosion of their number. That is why, we add several constraints to only keep the most interesting features. First, we only keep the most specific paths and path patterns, considering that a node is more specific than the classes it instantiates and a class is more specific than its superclasses. A path or path pattern $P_{1}$ is more specific than another path or path pattern $P_{2}$ if each node/class in $P_{1}$ is more specific than the node/class at the same position in $P_{2}$ . Thus, if $\overset{p_{1}}{\to} n_{1} \overset{p_{2}}{\to} n_{2}$ is associated with the exact same drugs as $\overset{p_{1}}{\to} C_{1} \overset{p_{2}}{\to} C_{2}$ , then the path pattern is removed and only the path constitutes a feature. Similarly, if $\overset{p_{1}}{\to} C_{1} \overset{p_{2}}{\to} C_{2}$ is associated with the exact same drugs as $\overset{p_{1}}{\to} ⊤ \overset{p_{2}}{\to} ⊤$ , then we only keep the first one. Second, the exploration is stopped at nodes whose degree is greater than a parameter deg. Such nodes are usually called hubs [27], and expanding a path ending at a hub may generate a large number of new paths associated with the same drugs. Third, we consider minimal and maximal supports of features (denoted $s_{min}$ and $s_{max}$ ) as additional filters. The support of a feature consists in the number of drugs associated with a feature, i.e., neighbors of a same entity or rooting a path or a path pattern. Only features associated with more than $s_{min}$ drugs and less than $s_{max}$ drugs are output in the final matrix. Fourth, two blacklists avoid traversing noisy or unwanted edges. Edges are not traversed if their predicate is blacklisted (in $b_{predicates}$ ) or if the adjacent node instantiates (directly or indirectly) a blacklisted class (in $b_{exp-types}$ ). For example, we blacklist predicates of PROV-O that are used to describe provenance metadata. By blacklisting rdf:type in $b_{predicates}$ , we make sure the exploration is performed through entities and not classes of PGxLOD. Relations between classes are only considered when generalizing paths. Aside from noisy features and combinatorial explosion, we use these blacklists to prevent considering features that “obviously” carry the signal we are trying to predict. For example, we blacklist all “side-effects” links from SIDER, which may directly link drugs in D with the side effect we aim at predicting. We also blacklist all classes from MeSH related to SCAR or DILI to avoid taking into account nodes instantiating them. A third blacklist ( $b_{gen-types}$ ) avoids generalizing nodes in paths by blacklisted classes. This is particularly useful to withdraw general classes (e.g., Drug) that increase the number of generated path patterns while not adding useful information.

Besides previous topological constraints, we also perform a domain-driven filtering configured by parameter m. Indeed, in our objective of explaining ADRs, experts highlighted that features that may be explanatory to them mention pathways, genes, GO, and MeSH terms. For this reason, we propose three post-processing atomic filters, only keeping neighbors, paths or path patterns containing at least a pathway ( $m = p$ ), a gene or a GO class ( $m = g$ ), or a MeSH class ( $m = m$ ). Such atomic filters can be combined to form disjunctive filters. For example, the $m = pg$ filter keeps neighbors, paths or path patterns containing at least a pathway or a gene or a GO class.

Table 4 summarizes the parameters that limit the number of features in the output matrix. Additional details about our method of feature construction are provided in [29].

Parameters used to limit the number of features

Each parameter is associated with a domain for its value

Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol