Graph-based feature construction

EB Emmanuel Bresso
PM Pierre Monnin
CB Cédric Bousquet
FC François-Elie Calvier
NN Ndeye-Coumba Ndiaye
NP Nadine Petitpain
MS Malika Smaïl-Tabbone
AC Adrien Coulet
request Request a Protocol
ask Ask a question
Favorite

Graph canonicalization In knowledge graphs, several nodes may co-exist, while representing the same entity. For example, a unique drug in PGxLOD can be represented by two nodes: one from PharmGKB and another from DrugBank. Each of these two nodes have their own connections to other nodes in the knowledge graph. Therefore, the union of their edges constitutes all the available knowledge about the drug they both represent. As they represent the same drug, they are connected through an owl:sameAs edge. In order to avoid traversing such edges, we canonicalize the knowledge graph, that is to say, nodes linked by owl:sameAs edges are grouped into a unique node as a pre-processing step, before building graph-based features. Such a procedure corresponds, in graph theory, to the contraction of owl:sameAs edges.

Paths, path patterns, and neighbors as drug features

From an arbitrary set of drugs D (such as DILI DILI) and a canonicalized knowledge graph (such as PGxLOD), a set of graph-based features is built on D. We distinguish three kinds of features: paths, path patterns and neighbor nodes. First, we build paths rooted by drugs from D. The neighborhood of each drug dD is explored in the graph with a max distance of k, generating sequences of predicates and nodes of max length k. Accordingly, if a path dp1n1p2n2 is found in the knowledge graph (with k=2), the drug d is associated with the feature p1n1p2n2 in the output matrix.

Second, we build path patterns that generalize paths by considering ontology classes instantiated by nodes. The aim of path patterns is to offer more general descriptions, which have more chances to be shared by several entities. For example, if n1 instantiates C1 and n2 instantiates C2, we add the following path patterns: p1C1p2n2,p1n1p2C2,p1C1p2C2,p1p2n2,p1n1p2, and p1C1p2,p1p2C2, andp1p2. It is noteworthy that, to allow a high level of generalization, we always consider the top-level class in the generalization procedure. To leverage hierarchies organizing ontology classes, a node n is only generalized by ontology classes at a distance of at most t from itself, following instantiation and subsumption edges.

Third, we list neighboring nodes, i.e., any node that can be reached from dD within a distance of max k. We do not keep track of the distance between d and its neighbors: if a node n can be reached from a drug d1 after 2 hops and from a drug d2 after 3 hops, these two drugs will be associated with the neighbor n in the output matrix. Interestingly, a neighbor n can be represented by the general path pattern (){0,k-1}n.

Features can potentially be noisy and numerous, leading to a combinatorial explosion of their number. That is why, we add several constraints to only keep the most interesting features. First, we only keep the most specific paths and path patterns, considering that a node is more specific than the classes it instantiates and a class is more specific than its superclasses. A path or path pattern P1 is more specific than another path or path pattern P2 if each node/class in P1 is more specific than the node/class at the same position in P2. Thus, if p1n1p2n2 is associated with the exact same drugs as p1C1p2C2, then the path pattern is removed and only the path constitutes a feature. Similarly, if p1C1p2C2 is associated with the exact same drugs as p1p2, then we only keep the first one. Second, the exploration is stopped at nodes whose degree is greater than a parameter deg. Such nodes are usually called hubs [27], and expanding a path ending at a hub may generate a large number of new paths associated with the same drugs. Third, we consider minimal and maximal supports of features (denoted smin and smax) as additional filters. The support of a feature consists in the number of drugs associated with a feature, i.e., neighbors of a same entity or rooting a path or a path pattern. Only features associated with more than smin drugs and less than smax drugs are output in the final matrix. Fourth, two blacklists avoid traversing noisy or unwanted edges. Edges are not traversed if their predicate is blacklisted (in bpredicates) or if the adjacent node instantiates (directly or indirectly) a blacklisted class (in bexp-types). For example, we blacklist predicates of PROV-O that are used to describe provenance metadata. By blacklisting rdf:type in bpredicates, we make sure the exploration is performed through entities and not classes of PGxLOD. Relations between classes are only considered when generalizing paths. Aside from noisy features and combinatorial explosion, we use these blacklists to prevent considering features that “obviously” carry the signal we are trying to predict. For example, we blacklist all “side-effects” links from SIDER, which may directly link drugs in D with the side effect we aim at predicting. We also blacklist all classes from MeSH related to SCAR or DILI to avoid taking into account nodes instantiating them. A third blacklist (bgen-types) avoids generalizing nodes in paths by blacklisted classes. This is particularly useful to withdraw general classes (e.g., Drug) that increase the number of generated path patterns while not adding useful information.

Besides previous topological constraints, we also perform a domain-driven filtering configured by parameter m. Indeed, in our objective of explaining ADRs, experts highlighted that features that may be explanatory to them mention pathways, genes, GO, and MeSH terms. For this reason, we propose three post-processing atomic filters, only keeping neighbors, paths or path patterns containing at least a pathway (m=p), a gene or a GO class (m=g), or a MeSH class (m=m). Such atomic filters can be combined to form disjunctive filters. For example, the m=pg filter keeps neighbors, paths or path patterns containing at least a pathway or a gene or a GO class.

Table 4 summarizes the parameters that limit the number of features in the output matrix. Additional details about our method of feature construction are provided in [29].

Parameters used to limit the number of features

Each parameter is associated with a domain for its value

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

post Post a Question
0 Q&A