2.3. Algorithms for Bayesian Network Learning

Federico Delussu, Faisal Imran, Christian Mattia, Rosa Meo

In this section, we outline some of the algorithms used to learn causal models from observed data. Learning a Bayesian network occurs in two steps: structure learning and parameter learning. Suppose that learning a BN with DAG G and parameters Θ from a data set D with n observations is driven by the following decomposition:

P(G, Θ | D) = P(G | D) · P(Θ | G, D)
Structure learning is concerned with P(G|D): it aims to find the DAG G that captures the dependence structure between the variables of the data D. Parameter learning, in contrast, is focused on P(Θ|G,D) and consists of estimating the parameters Θ given G. Suppose that the parameters of the local distributions are independent; then, they can be learned in parallel for each node Xi as follows:

P(Θ | G, D) = ∏i P(ΘXi | Pai, D)
where Pai denotes the set of parent nodes of Xi (those connected to Xi by a directed edge incoming in Xi) and ΘXi denotes the set of parameters of the conditional distribution of Xi given its parents Pai in G. Learning the structure of a BN is an NP-hard problem and computationally challenging. With N nodes, there are N(N−1)/2 possible arcs, and the number of DAGs grows super-exponentially as N increases. Hence, only a small fraction of the possible alternative DAGs can be explored in a reasonable time. There are three main approaches to BN structure learning: score-based, constraint-based, and hybrid, each based on a different statistical criterion.
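To make the super-exponential growth concrete, the number of labeled DAGs on N nodes can be computed with Robinson's recurrence. The following short Python sketch (an illustration of ours, not part of the original protocol) shows that there are already 543 DAGs on 4 nodes and 29,281 on 5:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

# num_dags(1) → 1, num_dags(2) → 3, num_dags(3) → 25,
# num_dags(4) → 543, num_dags(5) → 29281
```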

The score-based approach is a general class of optimization techniques for learning the BN structure. Each candidate BN is assigned a network score based on its goodness of fit, and the algorithm tries to maximize this score. Examples of score-based algorithms include simulated annealing, greedy search [27], genetic algorithms [28], and hill climbing (HC) [19].

The constraint-based approach first identifies the pairs of nodes (Xi, Xj) that are connected by an arc, regardless of its orientation; these are the pairs that cannot be separated by any subset of the other nodes, which is tested heuristically with conditional independence tests. The algorithm then identifies the v-structures among all pairs of non-adjacent nodes Xi and Xl with a common neighbor Xj, using the separating sets found earlier, and sets the remaining arc directions using the rules from Chickering [20] to obtain a CPDAG (completed partially directed acyclic graph). Examples include Grow-Shrink [29] and Interleaved Incremental Association (Inter-IAMB) [30].
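As an illustration of the kind of independence test involved, the following Python sketch implements a G-test for two binary variables (the simplest case, with an empty conditioning set). The function name and the toy data are ours and are not part of the cited algorithms:

```python
import math
from collections import Counter

def g_test_binary(xs, ys):
    """G-test of independence for two binary variables (1 degree of freedom).

    Returns the G statistic and its p-value; for df = 1 the chi-square
    survival function equals erfc(sqrt(G / 2)).
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    mx, my = Counter(xs), Counter(ys)
    g = 2 * sum(c * math.log(c * n / (mx[x] * my[y]))
                for (x, y), c in joint.items())
    g = max(g, 0.0)                      # guard against tiny negative rounding
    return g, math.erfc(math.sqrt(g / 2))

# Toy usage: perfectly dependent variables give a tiny p-value,
# exactly independent ones give p = 1.
xs = [i % 2 for i in range(40)]
g_dep, p_dep = g_test_binary(xs, xs)
g_ind, p_ind = g_test_binary(xs, [(i // 2) % 2 for i in range(40)])
```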

Hybrid approaches combine the two: a constraint-based restrict phase reduces the candidate space of DAGs, and a score-based maximize phase searches for the optimal DAG within the restricted space; any combination of constraint-based and score-based algorithms can be used. Hybrid approaches include the Max-Min Hill Climbing algorithm (MMHC) [18], Restricted Maximization (RSMAX2) [31], and Hybrid HPC (H2PC) [32].

The hill climbing algorithm belongs to the class of greedy search algorithms. Hill climbing (HC) assigns a network score (goodness of fit) to each candidate BN and strives to maximize it, since a higher value means a better fit. HC starts from an initial DAG and then adds, reverses, and deletes arcs until the network score no longer improves [19]. The network score can be the Bayesian Information Criterion (BIC) [33] or the Akaike Information Criterion (AIC) [34], for both discrete and continuous data sets.
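The add/delete/reverse loop of HC can be sketched as follows. This is a minimal, self-contained Python illustration with a discrete BIC score, written by us under simplifying assumptions; it is not the R implementation referenced in this work:

```python
import itertools
import math
from collections import Counter

def bic_node(data, child, parents):
    """BIC term for one node given its parent set (discrete data)."""
    n = len(data)
    r = len({row[child] for row in data})                 # child cardinality
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    ll = sum(c * math.log(c / marg[pa]) for (pa, _), c in joint.items())
    return ll - 0.5 * math.log(n) * (r - 1) * len(marg)   # penalized fit

def is_acyclic(nodes, arcs):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indeg = {v: 0 for v in nodes}
    for a, b in arcs:
        indeg[b] += 1
    queue = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for a, b in arcs:
            if a == v:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == len(nodes)

def score(data, nodes, arcs):
    """Decomposable network score: sum of per-node BIC terms."""
    parents = {v: sorted(a for a, b in arcs if b == v) for v in nodes}
    return sum(bic_node(data, v, parents[v]) for v in nodes)

def hill_climb(data, nodes):
    """Greedy search: add/delete/reverse single arcs while the BIC improves."""
    arcs, best = set(), score(data, nodes, set())
    while True:
        candidates = []
        for a, b in itertools.permutations(nodes, 2):
            if (a, b) in arcs:
                candidates += [arcs - {(a, b)},                  # delete
                               (arcs - {(a, b)}) | {(b, a)}]     # reverse
            elif (b, a) not in arcs:
                candidates.append(arcs | {(a, b)})               # add
        moves = [(score(data, nodes, c), c) for c in candidates
                 if is_acyclic(nodes, c)]
        top, top_arcs = max(moves, key=lambda t: t[0])
        if top <= best:                                          # local maximum
            return arcs, best
        arcs, best = top_arcs, top
```

On a toy data set where A and B are perfectly dependent and C is independent of both, this sketch links A and B (in either direction, since BIC scores Markov-equivalent DAGs identically) and leaves C isolated.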

The restricted maximization algorithm belongs to the class of hybrid approaches. RSMAX2 achieves faster structure learning by restricting the search space and by combining constraint-based and score-based algorithms [31].

In this work, we introduce a brute force algorithm to cope with the computational complexity of a complete exploration of the search space of possible BN alternatives. We take advantage of the parallel computing technology provided by HPC4AI (Turin’s High-Performance Centre for Artificial Intelligence, https://hpc4ai.it/, accessed on 8 July 2021). The formalization and implementation of the brute force algorithm is one of the original contributions of this work. We split the search space for model selection into subspaces and assign each to an independent processor, which delivers the best BN of the corresponding subspace. Finally, these results are compared to choose the overall best model. Each candidate BN is assigned a network score (goodness of fit), and the brute force algorithm returns the BN with the maximum score, since a higher score means a better fit. We used a score derived from the Bayesian Information Criterion (BIC) as implemented in the R library [35]; this network score is suitable for both continuous and discrete data sets.

The idea of the brute force algorithm is to partition the space of all possible Bayesian networks and to allocate each partition to a different processor, such that the processors evaluate in parallel the BIC score of all of the networks in their partitions. Each network is represented as a vector: a binary configuration with as many bits as there are possible arcs in the network. Each bit indicates whether the corresponding arc is present or absent in the network.

The algorithm starts with an input data set D containing N variables. AllArcs is a (p × 2) matrix, with p = N(N−1)/2 being the number of possible (undirected) arcs. Each row in the matrix represents an arc (from–to): the first column contains the starting node, and the second contains the ending node. Each pair of nodes is identified by a matrix row index from 1 to p.
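A minimal sketch of building AllArcs, assuming nodes are labeled 1..N (our illustration; the actual implementation may differ):

```python
from itertools import combinations

def all_arcs(n_nodes):
    """All unordered node pairs (from, to): p = N(N-1)/2 rows of AllArcs."""
    return list(combinations(range(1, n_nodes + 1), 2))

pairs = all_arcs(4)   # p = 6 pairs for N = 4
```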

k1 < p is the number of arcs that each processor actively varies: the processor is free to set these arcs in any combination, while the remaining arcs are fixed. Different processors have different configurations of present/absent arcs in the remaining subset of p − k1 fixed arcs. FixedArcsPresence is a vector of length p − k1 containing the presence/absence information for the arcs that are fixed for that processor; these p − k1 arcs (or pairs of nodes) are prearranged. In total, we use 2^(p − k1) available processors. Each processor runs the brute force Algorithm 1, with FixedArcsPresence as an input argument. FixedArcsPresence is the ordered list of the presence/absence of each of the prearranged p − k1 arcs: each element corresponds to a different pair of nodes, with FixedArcsPresence[i] = 1 if the ith pair of nodes is considered by that processor to be connected and 0 otherwise.

The processors are executed in parallel, each with a different realization of FixedArcsPresence. For each total configuration of present/absent arcs, given by the fixed part and the variable part, the processor evaluates the BIC score of the corresponding Bayesian network, with the goal of finding the one with the maximum value. As for the determination of the arc directions, the algorithm establishes whether each arc is oriented according to the reference direction stored in AllArcs or the other way around, and it evaluates the BIC score for both arc directions. At the end, the maximum among the scores found by the processors is selected, together with the corresponding Bayesian network.
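The partition scheme above can be sketched in Python as follows. This is our illustration of the enumeration structure only: `toy_score` is a hypothetical stand-in for the BIC score, the serial loop over `partitions` stands in for the 2^(p − k1) parallel processors, and all names are ours:

```python
from itertools import combinations, product

def is_acyclic(nodes, arcs):
    """DFS cycle check on the directed graph given by arcs."""
    children = {v: [b for a, b in arcs if a == v] for v in nodes}
    state = {v: 0 for v in nodes}        # 0 = unvisited, 1 = on stack, 2 = done
    def visit(v):
        if state[v] == 1:
            return False                 # back edge: cycle found
        if state[v] == 2:
            return True
        state[v] = 1
        ok = all(visit(c) for c in children[v])
        state[v] = 2
        return ok
    return all(visit(v) for v in nodes)

def best_in_partition(nodes, fixed_presence, k1, score):
    """Evaluate every network in one processor's partition.

    The first p - k1 pairs of AllArcs have their presence fixed by
    fixed_presence; the last k1 pairs vary freely. Every present pair
    is tried in both orientations; the best acyclic DAG is returned.
    """
    arcs_list = list(combinations(nodes, 2))          # AllArcs as node pairs
    best = (float('-inf'), None)
    for free in product([0, 1], repeat=k1):
        presence = list(fixed_presence) + list(free)
        present = [pair for pair, bit in zip(arcs_list, presence) if bit]
        for dirs in product([0, 1], repeat=len(present)):
            dag = [pair if d == 0 else (pair[1], pair[0])
                   for pair, d in zip(present, dirs)]
            if is_acyclic(nodes, dag):
                best = max(best, (score(dag), tuple(dag)))
    return best

# Toy run: N = 3 nodes, p = 3 pairs, k1 = 2 free arcs,
# hence 2^(p - k1) = 2 partitions (one per processor).
nodes = ['X1', 'X2', 'X3']
toy_score = len                                       # stand-in for BIC
partitions = list(product([0, 1], repeat=1))          # all FixedArcsPresence
results = [best_in_partition(nodes, f, 2, toy_score) for f in partitions]
best_score, best_dag = max(results)                   # overall best model
```

With the arc-counting stand-in score, the overall winner is the densest acyclic DAG, i.e., a fully connected DAG on the three nodes.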

However, some care should be taken when Bayesian networks are learnt from data. It should be kept in mind that networks learned from observational data may establish some relationships that are hard to explain based on our prior knowledge of the domain. Some relationships may reveal aspects of phenomena that we did not expect, some may be explained by introducing exogenous variables acting as confounders, and the influence of a variable Xi on another variable Xj may be mediated by an unobserved variable Xl that is included neither in the model nor in the available data. Moreover, we should take into account that the model, due to a lack of flexibility, could be unable to accurately describe the phenomenon. For example, the assumption of Gaussianity may be inadequate for our data, and adapting the variables to the multinomial assumption through discretization may lead to a loss of mutual information.

Therefore, we cannot expect to find a rational justification for each connection, but we can apply critical thinking to extract helpful insights based on what the data supports.
