2.1. Active Inference

Ozan Çatal, Samuel Wauthier, Cedric De Boom, Tim Verbelen, Bart Dhoedt

Any agent, either artificial or natural, can only perceive its surrounding environment (characterized by its hidden state h) through sensory observations, and can only change that environment through actions. The implicit Markov blanket (Figure 1) separates the external (environmental) states from the internal states of the agent, which entail a generative model of the external states. An agent's action at time step t will change the environment's hidden state ht according to some generative process R(o~, a~, h~) over sequences of observations o~, actions a~ and hidden states h~, and will provide a new observation ot to the agent. We will use tildes to designate sequences in the remainder of this paper. However, as the agent has no direct access to the hidden states of the environment, it can only develop its own internal belief states st that explain the perceived observations as well as possible, by means of a generative model.

Figure 1. An agent's Markov blanket. An agent has no direct access to the environment's hidden states h~. It can only perceive the consequences of its actions at by means of observations ot, and develop its own internal belief states st. Note that st does not necessarily have to match the environment's hidden state ht. Tildes indicate sequences of observations, actions or hidden states.

More concretely, the agent's world model can be formalized as a partially observable Markov decision process (POMDP), with the probability distribution P(o~, s~, a~, π) specifying the joint probability of the agent's observations, belief states, actions and policies. In this formalism, a policy is nothing more than a sequence of actions at:T up until some time horizon T. Without loss of generality, we assume the world model is Markovian, so that the agent's state st at time step t is only influenced by the previous state st−1 and action at−1. Graphically, this model can be visualized through time according to Figure 2A. Formally, it can be factorized as follows:
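A factorization consistent with these Markov assumptions, and presumably what Equation (1) expresses, is:

P(\tilde{o}, \tilde{s}, \tilde{a}, \pi) = P(\pi)\, P(s_0) \prod_{t=1}^{T} P(o_t | s_t)\, P(s_t | s_{t-1}, a_{t-1})\, P(a_{t-1} | \pi)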

Figure 2. (A) The POMDP depicting an agent's model of the world up until the current time step t. The current state st determines the current observation ot and is only influenced by the previous state st−1 and action at−1. Both actions and observations are assumed to be observed, indicated by a gray coloring, whereas states need to be inferred. (B) The agent's model of the world from time step t onward. As with the model for the past, we assume that at each time step the observation is only influenced by the corresponding state. Note that for the future we assume the agent has control over which states to visit through its actions, which are determined by a policy π.

Due to the separation between the agent and the environment through the Markov blanket, the agent is only able to infer the effects of its actions on the world through observations. This entails that the agent can only update its beliefs over world states through Bayesian inference on possible belief state values, conditioned on observed actions and observations. In fact, the agent tries to infer its belief states s~ through the posterior belief P(s~|o~,a~). The actual posterior in this form, derived directly from Bayes' rule, is in general intractable to calculate directly from the given joint model in Equation (1). To avoid this, the agent resorts to variational inference (Beal, 2003) and approximates the true posterior by some approximate posterior distribution Q(s~|o~,a~), which is in a form that is tractable to the agent. Similar to the model posited in Equation (1), the approximate posterior distribution can be decomposed as:
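A decomposition consistent with the Markov assumptions above, and presumably what Equation (2) expresses, is:

Q(\tilde{s} | \tilde{o}, \tilde{a}) = \prod_{t=1}^{T} Q(s_t | s_{t-1}, a_{t-1}, o_t)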

In active inference, the agent is believed to be acting according to the free energy principle, which states that every agent's goal is to minimize its variational free energy. In view of our generative model, the variational free energy F is formalized as (Friston, 2013):
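A standard form with the three equalities discussed below, presumably Equation (3), is:

\begin{aligned}
F &= \mathbb{E}_{Q(\tilde{s}|\tilde{o},\tilde{a})}\big[\log Q(\tilde{s}|\tilde{o},\tilde{a}) - \log P(\tilde{o},\tilde{s}|\tilde{a})\big] \\
  &= D_{KL}\big[Q(\tilde{s}|\tilde{o},\tilde{a}) \,\|\, P(\tilde{s}|\tilde{o},\tilde{a})\big] - \log P(\tilde{o}|\tilde{a}) \\
  &= D_{KL}\big[Q(\tilde{s}|\tilde{o},\tilde{a}) \,\|\, P(\tilde{s}|\tilde{a})\big] - \mathbb{E}_{Q(\tilde{s}|\tilde{o},\tilde{a})}\big[\log P(\tilde{o}|\tilde{s},\tilde{a})\big]
\end{aligned}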

When rewriting the variational free energy using the second equality, it decomposes into two terms: the KL-divergence between the approximate and true posterior distribution, and the (negative) log evidence. This means that minimizing free energy is equivalent to maximizing the log evidence, while making the posterior approximation as good as possible. One can also see that the variational free energy is actually the negative evidence lower bound, which is maximized in variational inference (Bishop, 2006). Variational free energy can also be written as the third equality, which comprises a complexity and accuracy term. This states that the model should minimize the complexity of accurate explanations of the observations (Schwartenbeck et al., 2019).

Note the omission of π in Equation (3), as we assume the agent has a (perfect) proprioceptive feedback channel, i.e., all executed actions are observed. For time steps in the future this is not the case, and the agent will have to make inferences about which policy (and actions) to select.

The crucial aspect of active inference is that agents not only try to minimize their free energy for past observations, but also aim to minimize their free energy in the future. For future time steps, the actions are determined by a chosen policy π, and both actions and observations are no longer observed, but become random variables that have to be inferred, as shown in Figure 2B. In order to minimize the free energy in the future, the agent not only needs to form posterior beliefs over its current state, but also form beliefs over future states and observations when following certain policies. This allows the agent to evaluate the so-called expected free energy G for some policy π, and form a belief over possible policies P(π) (Schwartenbeck et al., 2019) as:
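A standard form for this belief over policies, presumably Equation (4), is a softmax over the negative expected free energy:

P(\pi) = \sigma\big(-\gamma\, G(\pi)\big)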

This means the agent picks or samples policies according to a softmax function σ with temperature γ over the (negative) total expected free energy. Policies that exhibit a low total expected free energy will have a higher likelihood of being sampled. This expresses a crucial aspect of active inference, namely that the only self-consistent prior belief over policies P(π) is to believe that the agent will follow policies that minimize the expected free energy (Friston K. et al., 2015).
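As a concrete illustration of this sampling rule, a minimal Python/numpy sketch (the function and variable names below are illustrative and not taken from the paper) could look as follows:

    import numpy as np

    def policy_posterior(G, gamma=1.0):
        # Belief over policies: softmax of the negative expected free energy,
        # so policies with low G(pi) receive a higher probability.
        logits = -gamma * np.asarray(G, dtype=float)
        logits -= logits.max()              # subtract max for numerical stability
        p = np.exp(logits)
        return p / p.sum()

    # Example with three candidate policies and illustrative G(pi) values
    G = [4.0, 2.5, 3.0]
    P_pi = policy_posterior(G, gamma=2.0)
    policy_index = np.random.choice(len(G), p=P_pi)   # sample a policy to execute

Here γ acts as a precision parameter: higher values concentrate probability mass on the policy with the lowest expected free energy, while γ → 0 approaches uniform random policy selection.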

The expected free energy is defined as the sum of the expected free energy of a policy over all time steps that we look ahead into the future:
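In standard notation, presumably Equation (5):

G(\pi) = \sum_{\tau = t+1}^{T} G(\pi, \tau)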

with
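the per-timestep expected free energy. A standard form, consistent with the risk and ambiguity terms discussed below and presumably Equation (6), is:

\begin{aligned}
G(\pi, \tau) &= \mathbb{E}_{Q(o_\tau, s_\tau|\pi)}\big[\log Q(s_\tau|\pi) - \log P(o_\tau, s_\tau|\pi)\big] \\
             &= D_{KL}\big[Q(s_\tau|\pi) \,\|\, P(s_\tau)\big] + \mathbb{E}_{Q(s_\tau|\pi)}\big[H\big(P(o_\tau|s_\tau)\big)\big]
\end{aligned}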

Note that we set P(sτ|π) = P(sτ), which reflects that the agent has a prior preference over which states to visit (Friston K. et al., 2015). One can interpret this as the agent having prior beliefs over the states it will visit, independent of any policy, which drive policy selection toward these attractor states. An obvious example of a preferred state is maintaining a body temperature of 37°C (Van De Laar and De Vries, 2019). These preferred states essentially define the agent, and can be endowed upon the agent either by nature through evolution, in the case of natural agents, or by humans, in the case of artificial agents.

Two important terms emerge from Equation (6): the KL-divergence between the approximate posterior distribution over future states and their corresponding prior belief, called the risk, and the expected entropy over future observations, also known as the ambiguity (Friston et al., 2016). These terms illustrate the way an active inference agent will act. On the one hand, the agent will try to match the states it visits in the future with its prior belief over future states, hence realizing preferences or exhibiting goal-directed behavior (Schwartenbeck et al., 2019). On the other hand, the agent will try to reduce the conditional entropy of future observations, i.e., avoid ambiguous states.
