Reinforcement learning

Víctor Gallego, Roi Naveiro, Carlos Roca, David Ríos Insua, Nuria E. Campillo

Reinforcement learning (RL) [70], as opposed to supervised learning and similarly to unsupervised learning, does not require labelled data. Instead, an agent (driven by an RL model) takes actions sequentially and accumulates rewards from them. Its aim is to learn a policy that allows the agent to maximise its total expected reward. As an example, assume for a moment that we have access to a predictive model (or an experiment) that, given a molecule, can predict a chemical property of interest, for instance biological activity, and assign that molecule a score (the reward). The RL agent then consists of a generative model whose action is to generate a molecule (represented as a SMILES string, a graph or any other representation of interest). This molecule is evaluated through the predictive model, receives a reward, and that reward is passed back to the generator as a feedback signal. Iterating this loop many times yields a generator that learns to produce molecules with the desired chemical property (as measured by achieving a high reward). Figure 5 depicts a schematic view of this process.

Figure 5. A schematic view of RL.
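To make the loop concrete, the following minimal Python sketch couples a toy generator with a toy predictive model. The candidate SMILES strings, the scoring rule (which simply rewards molecules containing a benzene ring) and the weight-based feedback update are all hypothetical stand-ins chosen for brevity, not the method of any of the cited references.

```python
# Minimal sketch of the generate-score-feedback loop described above.
# All names and values here are illustrative assumptions.
import random

# Hypothetical candidate actions: each action emits one SMILES string.
CANDIDATES = ["CCO", "c1ccccc1", "CC(=O)O", "c1ccccc1O", "CCN"]

def score(smiles: str) -> float:
    """Toy predictive model: reward molecules with a benzene ring (assumption)."""
    return 1.0 if "c1ccccc1" in smiles else 0.0

# Toy generator: one sampling weight per candidate molecule.
weights = [1.0] * len(CANDIDATES)

for step in range(200):
    # The agent (generator) takes an action: it proposes a molecule.
    idx = random.choices(range(len(CANDIDATES)), weights=weights)[0]
    smiles = CANDIDATES[idx]
    # The predictive model evaluates the molecule and returns a reward.
    reward = score(smiles)
    # Feedback signal: reinforce candidates that obtained a high reward.
    weights[idx] += reward

probs = [w / sum(weights) for w in weights]
print({s: round(p, 2) for s, p in zip(CANDIDATES, probs)})
```

After a few hundred iterations the sampling weights concentrate on the aromatic candidates, which is precisely the feedback effect described in the loop above, albeit with a deliberately crude update rule.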

The predictive component can be a black-box model provided by chemical software, but it can also be built using the methods in section "Supervised learning". In that case, the generator is typically optimised with methods from one of two RL families: those based on Q-learning [71, 72], and policy-gradient methods [73, 74], which directly improve the agent's policy. In the previous example, the policy could be based on a generative model from section "Deep learning", such as a VAE or a GAN. The predictive model itself can also be refined during the RL loop, as in the actor-critic family of algorithms [75].
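To illustrate the policy-gradient route, the sketch below applies a REINFORCE-style update to a small categorical policy over the same hypothetical candidate molecules as before. The logits, learning rate and moving-average baseline are illustrative assumptions, not settings taken from the cited works.

```python
# REINFORCE-style policy-gradient sketch on a toy categorical policy.
# All candidates, rewards and hyperparameters are illustrative assumptions.
import numpy as np

CANDIDATES = ["CCO", "c1ccccc1", "CC(=O)O", "c1ccccc1O", "CCN"]

def score(smiles: str) -> float:
    """Toy reward: 1 for molecules containing a benzene ring (assumption)."""
    return 1.0 if "c1ccccc1" in smiles else 0.0

rng = np.random.default_rng(0)
theta = np.zeros(len(CANDIDATES))  # policy logits, one per candidate action
lr, baseline = 0.1, 0.0

for step in range(500):
    # Categorical policy: pi = softmax(theta).
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()

    # The agent samples an action (a molecule) and receives its reward.
    action = rng.choice(len(CANDIDATES), p=pi)
    reward = score(CANDIDATES[action])

    # Score-function gradient for a softmax policy:
    # d log pi(action) / d theta = one_hot(action) - pi.
    grad_log_pi = -pi
    grad_log_pi[action] += 1.0

    # A moving-average baseline reduces the variance of the update.
    baseline += 0.05 * (reward - baseline)

    # REINFORCE update: step along the policy gradient, weighted by the advantage.
    theta += lr * (reward - baseline) * grad_log_pi

pi = np.exp(theta - theta.max())
pi /= pi.sum()
print({s: round(float(p), 2) for s, p in zip(CANDIDATES, pi)})
```

The same loop structure carries over to realistic generators (e.g. a SMILES-emitting recurrent network or a VAE decoder): only the policy parameterisation and the reward model change, while the log-probability-weighted update remains the core of the policy-gradient approach.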
