Reinforcement learning

Víctor Gallego, Roi Naveiro, Carlos Roca, David Ríos Insua, Nuria E. Campillo

Reinforcement learning (RL) [70], as opposed to supervised learning and similarly to unsupervised learning, does not require labelled data. Instead, an agent (driven by an RL model) takes actions sequentially and accumulates rewards from them. Its aim is to learn a policy that allows the agent to maximise its total expected reward. As an example, assume for a moment that we have access to a predictive model (or an experiment) that, given a molecule, can predict a chemical property of interest, for instance biological activity, and assign that molecule a score (the reward). The RL agent then consists of a generative model whose action is to generate a molecule (represented as a SMILES string, a graph or any other representation of interest). This molecule is evaluated through the predictive model, receives a reward, and that reward is passed back to the generator as a feedback signal. Iterating this loop many times yields a generator that learns to produce molecules with the desired chemical property (as measured by achieving a high reward). Figure 5 depicts a schematic view of this process.

Figure 5. A schematic view of RL.
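To make the loop concrete, the following minimal Python sketch couples a toy generator with a toy predictive model. The candidate SMILES strings, the scoring rule (which simply rewards molecules containing a benzene ring) and the weight-based feedback update are all hypothetical stand-ins chosen for brevity, not the method of any of the cited references.

```python
# Minimal sketch of the generate-score-feedback loop described above.
# All names and values here are illustrative assumptions.
import random

# Hypothetical candidate actions: each action emits one SMILES string.
CANDIDATES = ["CCO", "c1ccccc1", "CC(=O)O", "c1ccccc1O", "CCN"]

def score(smiles: str) -> float:
    """Toy predictive model: reward molecules with a benzene ring (assumption)."""
    return 1.0 if "c1ccccc1" in smiles else 0.0

# Toy generator: one sampling weight per candidate molecule.
weights = [1.0] * len(CANDIDATES)

for step in range(200):
    # The agent (generator) takes an action: it proposes a molecule.
    idx = random.choices(range(len(CANDIDATES)), weights=weights)[0]
    smiles = CANDIDATES[idx]
    # The predictive model evaluates the molecule and returns a reward.
    reward = score(smiles)
    # Feedback signal: reinforce candidates that obtained a high reward.
    weights[idx] += reward

probs = [w / sum(weights) for w in weights]
print({s: round(p, 2) for s, p in zip(CANDIDATES, probs)})
```

After a few hundred iterations the sampling weights concentrate on the aromatic candidates, which is precisely the feedback effect described in the loop above, albeit with a deliberately crude update rule.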

The predictive component can be a black-box model provided by chemical software, but it can also be built using the methods in section "Supervised learning". In that case, the generator is typically optimised with methods from one of two RL families: those based on Q-learning [71, 72], and policy-gradient methods [73, 74], which directly improve the agent's policy. In the previous example, the policy could be based on a generative model from section "Deep learning", such as a VAE or a GAN. The predictive model itself can also be refined during the RL loop, as in the actor-critic family of algorithms [75].
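To illustrate the policy-gradient route, the sketch below applies a REINFORCE-style update to a small categorical policy over the same hypothetical candidate molecules as before. The logits, learning rate and moving-average baseline are illustrative assumptions, not settings taken from the cited works.

```python
# REINFORCE-style policy-gradient sketch on a toy categorical policy.
# All candidates, rewards and hyperparameters are illustrative assumptions.
import numpy as np

CANDIDATES = ["CCO", "c1ccccc1", "CC(=O)O", "c1ccccc1O", "CCN"]

def score(smiles: str) -> float:
    """Toy reward: 1 for molecules containing a benzene ring (assumption)."""
    return 1.0 if "c1ccccc1" in smiles else 0.0

rng = np.random.default_rng(0)
theta = np.zeros(len(CANDIDATES))  # policy logits, one per candidate action
lr, baseline = 0.1, 0.0

for step in range(500):
    # Categorical policy: pi = softmax(theta).
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()

    # The agent samples an action (a molecule) and receives its reward.
    action = rng.choice(len(CANDIDATES), p=pi)
    reward = score(CANDIDATES[action])

    # Score-function gradient for a softmax policy:
    # d log pi(action) / d theta = one_hot(action) - pi.
    grad_log_pi = -pi
    grad_log_pi[action] += 1.0

    # A moving-average baseline reduces the variance of the update.
    baseline += 0.05 * (reward - baseline)

    # REINFORCE update: step along the policy gradient, weighted by the advantage.
    theta += lr * (reward - baseline) * grad_log_pi

pi = np.exp(theta - theta.max())
pi /= pi.sum()
print({s: round(float(p), 2) for s, p in zip(CANDIDATES, pi)})
```

The same loop structure carries over to realistic generators (e.g. a SMILES-emitting recurrent network or a VAE decoder): only the policy parameterisation and the reward model change, while the log-probability-weighted update remains the core of the policy-gradient approach.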
