In real-world applications such as lead optimization, it is often desirable to optimize several different properties at the same time. For example, we may want to optimize the selectivity of a drug while keeping its solubility within a specific range. Formally, in the multi-objective reinforcement learning setting, the environment returns a vector of rewards at each step $t$, with one reward for each objective, i.e. $\mathbf{r}_t = \left(r_t^{(1)}, r_t^{(2)}, \ldots, r_t^{(k)}\right)$, where $k$ is the number of objectives.
Multi-objective optimization can pursue different goals: finding a set of Pareto-optimal solutions, or finding one or several solutions that satisfy the preferences of a decision maker. Similar to the choice in Guimaraes et al.15, we adopted the latter in this paper. Specifically, we implemented a "scalarized" reward framework to realize multi-objective optimization: given a user-defined weight vector $\mathbf{w} = (w_1, w_2, \ldots, w_k)$, the scalarized reward is calculated as

$$ r_t = \sum_{i=1}^{k} w_i \, r_t^{(i)}. $$
The objective of the MDP is then to maximize the cumulative scalarized reward.
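As an illustration, the minimal sketch below computes a weighted-sum scalarized reward of this form in Python. The function name, the use of NumPy, and the specific reward values and weights are illustrative assumptions, not the implementation used in this work.

```python
import numpy as np

def scalarize_reward(reward_vector, weights):
    """Collapse a multi-objective reward vector into a single scalar reward
    via a user-defined weighted sum (illustrative sketch)."""
    reward_vector = np.asarray(reward_vector, dtype=float)
    weights = np.asarray(weights, dtype=float)
    assert reward_vector.shape == weights.shape, "one weight per objective"
    return float(np.dot(weights, reward_vector))

# Example with k = 2 objectives (e.g. selectivity and solubility scores),
# where the decision maker weights selectivity twice as heavily as solubility.
r_t = [0.8, 0.4]   # per-objective rewards returned by the environment at step t
w = [2/3, 1/3]     # user-defined preference weights
print(scalarize_reward(r_t, w))  # ~0.667
```

The scalar returned in this way can then be used directly as the per-step reward in a standard single-objective RL update.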