We implemented REINVENT 3.2,37 a reinforcement learning framework12,13 that optimizes a SMILES-based recurrent neural network (RNN) agent to generate novel molecules with targeted properties. The REINVENT algorithm consists of four parts: a prior network, an agent network, a reinforcement learning loop, and scoring functions. Initially, a sequence-generating RNN is trained on the SMILES representations of over a million molecules from the ChEMBL database38 to generate SMILES strings. At the start of each reinforcement learning run, a second SMILES-generating RNN, the agent network, is initialized from the trained prior network (unless stated otherwise). In each iteration of the reinforcement learning loop, molecules are sampled from the agent network as SMILES strings and scored according to user-defined scoring functions. The agent network is then optimized to generate molecules with the desired properties according to the augmented likelihood, which combines the scores with the prior likelihood (more details are provided in Section S2†). This keeps the agent network from deviating too far from the prior network while it learns to generate new molecules with higher scores. The property targets are encoded as scoring functions that, in general, take values on the continuous interval [0, 1]; the agent network is rewarded when the total score is close to one and penalized when it is close to zero. Functionally, we take the geometric mean of the individual scores so that the total score remains between 0 and 1.
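For orientation, the policy update at the heart of this loop can be sketched as follows. This is a minimal, schematic version of a REINVENT-style step, assuming `agent` and `prior` expose `sample` and `likelihood` methods over SMILES batches; these method names and the value of sigma are illustrative assumptions, not the actual REINVENT 3.2 API:

```python
import torch

def reinvent_step(agent, prior, optimizer, scoring_fn, sigma=120.0, batch_size=128):
    """One reinforcement learning iteration (schematic sketch)."""
    smiles, agent_ll = agent.sample(batch_size)      # sample molecules as SMILES + log-likelihoods
    with torch.no_grad():
        prior_ll = prior.likelihood(smiles)          # log-likelihood under the fixed prior
    scores = torch.tensor([scoring_fn(s) for s in smiles])
    augmented_ll = prior_ll + sigma * scores         # augmented likelihood: prior + scaled score
    loss = ((augmented_ll - agent_ll) ** 2).mean()   # pull agent toward high-scoring, prior-like SMILES
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Minimizing this squared difference rewards molecules that the scoring functions rate highly, while the prior-likelihood term anchors the agent to chemically reasonable SMILES.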
To enable REINVENT to learn the complex structure–property relationships needed to find ideal SF and TTA molecules, we used a two-step curriculum learning strategy,26 shown in Fig. 1, which trains the agent network first on easier objectives and only later on more challenging ones. In the first step, we restricted the accessible chemical space of the agent network to small, rigid organic molecules, since such molecules are more likely to be suitable for SF and TTA materials. At this step, the total scoring function is assembled from the following cheminformatics-calculated properties, each yielding an individual subscore of zero or one:
(1) Molecular weight <400 Dalton
(2) No consecutive rotatable bonds exist
(3) Synthesizability (SCScore39 ≤4)
(4) No matches with a list of forbidden substructures
In this first stage, the total scoring function is discrete: the total score is one if all of the criteria are satisfied and zero if any are not.
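A minimal sketch of this discrete step-1 score using RDKit follows; the SMARTS pattern and the `scscore` callable are placeholders (the actual forbidden-substructure list is in the GitHub repository and Section S2†):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# RDKit's strict rotatable-bond SMARTS
ROTATABLE = Chem.MolFromSmarts("[!$(*#*)&!D1]-&!@[!$(*#*)&!D1]")
# Placeholder forbidden substructures; the real list is in the repository / Section S2
FORBIDDEN = [Chem.MolFromSmarts(s) for s in ("[c]1[c][c]1",)]

def step1_score(smiles, scscore):
    """Return 1.0 only if all four criteria pass; `scscore` is an assumed callable."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return 0.0
    if Descriptors.MolWt(mol) >= 400:                       # (1) MW < 400 Da
        return 0.0
    bonds = [set(m) for m in mol.GetSubstructMatches(ROTATABLE)]
    if any(a & b for i, a in enumerate(bonds) for b in bonds[i + 1:]):
        return 0.0                                          # (2) two rotatable bonds share an atom
    if scscore(smiles) > 4.0:                               # (3) SCScore <= 4
        return 0.0
    if any(mol.HasSubstructMatch(p) for p in FORBIDDEN):    # (4) forbidden substructures
        return 0.0
    return 1.0
```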
It has been shown that goal-directed generation methods are prone to generating unsynthesizable molecules with high scores.40,41 Here, we used the SCScore model developed by Coley et al.39 to assign a synthetic complexity score between 1 and 5 to molecules sampled from the agent network, where a higher SCScore corresponds to higher synthetic complexity. Although any synthetic accessibility score is influenced by the biases present in published chemical reactions, the SCScore model is trained on ca. 12 million reactions from the Reaxys database under the assumption that the products of published reactions should be synthetically more complex than the corresponding reactants. We set the highest allowed SCScore for the first step by comparing the SCScores of molecules sampled from REINVENT with those of the calibration set for the time-dependent DFT (TD-DFT) benchmark in Fig. S1.†
Without a synthesizability constraint, a large fraction of generated molecules saturates at the highest (most complex) SCScore (Fig. S2†). Since we are only interested in small molecules (molecular weight < 400 Dalton), and the generated molecules must eventually be solved by retrosynthetic planning tools within three steps (Section 2.2), the SCScore serves as a first-pass screening criterion before running excited-state calculations.
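As a concrete example, such a pre-screening gate might look as follows, using the standalone model interface from the open-source scscore repository (github.com/connorcoley/scscore); the weight-file path shown is illustrative:

```python
from scscore.standalone_model_numpy import SCScorer  # from the connorcoley/scscore repository

model = SCScorer()
model.restore("models/full_reaxys_model_1024bool/model.ckpt-10654.as_numpy.json.gz")  # illustrative path

def passes_scscore_gate(smiles, threshold=4.0):
    """Cheap pre-filter: skip TD-DFT for molecules deemed too complex to synthesize."""
    _, score = model.get_score_from_smi(smiles)  # SCScore on the 1-5 scale
    return score <= threshold
```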
In addition, we added a list of forbidden substructures (partially motivated by substructures generated in early implementations of the workflow) that are known to be unstable (e.g., unrealistic aromatic or anti-aromatic substructures). The agent network is penalized whenever it generates a structure containing any substructure on this list; the full list is provided in our implementation on the GitHub repository and in Section S2.† In the exploitative case, the agent network is also penalized if it fails to generate molecules containing the specified substructure. After learning in the first step, the resulting agent network becomes the starting agent network for the second step.
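In the exploitative case, the extra subscore reduces to a SMARTS match; a one-function sketch, where the required core is a hypothetical placeholder:

```python
from rdkit import Chem

# Hypothetical required core for an exploitative run (naphthalene as a placeholder)
REQUIRED = Chem.MolFromSmarts("c1ccc2ccccc2c1")

def required_substructure_subscore(smiles):
    """Return 1.0 only if the molecule contains the specified substructure."""
    mol = Chem.MolFromSmiles(smiles)
    return 1.0 if mol is not None and mol.HasSubstructMatch(REQUIRED) else 0.0
```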
The total scoring function in the second step of the curriculum learning strategy includes excited-state energy gaps in addition to the cheminformatics criteria shown above. The scoring functions for the excited-state energy gaps are based on evaluating ΔE(S1/T1) = E(S1) − 2E(T1) and ΔE(T2/T1) = E(T2) − 2E(T1), which consider the exothermicity of the SF and TTA pathways (eqn (1)), the prevention of energy loss through T1-to-T2 upconversion, and the uncertainty of the TD-DFT calculations (see Fig. S2† for the continuous scoring transformation). The total score S(X) for a given SMILES string X is the geometric mean of the excited-state energy gap scores when all other criteria are satisfied:

S(X) = (pS1T1 × pT2T1)^1/2 if all qi = 1, and S(X) = 0 otherwise,
where pS1T1 and pT2T1 are the scores for the S1/T1 and T2/T1 excited-state energy gaps defined above, and the qi are the cheminformatics subscores. This gated form prevents the total score from being diluted by the cheminformatics criteria and saves computational cost by running the TD-DFT calculations, and hence evaluating the geometric mean of the gap scores, only when all cheminformatics criteria are satisfied.
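A minimal sketch of this gated total score, with a sigmoid-style transform standing in for the continuous score shapes of Fig. S2† (the target and steepness parameters are illustrative assumptions, not the values used in this work):

```python
import math

def gap_score(delta_e, target=0.0, steepness=5.0):
    """Illustrative transform mapping an energy gap (eV) onto [0, 1]."""
    return 1.0 / (1.0 + math.exp(-steepness * (delta_e - target)))

def total_score(delta_e_s1t1, delta_e_t2t1, cheminfo_subscores):
    """S(X): geometric mean of the gap scores, gated on the cheminformatics criteria."""
    if any(q == 0 for q in cheminfo_subscores):
        return 0.0  # fail fast; in practice the gaps are only computed after this gate passes
    p_s1t1 = gap_score(delta_e_s1t1)
    p_t2t1 = gap_score(delta_e_t2t1)
    return math.sqrt(p_s1t1 * p_t2t1)
```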
To ensure the diversity of the potential SF/TTA molecules generated in the second step, we used the carbon skeleton diversity filter from the work of Blaschke et al.42 The filter tracks, in a bucket per skeleton, the number of sampled molecules sharing the same carbon skeleton derived from the Bemis–Murcko scaffold,43 and starts penalizing molecules with a given carbon skeleton once that skeleton's bucket is full. Inception13 was enabled during the reinforcement learning process; this experience-replay mechanism can accelerate learning in the early phase by replaying previous top-scoring sampled molecules to the agent network.
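A compact sketch of such a bucketed filter, using RDKit's generic Murcko scaffold as a stand-in for the carbon skeleton (the bucket size is an illustrative assumption):

```python
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

buckets = defaultdict(int)

def diversity_subscore(smiles, bucket_size=25):
    """Return 0.0 once the bucket for a molecule's carbon skeleton is full."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return 0.0
    scaffold = MurckoScaffold.GetScaffoldForMol(mol)
    skeleton = MurckoScaffold.MakeScaffoldGeneric(scaffold)  # all atoms -> C, all bonds -> single
    key = Chem.MolToSmiles(skeleton)
    buckets[key] += 1
    return 1.0 if buckets[key] <= bucket_size else 0.0
```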
Further implementation details of the workflow, including hyperparameters tailored to this materials design project, are provided in the ESI,† Section S2.