
# Training the AIMNet model

## Procedure

Network sizes (depth and number of parameters) were determined through hyperparameter searches conducted as multiple separate experiments and are listed in table S4. The exponential linear unit (ELU) (39) activation function with α = 1.0 was used in all AIMNet model layers. Model training was done on four GPUs in parallel. Each batch of the training data contained an average of 380 molecules. The gradients from four batches were averaged, making an effective batch size of 1520 molecules of different sizes. The AMSGrad (40) optimization method was used to update the weights during training. An initial learning rate of 10−3 was dynamically annealed with the "reduce on plateau" schedule: the learning rate was multiplied by 0.9 once the model failed to improve its validation set predictions within six epochs.
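The "reduce on plateau" schedule described above can be sketched in a few lines. This is a minimal stand-alone version with the article's hyperparameters (factor 0.9, patience of six epochs, initial learning rate 10−3); the class and attribute names are illustrative, not from the AIMNet code.

```python
class ReduceOnPlateau:
    """Minimal 'reduce on plateau' schedule: multiply the learning rate
    by `factor` when the validation loss has not improved for `patience`
    consecutive epochs (0.9 and 6 in the article)."""

    def __init__(self, lr=1e-3, factor=0.9, patience=6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:      # improvement: reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:                         # stall: count epochs without progress
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


sched = ReduceOnPlateau()
# Validation loss improves for two epochs, then stalls for six:
for loss in [1.0, 0.9] + [0.9] * 6:
    lr = sched.step(loss)
print(lr)  # after six stalled epochs, the LR drops to 1e-3 * 0.9 = 9e-4
```

Deep-learning frameworks provide equivalent built-in schedulers (e.g., PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau`), which would typically be used in practice.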

The cost function for multitarget, multipass training was defined as a weighted mean squared error loss

$$L_{\mathrm{tot}} = \frac{1}{N} \sum_t^T \sum_p \sum_i^N w_t w_p \left( y_{tpi} - \hat{y}_{tpi} \right)^2 \tag{1.1}$$

where indices $t$, $p$, and $i$ correspond to pass number, target property, and sample, respectively; $w_t$ and $w_p$ are the weights for the iterative pass and target property, respectively; $y$ and $\hat{y}$ are target and predicted properties, respectively; and $N$ is the number of samples. In the case of per-molecule target properties (energies), the $y$ values in the cost function were divided by the number of atoms in the molecule, so errors are per atom for all target properties. We used equal weights for every pass, i.e., $w_t = 1/T$, where $T$ is the total number of passes. Values for $w_p$ were selected in such a way that all target properties give approximately equal contributions to the combined cost function. In relative terms, the weights for molecular energies, charges, and volumes correspond to 1 kcal/mol, 0.0063 e, and 0.65 Å³, respectively. We also found that the training results are not very sensitive to the choice of the weights $w_p$.
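Eq. 1.1 can be written directly as a triple sum. The sketch below assumes targets and predictions are nested as `y[t][p][i]` (pass, property, sample); the function name and data layout are illustrative. Per-molecule energies are assumed to be already divided by the atom count, as the text describes.

```python
def multipass_loss(y, y_hat, w_t, w_p):
    """Weighted MSE of Eq. 1.1: sum over passes t, target properties p,
    and samples i of w_t * w_p * (y - y_hat)^2, divided by the number of
    samples N."""
    n = len(y[0][0])  # number of samples N
    total = 0.0
    for t in range(len(y)):
        for p in range(len(y[t])):
            for yi, yhi in zip(y[t][p], y_hat[t][p]):
                total += w_t[t] * w_p[p] * (yi - yhi) ** 2
    return total / n


# One pass (w_t = [1.0]), one property (w_p = [1.0]), two samples:
loss = multipass_loss(
    y=[[[1.0, 2.0]]],
    y_hat=[[[1.0, 0.0]]],
    w_t=[1.0],
    w_p=[1.0],
)
print(loss)  # (0^2 + 2^2) / 2 = 2.0
```

In a real training loop this would be vectorized over tensors rather than looped, but the weighting structure is the same.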

To accelerate training, the models were initially trained with t = 1. Then, the weights of the last layer of the update network were initialized with zeros to produce a zero atomic feature update, and the model was trained with t = 3 passes. We also used cold restarts (resetting the learning rate and the optimizer's moving-average statistics) to achieve better training results. For t = 1, the networks were trained for 500 epochs on average, followed by 500 epochs with t = 3, for a total of about 270 hours on a workstation with dual Nvidia GTX 1080 GPUs. Typical learning curves are shown in fig. S1.
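The zero-initialization trick above can be illustrated with a toy dense layer. Assuming the feature update is applied additively (which the "zero atomic feature update" statement implies), zeroing the last layer's weights and biases makes the t = 3 model initially reproduce the converged t = 1 features exactly; the helper below is a hypothetical sketch, not the AIMNet implementation.

```python
def linear(x, W, b):
    """Dense layer y = W @ x + b, with W given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]


# Atomic features after the first pass, and a zero-initialized last layer:
features = [0.3, -1.2, 0.7]
W_zero = [[0.0] * 3 for _ in range(3)]
b_zero = [0.0] * 3

update = linear(features, W_zero, b_zero)          # all zeros
new_features = [f + u for f, u in zip(features, update)]
print(new_features == features)  # True: zero update leaves features unchanged
```

Training then moves the last layer's weights away from zero, letting the extra passes refine the features without first having to relearn the t = 1 solution.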
