Both the actor and critic LSTM networks consisted of 128 units each and were implemented using TensorFlow’s Keras API. The weight matrices Uq were initialized using Keras’s ‘glorot_uniform’ initializer, the weight matrices Wq were initialized using Keras’s ‘orthogonal’ initializer and the biases bq were initialized to 0. The output and memory states for both LSTM networks were initialized to zero at the beginning of each training or testing episode.
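As a minimal sketch, the stated unit count and initializers map onto Keras as follows; the helper name is illustrative and any layer arguments beyond those stated in the text are assumptions.

```python
import tensorflow as tf

# Sketch of one of the 128-unit recurrent networks (function name is
# illustrative). kernel_initializer sets the input weight matrices U_q,
# recurrent_initializer sets the recurrent weight matrices W_q, and
# bias_initializer sets the biases b_q. Keras LSTM layers start each
# sequence with zero output and memory states by default, matching the
# zero-initialization described in the text.
def make_lstm():
    return tf.keras.layers.LSTM(
        units=128,
        kernel_initializer="glorot_uniform",
        recurrent_initializer="orthogonal",
        bias_initializer="zeros",
        return_sequences=True,
    )
```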
Input to the critic was identical to the smoothed, single-trial input used for the synaptic plasticity model described above, except that i) activity was not interpolated, because each time step in this model matched the sampling interval of the collected data (10 Hz), and ii) we input only the activity from 2s before to 2s after the lever press (versus 3s after the lever press for the synaptic plasticity model) in order to reduce the computational complexity of training. To further reduce episode length, and therefore training time, we also excluded those neurons whose peak activity occurred more than 2s after the lever press, reducing the final number of ‘pseudoneurons’ used as input to 306 (compared with 368 for the synaptic plasticity model).
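The windowing and neuron-exclusion steps can be sketched as follows, assuming activity is stored as a (neurons × time points) array sampled at 10 Hz; all names are illustrative, not from the original code.

```python
import numpy as np

def restrict_window(activity, press_idx, fs=10, pre_s=2.0, post_s=2.0):
    """Keep activity from 2 s before to 2 s after the lever press, and
    drop neurons whose peak falls more than 2 s after the press.

    activity: array of shape (n_neurons, n_timepoints) sampled at fs Hz.
    press_idx: time index of the lever press.
    """
    start = press_idx - int(pre_s * fs)
    stop = press_idx + int(post_s * fs)
    windowed = activity[:, start:stop + 1]
    # Exclude neurons peaking after the end of the window.
    peak_times = activity.argmax(axis=1)
    keep = peak_times <= stop
    return windowed[keep], keep
```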
Optogenetic-like stimulation of the PL-NAc population (Figures 7D and 7E) was performed in a similar manner to the synaptic plasticity model, with activity set to 0.15 for a randomly selected 70% of neurons for the duration of the trial.
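This stimulation can be sketched as below; the fraction (70%) and value (0.15) follow the text, while the function and variable names are illustrative.

```python
import numpy as np

def stimulate(activity, frac=0.7, value=0.15, rng=None):
    """Clamp a random 70% of neurons to 0.15 for the whole trial.

    activity: array of shape (n_neurons, n_timepoints).
    """
    rng = np.random.default_rng(rng)
    n_neurons = activity.shape[0]
    stim = rng.choice(n_neurons, size=int(round(frac * n_neurons)),
                      replace=False)
    out = activity.copy()
    out[stim, :] = value  # hold stimulated neurons at a fixed value
    return out, stim
```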
Each trial was 4s long, starting 2s before the lever press and ending 2s after it. At any given time, the model had three possible actions: choose left, choose right, or do nothing. As in the synaptic plasticity model, the model made its decision to choose left or right at the start of a trial, which then triggered the start of the corresponding choice-selective sequential activity. Unlike the synaptic plasticity model, however, the model could also choose ‘do nothing’ on the first time step, in which case an all-zero activity pattern was input to the critic for the rest of the trial. For all other time steps, the correct response was to ‘do nothing’. Choosing ‘do nothing’ on the first time step, or choosing anything other than ‘do nothing’ on subsequent time steps, resulted in a reward r(t) of −1 at that time. If a left or right choice was made on the first time step, then the current trial was rewarded based on the reward probabilities of the current block (Figure 1A) and the reward input r(t) to the critic was modeled by a truncated Gaussian temporal profile centered at the time of the peak reward (Equation 20) with the same parameters as in the synaptic plasticity model.
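The per-time-step reward rule can be sketched as follows, assuming actions are coded as 0 = do nothing, 1 = left, 2 = right, and with `reward_profile(t)` standing in for the truncated Gaussian profile of Equation 20 (whose parameters are not reproduced here); names are illustrative.

```python
def step_reward(t, action, trial_rewarded, reward_profile):
    """Reward r(t) for one time step of a trial.

    t: time-step index within the trial (0 = first step).
    action: 0 = do nothing, 1 = left, 2 = right.
    trial_rewarded: 1 if this trial's choice was rewarded, else 0.
    reward_profile: callable t -> truncated-Gaussian reward value.
    """
    if t == 0:
        if action == 0:
            return -1.0  # 'do nothing' on the first step is penalized
        return trial_rewarded * reward_profile(t)
    if action != 0:
        return -1.0  # only 'do nothing' is correct after the first step
    return trial_rewarded * reward_profile(t)
```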
We used a slightly modified version of the reversal learning task performed by the mice in which the block reversal probabilities were altered in order to make the block reversals unpredictable. This was done to discourage the model from learning the expected times of block reversals based on the number of rewarded trials in a block and to instead mimic the results of our behavioral regressions (Figure 1E) suggesting that the mice use only the previous ~4 trials to make a choice. To make the block reversals unpredictable, the identity of the high-probability lever reversed after a random number of trials drawn from a geometric distribution (Equation 1) with p = 0.9.
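The unpredictable reversal schedule can be sketched with numpy's geometric sampler, which returns trial counts ≥ 1; the helper name is illustrative.

```python
import numpy as np

def sample_block_lengths(n_blocks, p=0.9, rng=None):
    """Draw the number of trials before each block reversal from a
    geometric distribution with p = 0.9 (Equation 1)."""
    rng = np.random.default_rng(rng)
    return rng.geometric(p, size=n_blocks)
```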
Each training episode was 15 trials long and the model was trained for 62,000 episodes. For this model, we used a time step Δt = 0.1s. The values of the training hyperparameters were as follows: the scaling factor of the critic loss term βv = 0.05, the scaling factor of the entropy regularization term βe = 0.05, the learning rate α = 0.01 s−1 (α = 0.001 per time step), and the timescale of temporal discounting within a trial τ = 2.5s, leading to a discount factor of e−Δt/τ ≈ 0.96 for all times except for the last time step of a trial, when the discount factor was 0 to denote the end of a trial. The network’s weights and biases were trained using the RMSprop gradient descent optimization algorithm (Hinton et al., 2012) and backpropagation through time, which involved unrolling the LSTM network over an episode (630 time steps).
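The within-trial discounting implied by the stated hyperparameters can be sketched as follows (Δt = 0.1s, τ = 2.5s, discount set to 0 on a trial's last step); the function name is illustrative.

```python
import numpy as np

def discount_factors(n_steps, dt=0.1, tau=2.5):
    """Per-step discount factors for one trial: gamma = exp(-dt/tau)
    everywhere except the last step, which is 0 to mark the trial end."""
    gamma = np.exp(-dt / tau)
    g = np.full(n_steps, gamma)
    g[-1] = 0.0
    return g
```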
Block reversal probabilities for the testing phase were the same as in the probabilistic reversal learning task performed by the mice. The average block length for the PL-NAc neural dynamics model was 19.3 ± 5.0 trials (mean ± std. dev.).
The model’s performance (Figures 6B–6J) was evaluated in a testing phase during which all network weights were held fixed so that reversal learning was accomplished solely through the neural dynamics of the LSTM networks. The network weights used in the testing phase were the weights learned at the end of the training phase. A testing episode was chosen to be 1500 trials long and the model was run for 120 episodes.
For Figures 6G–6J, we tested the model’s performance on a slightly modified version of the reversal learning task in which, after training, block lengths were fixed at 30 trials. This facilitated the calculation and interpretation of the block-averaged activity on a given trial of a block. Dimensionality reduction of the actor network activity (Figure 6H) was performed using the PCA function from the decomposition module in Python’s scikit-learn package.
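The PCA step can be sketched as follows, assuming the actor network activity is arranged as a (samples × units) matrix; the array here is random stand-in data and the component count is illustrative, not taken from the original analysis.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the actor network activity: (n_samples, n_units) matrix.
actor_activity = np.random.default_rng(0).normal(size=(200, 128))

# Project the activity onto its leading principal components.
pca = PCA(n_components=3)
projected = pca.fit_transform(actor_activity)
```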
In Figure 6F, we analyzed how model performance changed when the temporal structure provided by the choice-selective sequential inputs to the critic was replaced during training by persistent choice-selective input. The persistent choice-selective input was generated by setting the activity of all left-choice-selective neurons to 1 and all right-choice-selective neurons to 0 at all time points on left-choice trials, and vice versa on right-choice trials.
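The persistent-input construction can be sketched as below, assuming left-selective neurons occupy the first rows of the input array; names are illustrative.

```python
import numpy as np

def persistent_input(choice, n_left, n_right, n_steps):
    """Persistent choice-selective input: on a left-choice trial, all
    left-selective units are 1 and all right-selective units are 0 for
    every time point, and vice versa on right-choice trials."""
    x = np.zeros((n_left + n_right, n_steps))
    if choice == "left":
        x[:n_left, :] = 1.0
    else:
        x[n_left:, :] = 1.0
    return x
```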