At the beginning of every trial, the cart is positioned at the centre of the track, the pole angle is drawn uniformly at random from [−7.5, 7.5] degrees, and both velocities (cart velocity and angular pole velocity) are set to zero.
The pole has to remain within the green circular segment (±60 degrees, Fig. 1a) while the cart must not leave the track (±5 m). To achieve this, subjects use the input device to apply forces of up to 4 N to the cart from either side. Depending on their direction, these forces accelerate the cart to the left or to the right, which allows the pole to be balanced. A trial was considered successful if balance was maintained for 30 seconds without violating any constraint; hence, trials lasted at most 30 seconds. Violating either constraint (cart position or pole angle) also terminates the trial. The number of trials was not limited; instead, we limited the duration of the experiment (see below). Before the next trial begins, feedback about the violated constraint and the duration of the trial is provided.
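The trial rules above can be summarized in a short sketch. This is an illustration under stated assumptions, not the software used in the experiment: the names `CartPoleState`, `initial_state`, `clip_force`, and `trial_outcome` are hypothetical, the limits are treated as inclusive (a pole at exactly ±60 degrees still counts as balanced), and the order in which the two constraints are checked is arbitrary.

```python
import random
from dataclasses import dataclass

# Values taken from the task description.
MAX_INITIAL_ANGLE_DEG = 7.5   # initial pole angle is uniform in [-7.5, 7.5] degrees
MAX_POLE_ANGLE_DEG = 60.0     # pole must stay within +/-60 degrees
TRACK_HALF_LENGTH_M = 5.0     # cart must stay within +/-5 m
MAX_FORCE_N = 4.0             # applied force is limited to 4 N in either direction
MAX_TRIAL_DURATION_S = 30.0   # a trial ends successfully after 30 s


@dataclass
class CartPoleState:
    """Hypothetical container for the four state variables."""
    cart_position_m: float
    cart_velocity_m_s: float
    pole_angle_deg: float
    pole_velocity_deg_s: float


def initial_state(rng: random.Random) -> CartPoleState:
    """Cart centred, both velocities zero, pole angle drawn uniformly."""
    angle = rng.uniform(-MAX_INITIAL_ANGLE_DEG, MAX_INITIAL_ANGLE_DEG)
    return CartPoleState(0.0, 0.0, angle, 0.0)


def clip_force(force_n: float) -> float:
    """Limit the commanded force to the allowed range of +/-4 N."""
    return max(-MAX_FORCE_N, min(MAX_FORCE_N, force_n))


def trial_outcome(state: CartPoleState, elapsed_s: float):
    """Return the violated constraint, 'success', or None if the trial continues.

    Assumption: limits are inclusive, and cart position is checked first.
    """
    if abs(state.cart_position_m) > TRACK_HALF_LENGTH_M:
        return "cart_position"
    if abs(state.pole_angle_deg) > MAX_POLE_ANGLE_DEG:
        return "pole_angle"
    if elapsed_s >= MAX_TRIAL_DURATION_S:
        return "success"
    return None
```

In a simulation loop, `trial_outcome` would be evaluated once per frame; any value other than `None` ends the trial and determines the feedback shown before the next one.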
In addition to the terminal feedback at the end of every trial, subjects also receive a cumulative reward during the trial, which is displayed as a number in the cart and updated in every frame (Fig. 1a). The theoretical maximum reward is 10 points per second, which is achieved by holding the pole perfectly vertical and keeping the cart exactly in the centre of the track while not applying any force to the cart (for details, see S1 Appendix). Thus, in theory, a maximum reward of 300 points per trial is achievable. To approach this maximum, a subject has to keep the system within the constraints for the full 30 seconds, i.e. balance the pole on the cart without leaving the track. Subjects were instructed about these factors and were asked to maximize the reward in every single trial.
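The 300-point bound follows directly from the per-second maximum and the trial length. As a sketch, write $r(t)$ for the instantaneous reward rate and $T$ for the duration of a trial; both symbols are introduced here for illustration and are not taken from S1 Appendix, which defines the actual reward function.

```latex
R \;=\; \int_{0}^{T} r(t)\,\mathrm{d}t
  \;\le\; 10\,\frac{\text{points}}{\text{s}} \times 30\,\text{s}
  \;=\; 300\ \text{points},
\qquad r(t) \le 10\ \text{points/s},\quad T \le 30\ \text{s}.
```

Equality would require both $T = 30$ s and $r(t) = 10$ points/s throughout, which is why a trial that ends early on a constraint violation cannot reach the maximum.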