The stimulus set consisted of object images and audio recordings of a human voice uttering their corresponding German names. The image set comprised 12 silhouette color photographs of everyday objects on a gray background (Figure 1A). In addition to these 12 objects, an image of a paper clip was used as a target stimulus in catch trials of the perception task (see below). The audio recordings were 12 spoken German words taken from a German standard dictionary website (Duden, https://www.duden.de), each corresponding to one of the object images. Each recording was digitized at a 44.1 kHz sampling rate and normalized by its root mean square (RMS) amplitude. The average duration of the sound recordings was 554.3 ms (SD: ± 17.8 ms).
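For illustration, a minimal Python sketch of such an RMS normalization step (the original stimuli were prepared with other software; the file names and the target RMS value here are hypothetical):

```python
import numpy as np
import soundfile as sf  # any WAV I/O library would work equally well

def rms_normalize(wav_path, out_path, target_rms=0.1):
    """Scale a recording so that its root mean square amplitude equals target_rms."""
    audio, sr = sf.read(wav_path)            # sr is expected to be 44100 Hz
    rms = np.sqrt(np.mean(audio ** 2))
    sf.write(out_path, audio * (target_rms / rms), sr)

# Example call (hypothetical file names):
# rms_normalize("apfel.wav", "apfel_norm.wav")
```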
The experiment consisted of two identical recording sessions performed on two different days. Within each session, participants first completed the perception task (Figure 1B) and then the mental imagery task (Figure 1C). Additionally, they completed a third, auditory task that addressed a different research question and is not reported in the current manuscript. Experimental stimuli were delivered using Psychtoolbox [79].
In the perception task, participants viewed the object images. On each trial, one of the object images (∼2.9° visual angle) was presented for 500 ms at the center of the screen, overlaid with a black fixation cross. Participants were instructed to press a button and blink their eyes when the image of the paper clip appeared (on average every 5th trial). Trials were separated by an inter-trial interval (ITI) of 300 ms, 400 ms, or 500 ms, during which only the fixation cross was presented. Participants were instructed to maintain central fixation throughout the experiment. Following catch trials, the ITI was lengthened by 1000 ms to avoid contaminating the subsequent trial with motor artifacts. In each recording session, participants completed 600 trials of the perception task, split into two blocks separated by a self-paced break.
In the mental imagery task, participants were presented with the audio recordings of the words and were asked to actively imagine the object corresponding to the word they had heard. Each trial started with a red fixation cross; 500 ms later, the audio recording of an object name was played. Participants were instructed to visually imagine the corresponding object image for 2,500 ms, starting as soon as they heard the object name. After the imagery period, participants indicated whether the vividness of their mental image was high or low by selecting one of two letters (H versus L) on a 1,500 ms response screen. The positions of the response options were counterbalanced across trials. Participants indicated high vividness of their imagery in the majority of trials (83.4%, SD: ± 0.09), suggesting that, subjectively, participants formed precise mental images of the objects. Trials were separated by an ITI of 300 ms, 400 ms, or 500 ms, during which a black fixation cross was presented. In each recording session, participants completed 480 trials of the imagery task, split into four blocks interrupted by self-paced breaks.
To familiarize participants with the mental imagery task and to ensure they could vividly imagine the objects, we trained them in imagining our object images prior to the mental imagery task. During this training procedure, participants practiced imagining the 12 objects after hearing audio recordings of their names. On each trial of the training procedure, participants first viewed one of the object images for as long as they wished, in order to familiarize themselves with the image. Once they were confident that they could imagine the object, they proceeded to the 2,500 ms imagery period, during which they first heard the audio recording and then imagined the object. After this imagery period, participants again viewed the object image for as long as they wished, in order to self-evaluate the correctness of their imagery. After each object had been trained once (i.e., after 12 trials), participants entered a two-alternative forced-choice test procedure in which, on each trial (one per object, i.e., 12 trials), an object image from the stimulus set was presented alongside a very similar foil image not from the stimulus set. Foil images were drawn randomly from a set of three alternatives. If participants achieved 80% correct in this test, they proceeded to the main experiment; otherwise, the training procedure and the subsequent test were repeated until they reached 80% correct.
EEG data were recorded using an EASYCAP 64-channel system and a BrainVision actiCHamp amplifier. The 64 electrodes were arranged according to the standard 10-10 system. Acquisition was continuous at a sampling rate of 1000 Hz, and the EEG data were filtered online between 0.3 and 100 Hz. All electrodes were referenced online to the Fz electrode. Offline preprocessing was carried out using Brainstorm [80]. Eyeblinks and eye movements were detected on the frontal electrodes Fp1, Fp2, AF7 and AF8 of the 64-channel EASYCAP system and removed using the ‘SSP: Eye blinks’ (signal-space projection) algorithm implemented in Brainstorm. We visually inspected the resulting components and removed those resembling the spatiotemporal properties of eyeblinks and eye movements. Between one and four components were removed per participant, and a clear eyeblink component was always found and removed. To avoid edge artifacts in the subsequent time-frequency decomposition, the continuous raw EEG data were segmented into epochs from 600 ms before to 1100 ms after stimulus onset in the visual perception task and from 600 ms before to 3100 ms after stimulus onset in the mental imagery task. For the main analysis, data were time-locked to the onset of the visual image in the perception task and to the onset of the auditory word in the imagery task; time-locking the imagery data to the offset of each word yielded qualitatively similar results in the key analyses described below (Figures S1Q–S1S). The epoched data were baseline-corrected by subtracting the mean of the pre-stimulus interval, separately for each channel and trial.
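As an illustration of the epoching and baseline-correction steps (the original preprocessing was carried out in Brainstorm), a minimal numpy sketch, assuming the continuous data are available as a channels-by-samples array and the stimulus onsets as sample indices:

```python
import numpy as np

def epoch_and_baseline(raw, onsets, sfreq=1000, tmin=-0.6, tmax=1.1):
    """Cut continuous data (channels x samples) into epochs around stimulus onsets
    (given as sample indices) and subtract the mean of the pre-stimulus interval,
    separately for each channel and trial. For the imagery task, tmax would be 3.1."""
    n_pre, n_post = int(round(-tmin * sfreq)), int(round(tmax * sfreq))
    epochs = np.stack([raw[:, s - n_pre:s + n_post] for s in onsets])  # trials x channels x time
    baseline = epochs[:, :, :n_pre].mean(axis=2, keepdims=True)
    return epochs - baseline
```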
EEG data recorded for the visual perception task and for the mental imagery task were analyzed separately. To recover induced oscillatory responses, the data were convolved with complex Morlet wavelets (constant length of 600 ms, logarithmically spaced in 20 frequency bins between 5 Hz and 31 Hz), separately for each trial and each sensor. Taking the modulus of the resulting complex time-frequency coefficients (i.e., the square root of the sum of their squared real and imaginary parts) yielded absolute power values for each time point and each frequency between 5 Hz and 31 Hz. These power values were normalized to reflect relative changes (expressed in dB) with respect to the pre-stimulus baseline (−500 ms to −300 ms relative to stimulus onset). To increase the signal-to-noise ratio of all further analyses, we downsampled the time-frequency representations to a temporal resolution of 50 Hz (by averaging data in 20-ms bins) and aggregated the 20 frequency bins into three discrete frequency bands, which we analyzed separately: theta (5-7 Hz, 5 bins), alpha (8-13 Hz, 6 bins) and beta (14-31 Hz, 9 bins).
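The following Python sketch illustrates a time-frequency decomposition of this kind. It is not the MATLAB implementation used for the reported analyses, and the wavelet width parameter (number of cycles) is an assumption not specified above:

```python
import numpy as np

sfreq = 1000                                           # Hz, acquisition rate
freqs = np.logspace(np.log10(5), np.log10(31), 20)     # 20 log-spaced frequencies
t_wav = np.arange(-0.3, 0.3, 1 / sfreq)                # constant 600-ms wavelet support

def tfr_power(epoch, n_cycles=7):
    """epoch: channels x samples array. Returns a freqs x channels x samples array of
    absolute power (modulus of the complex Morlet coefficients). n_cycles is assumed."""
    power = np.empty((len(freqs),) + epoch.shape)
    for i, f in enumerate(freqs):
        sigma = n_cycles / (2 * np.pi * f)                             # Gaussian width (s)
        wavelet = np.exp(2j * np.pi * f * t_wav) * np.exp(-t_wav**2 / (2 * sigma**2))
        wavelet /= np.abs(wavelet).sum()                               # amplitude normalization
        power[i] = np.abs([np.convolve(ch, wavelet, mode="same") for ch in epoch])
    return power

def db_baseline(power, times, base=(-0.5, -0.3)):
    """Express power as dB change relative to the pre-stimulus baseline window."""
    ref = power[..., (times >= base[0]) & (times < base[1])].mean(axis=-1, keepdims=True)
    return 10 * np.log10(power / ref)

# Downsampling to 50 Hz then amounts to averaging over consecutive 20-ms bins, and band
# aggregation to averaging over the frequency rows falling in each band (e.g., 8-13 Hz).
```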
To uncover shared representations between perception and imagery, we trained classifiers to discriminate pairs of objects from EEG data recorded during one task (e.g., perceiving an apple versus perceiving a car) and tested them on EEG data recorded for the same two objects in the other task (e.g., imagining an apple versus imagining a car). Above-chance classification performance in this cross-task procedure indicates that similar representations are evoked by imagining and perceiving objects. Classification was performed in a time- and frequency band-resolved fashion, that is, separately for each frequency band and each time point. This allowed us to quantify (1) which frequency bands mediate these shared representations, and (2) with which temporal dynamics these representations emerge.
The detailed steps of the procedure are as follows. First, the data for each trial, each frequency band, and each time point were unfolded into a single pattern vector. For this, the data were averaged across the frequencies contained in the frequency band (e.g., across the 6 frequency bins between 8 and 13 Hz for the alpha band), yielding a 63-element pattern vector (i.e., one value for each electrode). Note that results did not depend on the particulars of how data were aggregated in the frequency domain: a control analysis in which we concatenated the data across all frequency bins in each band, instead of averaging across them (e.g., yielding 6 frequency bins × 63 electrodes = 378-element pattern vectors for the alpha band), produced qualitatively equivalent results (Figures S1G–S1L).
Second, we created four pseudo-trials for every condition by averaging pattern vectors across trials where the same object was shown in the same task: for example, this resulted in four pseudo-trials for the apple in the imagery task, each constituting the average of 25% of the available trials (assigned randomly).
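A minimal Python sketch of this pseudo-trial averaging step, assuming the single-trial pattern vectors for one object in one task are stacked in a trials-by-features array:

```python
import numpy as np

def make_pseudotrials(trials, n_pseudo=4, rng=None):
    """trials: n_trials x n_features array for one object in one task.
    Randomly split the trials into n_pseudo groups and average within each group."""
    rng = np.random.default_rng(rng)
    order = rng.permutation(len(trials))
    groups = np.array_split(order, n_pseudo)
    return np.stack([trials[idx].mean(axis=0) for idx in groups])  # n_pseudo x n_features
```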
Third, we trained and tested linear support vector machines (C-SVC with a linear kernel and a cost parameter of c = 1, as implemented in the libsvm package [81]) using these pseudo-trials. This classification was performed across tasks: for each pairwise combination of objects, we trained classifiers to discriminate the objects using the four pseudo-trials in one task (e.g., the perception task). We then tested these classifiers on the same two objects using data from the four pseudo-trials in the other task (e.g., the imagery task). Classification was repeated across both train-test directions (i.e., train on perception and test on imagery data, and train on imagery and test on perception data) and across all pairwise object combinations, and classifier performance (i.e., classification accuracy) was averaged across these repetitions. Averaging was performed along the “perception” and “imagery” axes of both analysis variants, so that a successful generalization from perception at 200 ms to imagery at 800 ms ended up at the same point in the time-generalization matrix, independently of the train-test direction. Results were consistent across both train-test directions (Figures S1O and S1P). Finally, the whole classification analysis was repeated 100 times, with new random assignments of trials into pseudo-trials, and results were averaged across these 100 repeats.
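For illustration, a Python sketch of the pairwise cross-task classification at one pair of time points, using scikit-learn's SVC (a wrapper around libsvm's C-SVC) in place of the MATLAB libsvm interface used originally; the dictionary-based data layout is an assumption made for the example:

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def crosstask_accuracy(percep, imag):
    """percep, imag: dicts mapping object name -> (4 pseudo-trials x 63 features) arrays,
    taken at one perception time point and one imagery time point, respectively.
    Averages accuracy over all object pairs and both train-test directions."""
    accs = []
    for a, b in combinations(percep.keys(), 2):               # all 66 object pairs
        for train, test in ((percep, imag), (imag, percep)):  # both train-test directions
            X_tr = np.vstack([train[a], train[b]])
            y_tr = np.r_[np.zeros(len(train[a])), np.ones(len(train[b]))]
            X_te = np.vstack([test[a], test[b]])
            y_te = np.r_[np.zeros(len(test[a])), np.ones(len(test[b]))]
            clf = SVC(kernel="linear", C=1).fit(X_tr, y_tr)
            accs.append(clf.score(X_te, y_te))
    return np.mean(accs)
```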
Importantly, as the temporal dynamics of cortical responses to perceived and imagined objects are not expected to be identical (e.g., responses during imagery could be delayed, slowed or reversed), we performed the classification analyses in a time-generalization fashion [8]. That is, we did not only train and test classifiers on the same time points relative to stimulus presentation, but trained and tested classifiers on each combination of time points from the perception task (i.e., from 0 to 800 ms with respect to image onset) and the imagery task (i.e., from 0 to 2,500 ms with respect to sound onset). The analysis thus yielded time-generalization matrices that indicate how well classifiers trained at one particular time point during perception perform at each time point during imagery (and vice versa). The resulting time-generalization matrices thereby provided a full temporal characterization of shared representations between perception and imagery, separately for each of the three frequency bands (Figures S1A–S1C; see Figures S1D–S1F for the alternative data aggregation method).
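Building on the cross-task sketch above, the time-generalization procedure can be illustrated as a loop over all combinations of perception and imagery time points (again a schematic, not the original implementation):

```python
import numpy as np

def time_generalization(percep_data, imag_data, pair_accuracy):
    """percep_data: object -> (pseudo-trials x features x n_perception_times) array;
    imag_data:   object -> (pseudo-trials x features x n_imagery_times) array;
    pair_accuracy: a scoring function such as crosstask_accuracy from the sketch above.
    Returns an n_perception_times x n_imagery_times matrix of cross-task accuracies."""
    n_p = next(iter(percep_data.values())).shape[-1]
    n_i = next(iter(imag_data.values())).shape[-1]
    matrix = np.empty((n_p, n_i))
    for tp in range(n_p):
        percep_t = {obj: d[..., tp] for obj, d in percep_data.items()}
        for ti in range(n_i):
            imag_t = {obj: d[..., ti] for obj, d in imag_data.items()}
            matrix[tp, ti] = pair_accuracy(percep_t, imag_t)
    return matrix
```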
In addition to the cross-task classification analysis, we also performed a within-task classification analysis in which we classified objects from EEG data recorded within one task, i.e., solely for the perception task or solely for the imagery task, again separately for each frequency band. This analysis was carried out in the same way as the cross-classification analysis (see above), but with a leave-one-pseudo-trial-out cross-validation scheme: we trained classifiers to discriminate two objects using data from three of the four pseudo-trials and then tested these classifiers using data from the remaining fourth pseudo-trial. Classification was repeated 100 times, with new random assignments of trials into pseudo-trials, and results were averaged across these 100 repeats. For the within-task classification analyses, we yoked training and testing times, yielding a time course of classification accuracies for each frequency band and task (Figures S1H–S1M).
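A sketch of the leave-one-pseudo-trial-out scheme for the within-task analysis, under the same data-layout assumptions as the cross-task example above:

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def within_task_accuracy(data):
    """data: object -> (4 pseudo-trials x features) array at one time point, one task.
    Leave-one-pseudo-trial-out cross-validation over all pairwise object combinations."""
    accs = []
    for a, b in combinations(data.keys(), 2):
        for left_out in range(4):
            train_idx = [i for i in range(4) if i != left_out]
            X_tr = np.vstack([data[a][train_idx], data[b][train_idx]])
            y_tr = np.r_[np.zeros(3), np.ones(3)]
            X_te = np.vstack([data[a][[left_out]], data[b][[left_out]]])
            clf = SVC(kernel="linear", C=1).fit(X_tr, y_tr)
            accs.append(clf.score(X_te, [0, 1]))
    return np.mean(accs)
```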
In the main analyses we chose a pre-defined, canonical range of frequencies to define the alpha band (8-13 Hz). However, peak alpha frequencies may vary between participants [78], suggesting that the alpha band could instead be defined separately for each participant. To determine the impact of individual variation in alpha frequency on our analysis, we repeated the cross-classification analysis based on each participant's individual peak alpha frequency and the corresponding alpha band definition. We defined participant-specific peak frequencies and bands using the following procedure. We first computed object classification on data from the perception task only, considering data at each frequency between 8 and 13 Hz with 1 Hz resolution together with its two immediate neighbor frequencies (e.g., for 9 Hz also including 8 and 10 Hz). For each participant, the peak alpha frequency was the frequency at which within-task object classification accuracy was highest. The corresponding participant-specific frequency band was defined as the peak frequency and its two immediate neighbor frequencies (e.g., for a peak frequency of 8 Hz, the band is 7–9 Hz). We then repeated the cross-classification analysis using these participant-specific alpha bands. This yielded qualitatively similar results to the analysis based on the canonical alpha frequency band (Figures S1T–S1V).
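A schematic of the peak-selection step, assuming the per-frequency within-task accuracies (each already computed from a frequency and its two immediate neighbors) are available as an array aligned with a 1-Hz frequency axis:

```python
import numpy as np

def individual_alpha_band(accuracy_by_freq, freqs):
    """accuracy_by_freq: perception-task classification accuracy per 1-Hz frequency;
    freqs: the corresponding frequency axis in Hz. Returns the peak alpha frequency
    within 8-13 Hz and the surrounding three-frequency band."""
    candidates = (freqs >= 8) & (freqs <= 13)
    peak = freqs[candidates][np.argmax(accuracy_by_freq[candidates])]
    return peak, (peak - 1, peak + 1)
```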
To determine whether cross-classification is enabled by large-scale net increases or decreases in alpha power, we performed an additional analysis in which we binned trials in the perception task according to whether they exhibited an increase or a decrease in alpha power relative to baseline. We then re-performed the cross-classification analysis using only the perception-task trials that showed either an alpha power enhancement (45% of trials) or an alpha power suppression (55% of trials). To avoid bias, we equalized the number of trials by subsampling the alpha-suppression trials. This analysis revealed no significant differences between alpha-enhanced and alpha-suppressed trials (Figure S1W).
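A sketch of the trial-binning and subsampling step, assuming a per-trial measure of alpha power change relative to baseline has already been computed:

```python
import numpy as np

def split_by_alpha_change(alpha_power_db, rng=None):
    """alpha_power_db: per-trial alpha power change vs. baseline (in dB), perception task.
    Returns index arrays for enhancement and suppression trials; the larger set is
    effectively subsampled so that both sets have the same number of trials."""
    rng = np.random.default_rng(rng)
    enhanced = np.flatnonzero(alpha_power_db > 0)
    suppressed = np.flatnonzero(alpha_power_db < 0)
    n = min(len(enhanced), len(suppressed))
    return (rng.choice(enhanced, n, replace=False),
            rng.choice(suppressed, n, replace=False))
```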
To investigate whether alpha-band representations shared between perception and imagery are related to parieto-occipital or frontal alpha mechanisms, we conducted separate cross-classification analyses using either the anterior or the posterior half of the electrodes in our EEG montage. The anterior half consisted of the 35 electrodes located over frontal, temporal and central parts of the scalp (the Fp, AF, F, FT, T, and C channels in the EASYCAP 64-channel system). The posterior half consisted of 37 electrodes located over occipital and parietal parts of the scalp (the C, T, CP, P, TP, PO and O channels). The central and temporal channels were included in both halves. For both analyses, classification procedures were the same as described for the analysis including all electrodes.
As an additional measure of spatial localization, we examined the distribution of classifier weights obtained from training classifiers on data from all sensors. During classification analysis, each feature (here, each EEG electrode) is assigned a weight reflecting the degree to which the classifier relies on it to maximize class separation. Classification weights therefore index the degree to which different electrodes contain class-specific information. To directly compare the weights of electrodes across time, we transformed the weights into activation patterns by multiplying them with the covariance matrix of the training data [83]. For display purposes, we projected the reconstructed activation patterns onto a scalp topography (Figures S1X and S1Y). This analysis of classifier weights was done twice: once for classifiers trained on data from the perception task, and once for classifiers trained on data from the imagery task. We thereby obtained two sets of classifier weights across the scalp and across time, which allowed us to localize features relevant for detecting shared representations in sensor space.
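The weight-to-pattern transformation [83] amounts to multiplying the weight vector of a linear classifier with the covariance matrix of its training data; a minimal sketch:

```python
import numpy as np

def weights_to_pattern(weights, X_train):
    """Transform linear classifier weights into an activation pattern (cf. [83]):
    pattern = covariance of the training data multiplied by the weight vector.
    weights: (n_electrodes,); X_train: (n_samples x n_electrodes)."""
    return np.cov(X_train, rowvar=False) @ weights
```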
To characterize the nature of the representations shared between imagery and perception, we used representational similarity analysis [10, 11] in combination with computational models. The basic idea is that representations shared between imagery and perception are related to representations in a computational model if both treat the same pairs of conditions as similar or dissimilar. To determine this, in a first step, condition-specific multivariate patterns are compared pairwise for dissimilarity, independently in the neural coding space (here: EEG sensor patterns) and in the model coding space (e.g., model unit activation patterns). Dissimilarity values are aggregated in so-called representational dissimilarity matrices (RDMs), indexed in rows and columns by the conditions compared (here: 12 × 12 RDMs indexed by the 12 objects). In a second step, the neural RDMs and model RDMs are related to each other by determining their similarity. We describe the detailed procedures for constructing neural and model RDMs, as well as their comparison, below.
The procedure to construct neural RDMs was as follows. Classification accuracy can be interpreted as a dissimilarity measure under the assumption that the more dissimilar the activation patterns for two conditions are, the easier the conditions are to classify [6, 84]. Classification accuracy at each time-point combination in the cluster indexing shared representations between imagery and perception (Figure 1F) is the average of a 12 × 12 matrix of cross-classification accuracies for all pairwise object combinations. Here, instead of averaging across its entries, we extracted the full 12 × 12 RDM for each time-point combination in the cluster and averaged the RDMs across all time-point combinations, yielding a single RDM for each participant. Thus, each participant's RDM indicates the dissimilarity of object representations shared between imagery and perception.
To characterize the nature of these shared representations, we extracted model RDMs from a set of computational models. These models captured (i) the objects' visual dissimilarity, (ii) their semantic category dissimilarity, and (iii) their auditory dissimilarity (i.e., the dissimilarity of the word sounds used to cue imagery). The construction of the model RDMs was as follows.
As the visual model, we used the 19-layer deep convolutional neural network (DNN) VGG19 [56], pretrained to categorize objects in the ImageNet dataset [85]. Using the MatConvNet toolbox [82], we ran the 12 object images used in this study through the DNN and then constructed layer-specific model RDMs by quantifying the dissimilarity (1 − Pearson's r) of response patterns observed along each of the 19 layers of the DNN. We constructed 8 aggregated RDMs from these results. The first five RDMs were constructed from the convolutional layers, averaging the RDMs of convolutional layers positioned between max pooling layers, starting with the input layer (RDM1: convolutional layers 1-2; RDM2: convolutional layers 3-4; RDM3: convolutional layers 5-8; RDM4: convolutional layers 9-12; RDM5: convolutional layers 13-16). The last three RDMs (RDM6-8) were constructed from the activations of the three final fully connected layers, one RDM per layer.
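For illustration, constructing a correlation-distance RDM from one layer's activations reduces to the following sketch (the activation variables named in the comment are hypothetical; extracting them from VGG19 is not shown):

```python
import numpy as np

def correlation_rdm(activations):
    """activations: n_objects x n_features array of flattened layer activations
    for the 12 object images. Returns the 12 x 12 RDM of 1 - Pearson's r."""
    return 1 - np.corrcoef(activations)

# Aggregated RDMs are then averages of layer-wise RDMs, e.g. for RDM1:
# rdm1 = np.mean([correlation_rdm(acts) for acts in (conv1_acts, conv2_acts)], axis=0)
```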
For the semantic category model, we modeled category membership in a binary way. We split our 12 objects into four superordinate-level categories: animals (butterfly, chicken, sheep), body parts (ear, eye, hand), plants (apple, carrot, rose), and man-made objects (car, chair, violin). We then constructed a model RDM in which objects of the same category were coded as similar (−1) and objects from different categories were coded as dissimilar (+1).
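A minimal sketch of this binary category RDM:

```python
import numpy as np

categories = {"butterfly": 0, "chicken": 0, "sheep": 0,   # animals
              "ear": 1, "eye": 1, "hand": 1,              # body parts
              "apple": 2, "carrot": 2, "rose": 2,         # plants
              "car": 3, "chair": 3, "violin": 3}          # man-made objects

labels = np.array(list(categories.values()))
# Same category -> -1 (similar), different category -> +1 (dissimilar);
# the diagonal entries are not used in the RDM comparisons.
category_rdm = np.where(labels[:, None] == labels[None, :], -1, 1)
```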
We considered two auditory models: a canonical spectrotemporal model inspired by psychoacoustical and neurophysiological findings on early and central stages of the auditory system [57], and a DNN with two branches trained on musical genre classification and auditory word classification, respectively [58]. We ran all word sounds used in this study through the spectrotemporal model and the auditory DNN. We constructed auditory model RDMs by quantifying the dissimilarity (1 − Pearson's r) of response patterns observed in the 2 stages of the spectrotemporal model (i.e., auditory spectrograms and estimated cortical spectrotemporal features) and in the 11 layers of the auditory DNN (i.e., 3 early shared convolutional layers and, in each of the two branches trained on genre and word classification respectively, 4 layers: two convolutional followed by two fully connected).
To quantify how well the different models related to the representations shared between imagery and perception in the alpha frequency band, we correlated (Spearman's r) each model RDM with each participant's neural RDM.
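A sketch of this comparison, assuming (as is common in representational similarity analysis) that only the off-diagonal entries of the symmetric RDMs enter the correlation:

```python
import numpy as np
from scipy.stats import spearmanr

def rdm_correlation(neural_rdm, model_rdm):
    """Spearman correlation between two 12 x 12 RDMs, computed over the
    lower off-diagonal entries only."""
    idx = np.tril_indices(neural_rdm.shape[0], k=-1)
    rho, _ = spearmanr(neural_rdm[idx], model_rdm[idx])
    return rho
```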
Additionally, to establish how well the visual and auditory models explained the organization of visual representations (within the perception task) and auditory representations (within the imagery task), respectively, we compared these models with neural RDMs extracted from the classification analyses within the perception and imagery tasks (Figures S2D and S2E). For this, we averaged the RDMs at time points that fell within the within-task classification clusters into a single neural RDM for each task and proceeded with representational similarity analysis as described above for the cross-classification analysis.
In addition to classifying objects from oscillatory responses, we also performed conventional classification analyses [6, 86] on broadband responses (i.e., single-trial raw unfiltered waveforms). These analyses followed the same logic as the classification analyses on time-frequency data, including the averaging of individual trials into pseudo-trials prior to classification. The only difference was that classifiers were now trained and tested on response patterns across all electrodes at every time point (at the original acquisition resolution of 1000 Hz), without any frequency decomposition. As for the classification analysis on oscillatory responses, we performed a time-generalization analysis, in which we cross-classified objects between perception and imagery (Figure S1G), and a within-task classification analysis, in which we classified objects in each of the two tasks separately (Figure S1N).