Representations of the same set of 92 images were computed from the layers of two convolutional neural network architectures with different depths. The first DNN was AlexNet (Krizhevsky et al., 2012) with eight layers. The second DNN was VGG-16 (Simonyan and Zisserman, 2015) with 16 layers. We chose these DNNs because they were the best-performing models in the ImageNet Large Scale Visual Recognition Challenge (Russakovsky et al., 2015) in 2012 and 2014, respectively, and because their architectures are relatively simple compared to other DNNs. In that competition, the VGG network achieved a 7.4% top-5 error rate, whereas AlexNet achieved a 15.4% top-5 error rate. For reference, Microsoft’s 150-layer network recently obtained a 4.94% top-5 error rate, outperforming non-expert humans at 5.1% (He et al., 2016). AlexNet and VGG-16 were trained on 1.2 million images from the ImageNet database; the task was to classify each image as containing an object in one of 1,000 possible categories.
The DNN architectures share several principles with the architecture of the primate visual system, in particular hierarchical organization, convolution, and pooling (Kriegeskorte, 2015; Yamins and DiCarlo, 2016). The hierarchical series of layers, transforming information from simple features to a categorical representation, mimics the succession of cortical regions in the primate visual system. Convolution is inspired by biological vision, where local feature detectors are replicated across the visual field. Pooling confers tolerance to the position of image features, echoing the increasing view tolerance along the primate ventral stream (see Kriegeskorte, 2015; Yamins and DiCarlo, 2016 for further discussion).
The architectures of the two DNNs are schematically represented in Figure 2. AlexNet consists of eight layers, comprising five convolutional and three fully-connected layers. Three max-pooling layers follow the first, second and fifth convolutional layers. Each convolutional layer contains a number of “feature maps”, each of which is produced by a single learned filter applied systematically across spatial locations in the input layer (i.e., convolved with the input). The first convolutional layer has 96 feature maps, the second has 256, and the third, fourth and fifth have 384, 384 and 256 feature maps, respectively. VGG-16 consists of 16 layers, including 13 convolutional and three fully-connected layers. The convolutional layers form five groups, each followed by a max-pooling layer. The number of feature maps increases from 64 through 128 and 256 to 512 in the last convolutional layers. Within each feature map, the size of the convolutional filter is analogous to the receptive field of a neuron. Units in VGG-16 have smaller receptive fields than in AlexNet across early layers, but both DNNs have the same filter size in the last pooling layer. We used convolutional (conv) and fully-connected (fc) layers from both networks in our analyses.
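As a rough illustration of these architectural differences, the following sketch counts the convolutional and fully-connected layers of the two networks using recent torchvision model definitions. This is an assumption for illustration only, not the implementation used here; note also that torchvision’s AlexNet is a single-GPU variant whose per-layer feature-map counts differ slightly from the original architecture described above.

```python
# Illustrative sketch (not the implementation used in this study): inspect the
# conv/fc layer structure of AlexNet and VGG-16 via torchvision definitions.
import torch.nn as nn
from torchvision import models

for name, net in [("AlexNet", models.alexnet(weights=None)),
                  ("VGG-16", models.vgg16(weights=None))]:
    conv = [m for m in net.modules() if isinstance(m, nn.Conv2d)]
    fc = [m for m in net.modules() if isinstance(m, nn.Linear)]
    print(f"{name}: {len(conv)} conv layers, {len(fc)} fc layers")
    # Output channels = number of feature maps per conv layer.
    # (torchvision's AlexNet is a single-GPU variant, so its channel counts
    # differ slightly from the 96/256/384/384/256 of the original network.)
    print("  feature maps:", [m.out_channels for m in conv])
```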
DNN architectures and feature weighting. (A) Comparison of AlexNet and VGG-16 architectures. We used convolutional (conv) and fully-connected (fc) layers from AlexNet and VGG-16 in our analyses. (B) Schematic overview of feature weighting. The schematic shows a set of example RDMs characterizing the stimulus information represented by the DNNs. For convolutional layers, we created RDMs from activations of feature maps. For fully-connected layers, we created RDMs from activations of individual features, i.e., model units. Within each DNN layer, we used regularized (non-negative ridge) linear regression to estimate the RDM weights that best predict the similarity-judgment RDM. Each DNN layer includes a confound-mean predictor (intercept). The weights were estimated using a cross-validation procedure to prevent overfitting to a particular set of images.
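The feature-weighting step in panel (B) can be sketched as follows. This is a minimal illustration under assumptions, not the authors’ code: variable names such as `layer_rdms`, `judgment_rdm`, and the penalty `lam` are hypothetical, and in practice the penalty and the evaluation of the fitted weights would be handled by the cross-validation procedure described in the caption. The sketch expresses non-negative ridge regression as an augmented non-negative least-squares problem.

```python
# Minimal sketch of non-negative ridge regression on RDMs (assumed variable
# names; not the authors' code). Predictor RDMs and the target RDM are
# vectorized, e.g. their upper triangles.
import numpy as np
from scipy.optimize import nnls

def fit_nonneg_ridge(layer_rdms, judgment_rdm, lam=1.0):
    """layer_rdms: (n_predictors, n_pairs) vectorized feature-map/unit RDMs.
    judgment_rdm: (n_pairs,) vectorized similarity-judgment RDM.
    lam: ridge penalty; in practice chosen by cross-validation over images."""
    n_pairs = judgment_rdm.shape[0]
    # One column per predictor RDM plus a constant column (the confound-mean
    # predictor / intercept mentioned in the caption).
    X = np.column_stack([layer_rdms.T, np.ones(n_pairs)])
    # Ridge via data augmentation: append sqrt(lam) * I rows to X and zeros to
    # y, then solve with non-negativity constraints on all weights.
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(X.shape[1])])
    y_aug = np.concatenate([judgment_rdm, np.zeros(X.shape[1])])
    weights, _ = nnls(X_aug, y_aug)
    return weights  # last entry is the intercept weight
```

In a fuller implementation the intercept would typically be left unpenalized and unconstrained; it is treated like the other predictors here only to keep the sketch short.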
For each convolutional layer of each DNN, we extracted the activations in each feature map for each image, and converted these into one activation vector per feature map. Then, for each pair of images we computed the dissimilarity (squared Euclidean distance) between the activation vectors. This yielded a 92 × 92 RDM for each feature map of each convolutional DNN layer. For each fully-connected layer of each DNN, we extracted the activation of each single model unit for each image. For each pair of images, we computed the dissimilarity between the activations (squared Euclidean distance; equivalent here to the squared difference between the two activations). This yielded a 92 × 92 RDM for each model unit of each fully-connected DNN layer. The feature-map and model-unit RDMs capture which stimulus information is emphasized and which is de-emphasized by the DNNs at different stages of processing.
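A single feature-map (or model-unit) RDM of the kind described above can be computed as in the following sketch (assumed variable names; not the authors’ code):

```python
# Minimal sketch: a 92 x 92 RDM of squared Euclidean distances for one
# feature map or model unit (assumed variable names).
import numpy as np
from scipy.spatial.distance import pdist, squareform

def activation_rdm(activations):
    """activations: (n_images, n_units) array, e.g. (92, height*width) for one
    convolutional feature map, or (92, 1) for a single fully-connected unit."""
    return squareform(pdist(activations, metric="sqeuclidean"))  # (92, 92)
```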