This section provides an in-depth description of ecoset category and image selection procedures. Please refer to SI Appendix, Table S1 for a list of all 565 ecoset categories together with their word frequency, concreteness rating, frequency concreteness index (FCI), and the corresponding number of images.
Ecoset was created as a large-scale image resource for deep learning and human visual neuroscience more generally (see ref. 43 for a related dataset designed for experimental work in psychology and neuroscience). A total of 565 categories were selected based on the following: 1) their word frequency in American television and film subtitles (SUBTLEX_US, 10), 2) the perceived concreteness by human observers (11), and 3) the availability of a minimum of 700 images. Images were sourced via the overall ImageNet database (the same resource used for ILSVRC 2012) or obtained under CC BY-NC-SA 2.0 license from Bing image search and Flickr. Thorough data cleaning procedures were put in place to remove duplicates and to ensure an expected misclassification rate per category of <4%.
The aim of ecoset was to provide the community with a dataset that contains ecologically more valid categories than typical computer vision datasets that were designed toward engineering goals. Starting from all nouns in the English language, two parameters were used to guide the selection process. First, the frequency at which a given noun occurs in a linguistic corpus of spoken language was used as a proxy for concept importance. Second, human ratings of each noun's concreteness were used to focus on categories that have a physical realization and which can therefore be readily visualized (compare for example the nouns “strawberry” and “hope,” which are at opposing ends of the concreteness spectrum). Only nouns with an associated concreteness rating of 4.0 or higher were considered for inclusion. We then combined the two selection parameters, frequency and concreteness, into a single FCI (defined below). This enabled us to focus on the most common, most concrete nouns of the English language.
Estimates of noun frequency were based on a linguistic corpus consisting of American television and film subtitles (SUBTLEX_US, 10). Concreteness estimates were publicly available (11). These data were collected via Amazon Mechanical Turk, asking participants to rate words (40,000 total) with regard to their concreteness on a five-level Likert scale. Frequency estimates and concreteness ratings were each standardized to a range between 0 and 1. FCI was subsequently defined as the average of the standardized frequency and the standardized concreteness. It therefore ranges from 0 to 1. We computed the FCI for all words contained in the concreteness rating dataset (11) and processed the 3,500 nouns with the highest FCI rating in depth.
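The FCI computation can be illustrated with a minimal sketch. The text states only that both measures were standardized to a range between 0 and 1; min-max scaling is assumed here, and whether raw or log-transformed frequencies were used is not specified. The function names and the toy values are hypothetical and are not taken from SUBTLEX_US or the concreteness dataset.

```python
def min_max(values):
    """Rescale a list of numbers to [0, 1] (assumed standardization method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def frequency_concreteness_index(frequency, concreteness):
    """Return {word: FCI}, the mean of standardized frequency and concreteness.

    Only words present in both dictionaries are scored.
    """
    words = sorted(set(frequency) & set(concreteness))
    f = min_max([frequency[w] for w in words])
    c = min_max([concreteness[w] for w in words])
    return {w: (fi + ci) / 2 for w, fi, ci in zip(words, f, c)}


# Invented toy values for illustration only.
freq = {"dog": 100.0, "strawberry": 10.0, "hope": 200.0}
conc = {"dog": 4.8, "strawberry": 5.0, "hope": 1.0}

fci = frequency_concreteness_index(freq, conc)

# Rank by FCI and keep only nouns with a concreteness rating of 4.0 or higher.
candidates = [w for w in sorted(fci, key=fci.get, reverse=True) if conc[w] >= 4.0]
```

In this toy example an abstract but frequent word such as “hope” receives a mid-range FCI and is then dropped by the concreteness cutoff, which is why the cutoff and the index are applied together.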
Only nouns that describe basic-level categories were considered for inclusion. Please note that the definition of basic-level categories is a matter of ongoing scientific debate, and basic-level judgments can vary across individuals (44). Because of its inherently subjective nature, the classification of nouns that constitute basic-level categories was performed repeatedly across the whole set by the authors, and the selection was subsequently verified by two project-independent researchers.
In detail, category selection was performed using the following criteria: First, nouns describing subordinate and superordinate categories were excluded in favor of basic-level categories (for example, “terrier” and “animal” were excluded in favor of “dog”). Second, only single-word concepts were included as candidates; separated compound nouns (e.g., “sail boat,” “fire truck,” etc.) were not treated as their own entities, as these are often part of a basic-level category (in the previous example “boat” and “truck,” respectively). Third, we excluded nouns describing object parts (e.g., “wheel,” “roof,” or “hand”), as they constitute parts of objects in other basic-level categories, thereby rendering the image categories ambiguous. Moreover, although the human brain exhibits visual areas that appear uniquely selective to certain categories, such as body parts [faces, hands, etc. (5)], such selectivity should ideally emerge as a result of network training according to an externally defined objective. Including them as explicit training targets would preclude analyses of such emergent phenomena. Fourth, synonyms were combined into a single category (e.g., “automobile” and “car” were merged into a single “car” category). The resulting set of nouns describes basic-level categories for which the resulting images can be ascribed to a single category, as commonly required in one-hot encoded deep learning applications.
The final set of ecoset categories is distinctly different from the category selection of ILSVRC. First, ecoset focuses on basic-level categories rather than category labels from various levels of categorical abstraction. Second, only 24% of categories in ecoset have a matching ILSVRC category. As a more conservative estimate, we also compared across category levels by including all WordNet hyponyms of each ecoset category (e.g., counting the ILSVRC category “Brittany spaniel” as a match to ecoset’s “dog”). Please note that this match across category levels (i.e., matching basic-level ecoset categories to subordinate categories in ILSVRC) is quite conservative, as the underlying categorization task is different. Nevertheless, we find only 16% of ecoset categories to have a matching WordNet hyponym in ILSVRC.
Most images (∼94%) were sourced from the ImageNet database [of which the well-known ILSVRC 2012 dataset with its 1,000 object categories is a subset (4)]. To compute the actual image-based overlap between ecoset and ILSVRC, we ran an analysis similar to the one used for duplicate removal, as described in detail below, across both datasets (ecoset and ILSVRC). We find that only 12.7% of images in ecoset also appear in ILSVRC 2012, indicating little overlap between the two datasets. To find images matching a given ecoset category, we used the ImageNet web interface to manually search for appropriate WordNet synsets to be included. Multiple synsets could be selected as sources for a given category.
As additional resources for finding images, we used Bing and Flickr image searches based on the category names, synonyms, and their translations into other languages (French, Spanish, Italian, and German). Image search via Flickr and Bing was constrained to images under CC BY-NC-SA 2.0 license. For the Flickr application programming interface (API), we chose option one (NonCommercial-ShareAlike License), and for the Bing API we chose the option “share,” both referring to CC BY-NC-SA 2.0. In the final ecoset dataset, 5.1% of images were obtained via Bing and 1.4% were obtained via Flickr.
To maximize the probability that all images in the ecoset dataset are unique, a duplicate removal procedure was implemented. This was designed to spot not only exact duplicates but also more subtle variations, including different sizes or different aspect ratios. Duplicate removal was performed for each category separately. First, we cropped the center square of all images of the category, resized them to 128 × 128 pixels, and performed a principal component analysis (PCA) preserving 90% of the variance across all images of that category. The similarity of all image pairs was computed based on a Pearson correlation between their respective PCA component loadings. Based on 10 example categories, we established a cutoff value above which a pair of images was labeled as duplicate (Pearson r > 0.975). If multiple duplicates of an image existed within a category, only the image with the largest resolution was kept for ecoset.
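The per-category duplicate search described above can be sketched as follows, using NumPy. This is an illustration under several assumptions that the text does not specify: images are treated as 2-D grayscale arrays, resizing uses nearest-neighbor sampling, PCA is computed via a singular value decomposition of the mean-centered pixel matrix, and images linked by any above-threshold pair are grouped together before keeping the largest one. All function names are hypothetical. The demo uses a 16 × 16 working size instead of 128 × 128 only to keep it fast.

```python
import numpy as np


def center_square_resize(img, size=128):
    """Crop the central square of a 2-D array and resize it (nearest neighbor)."""
    h, w = img.shape
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    square = img[top:top + s, left:left + s]
    idx = np.arange(size) * s // size  # source row/column for each output pixel
    return square[np.ix_(idx, idx)].astype(float)


def pca_scores(X, var_kept=0.90):
    """Project rows of X onto the leading components explaining var_kept variance."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(S ** 2) / np.sum(S ** 2)
    k = int(np.searchsorted(explained, var_kept)) + 1
    return U[:, :k] * S[:k]


def find_duplicates(images, size=128, var_kept=0.90, r_cut=0.975):
    """Return index pairs (i, j), i < j, whose PCA scores correlate above r_cut."""
    X = np.stack([center_square_resize(im, size).ravel() for im in images])
    R = np.corrcoef(pca_scores(X, var_kept))  # image-by-image Pearson r
    i, j = np.where(np.triu(R > r_cut, k=1))
    return list(zip(i.tolist(), j.tolist()))


def keep_largest(images, pairs):
    """Group linked images (union-find) and keep the highest-resolution member."""
    parent = list(range(len(images)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in pairs:
        parent[find(a)] = find(b)
    best = {}
    for idx, im in enumerate(images):
        root = find(idx)
        if root not in best or im.size > images[best[root]].size:
            best[root] = idx
    return sorted(best.values())


# Demo on synthetic noise images: image 20 is a 2x upscaled copy of image 0.
rng = np.random.default_rng(0)
images = [rng.random((64, 64)) for _ in range(20)]
images.append(np.kron(images[0], np.ones((2, 2))))

pairs = find_duplicates(images, size=16)
kept = keep_largest(images, pairs)
```

Note that the Pearson correlation is only informative when more than two components are retained, so very small or very homogeneous categories would need separate handling in practice.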
We performed a manual image inspection procedure to ensure that the ecoset images were correctly classified. All images sourced via Bing and Flickr (97,379 images in total) were visually inspected, and misclassified instances were removed. For images obtained via ImageNet, we visually inspected 100 randomly sampled instances from each ecoset category. If more than four of those 100 images were found to be misclassifications, the whole category was manually cleaned. Otherwise, all images were included. As a result of this cleaning procedure, we expect the error rate of all ecoset categories to be lower than 4%.
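The sampling rule for ImageNet-sourced categories can be written down as a short sketch. The visual judgment itself was made by human inspectors; here it is represented by a hypothetical callback `is_misclassified`, and the function name, the seed, and the toy categories are likewise assumptions for illustration.

```python
import random


def audit_category(image_ids, is_misclassified, sample_size=100, max_errors=4, seed=0):
    """Inspect a random sample and report whether full manual cleaning is needed.

    Returns (needs_cleaning, n_errors). Cleaning is triggered only when the
    number of misclassified images in the sample exceeds max_errors.
    """
    rng = random.Random(seed)
    sample = rng.sample(image_ids, min(sample_size, len(image_ids)))
    n_errors = sum(1 for img in sample if is_misclassified(img))
    return n_errors > max_errors, n_errors


# Toy categories: one without errors, one in which every second image is wrong.
ids = list(range(700))
clean_flag, clean_errors = audit_category(ids, lambda i: False)
noisy_flag, noisy_errors = audit_category(ids, lambda i: i % 2 == 0)
```

Because the threshold is strict (more than four errors), a category with exactly four misclassified images in the sample is accepted without further cleaning.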
Due to the large-scale sampling of images via the web required for ecoset, some of the images used to train the DNN models contained nudity. These images were removed in creating the publicly available version of ecoset to allow for more straightforward adoption by all community members. Images were marked for removal if the probability of containing not-safe-for-work (NSFW) material exceeded 0.8, estimated using a DNN trained for NSFW detection [Yahoo (45), https://github.com/yahoo/open_nsfw]. Note that only 118 (out of >1.5 million) images had to be removed.
Ecoset and ILSVRC 2012 differ in the number of categories (565 versus 1,000) and in the distribution of the number of images per category. These differences might confound their ability to predict neural data. To control for this possibility, we created “trimmed” versions of both datasets that are identical in the number of categories and the distribution of the number of images per category. For this, we selected all 565 categories from ecoset and a subset of 565 randomly chosen categories from ILSVRC 2012. To hold the number of images per category equal across trimmed image sets, while retaining the maximally possible number of images, the following procedure was implemented. First, we ordered the 565 categories of ecoset and of the ILSVRC 2012 subset according to category size and paired the categories from the sorted lists across image sets (e.g., pairing the largest category of ecoset with the largest category of ILSVRC). For each category pair, one from each dataset, we then selected the larger category and randomly removed images to match the number of images in the smaller category. As a result, trimmed ecoset and trimmed ILSVRC both contain 565 categories and follow the same distribution of category sizes, with between 600 and 1,300 images per category in the respective training sets.
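The trimming procedure can be sketched in a few lines of Python. Datasets are represented here as dictionaries mapping a category name to a list of image identifiers; this representation, the function names, the seed, and the toy data are assumptions for illustration.

```python
import random


def trim_to_match(dataset_a, dataset_b, seed=0):
    """Equalize category count and per-category image counts of two datasets."""
    rng = random.Random(seed)
    n = min(len(dataset_a), len(dataset_b))

    def ranked_subset(ds):
        names = sorted(ds)
        if len(names) > n:
            names = rng.sample(names, n)  # random subset of categories
        # Largest category first.
        return sorted(names, key=lambda c: len(ds[c]), reverse=True)

    def downsample(images, m):
        # Leave the smaller category untouched; randomly thin the larger one.
        return list(images) if len(images) == m else rng.sample(images, m)

    trimmed_a, trimmed_b = {}, {}
    # Pair categories by size rank and cut each pair to the smaller size.
    for cat_a, cat_b in zip(ranked_subset(dataset_a), ranked_subset(dataset_b)):
        m = min(len(dataset_a[cat_a]), len(dataset_b[cat_b]))
        trimmed_a[cat_a] = downsample(dataset_a[cat_a], m)
        trimmed_b[cat_b] = downsample(dataset_b[cat_b], m)
    return trimmed_a, trimmed_b


# Toy datasets with 2 and 3 categories.
set_a = {"x": [f"x{i}" for i in range(5)], "y": [f"y{i}" for i in range(3)]}
set_b = {
    "p": [f"p{i}" for i in range(4)],
    "q": [f"q{i}" for i in range(2)],
    "r": [f"r{i}" for i in range(6)],
}

trim_a, trim_b = trim_to_match(set_a, set_b)
sizes_a = sorted(len(v) for v in trim_a.values())
sizes_b = sorted(len(v) for v in trim_b.values())
```

Pairing by size rank before downsampling is what keeps the number of discarded images small: each category is only ever cut down to the size of its similarly ranked counterpart.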
As stated above, the category selection of ecoset was based on human concreteness ratings and word frequencies in a corpus consisting of American television and film subtitles. This undoubtedly biases the category selection toward Western cultures. Image inclusion was based on the availability via Bing/Flickr search results as well as the existence of relevant ImageNet categories. Images depicting people, specifically the categories “man,” “woman,” and “child,” were not sampled according to census distributions (age, ethnicity, gender, etc.). Moreover, ecoset image and category distributions do not reflect the naturalistic, egocentric visual input typically encountered in the everyday life of infants and adults (46, 47).