In order to create a system that can accurately estimate the energy and macronutrient intake from the RGB-D images captured before and after meal consumption, a segmentation network needs to be trained first. A segmentation network receives a single RGB image as its input and partitions it into different elements. In this project, the images need to be segmented into the six food types (soup, meat/fish, side dish, sauce, vegetables/salad, dessert), the four plate types (round plate, soup bowl, square bowl, glass), and the background. To obtain the ground truth of the segmentation (GTseg) for the 332 images, we used a semi-automatic segmentation tool developed by our team (Figure 5). The tool automatically provides a segmentation mask for each image, which can then be refined and adjusted by the user.
Figure 5. The segmentation tool used to provide the ground truth of the segmentation (GTseg): (a) the interface of the segmentation tool; (b) the semi-automatic segmentation; (c) the segmented plates (top) and food types (bottom) of the images.
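For reference, the two label sets described above can be written out as explicit class maps. The sketch below (in Python) lists the classes as named in the text; the integer indices assigned to them are our own assumption and are not taken from the original annotation tool.

```python
# Hypothetical integer encodings for the two segmentation tasks.
# Class names follow the text; the index assignments are an assumption.
PLATE_CLASSES = {
    0: "background",
    1: "round plate",
    2: "soup bowl",
    3: "square bowl",
    4: "glass",
}

FOOD_CLASSES = {
    0: "background",
    1: "plate",
    2: "soup",
    3: "meat/fish",
    4: "side dish",
    5: "sauce",
    6: "vegetables/salad",
    7: "dessert",
}
```

These two maps match the output sizes reported below: five channels for the plate segmentation and eight channels for the food segmentation.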
For the segmentation network, we experimented with two architectures: the Pyramid Scene Parsing Network (PSPNet) [27] and DeepLabv3 [28].
For the PSPNet, a convolutional neural network (CNN) was first used to extract the feature map of the image (of size 30 × 40 × 2048). For the CNN, we used either the ResNet50 [29] architecture (ResNet + PSPNet) or a simple encoder with five stacks of convolutional layers in a row (Encoder + PSPNet). A pyramid parsing module was then applied to the feature map to extract features at four different scales (1 × 1 × 2048, 2 × 2 × 2048, 3 × 4 × 2048, and 6 × 8 × 2048). The four new feature maps were then upsampled to 30 × 40 × 512 and concatenated with the original feature map. A deconvolutional layer was applied to resize the maps to the original size of the image. This procedure was implemented twice, for the plate and the food segmentation, and the corresponding outputs were of size 480 × 640 × 5 (four plate types and background) and 480 × 640 × 8 (six food types, plate, and background).
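As an illustration of this step, a minimal PyTorch sketch of the pyramid parsing module is given below. The bin sizes and channel widths follow the dimensions quoted above (a 30 × 40 × 2048 feature map, pooled at 1 × 1, 2 × 2, 3 × 4, and 6 × 8 and reduced to 512 channels); details such as the normalisation layers and the exact branch structure are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidPoolingModule(nn.Module):
    """Pools the backbone feature map at four scales, reduces each pooled
    map to 512 channels, upsamples back to the input resolution and
    concatenates the result with the original feature map."""

    def __init__(self, in_channels=2048, reduced=512,
                 bins=((1, 1), (2, 2), (3, 4), (6, 8))):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin_size),
                nn.Conv2d(in_channels, reduced, kernel_size=1, bias=False),
                nn.BatchNorm2d(reduced),
                nn.ReLU(inplace=True),
            )
            for bin_size in bins
        ])

    def forward(self, x):                       # x: (N, 2048, 30, 40)
        h, w = x.shape[2:]
        pooled = [
            F.interpolate(branch(x), size=(h, w),
                          mode="bilinear", align_corners=False)
            for branch in self.branches
        ]
        # (N, 2048 + 4 * 512, 30, 40) = (N, 4096, 30, 40)
        return torch.cat([x] + pooled, dim=1)
```

The concatenated map (2048 + 4 × 512 = 4096 channels) is then passed to the deconvolutional layer that restores the 480 × 640 resolution, once for each of the two segmentation heads.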
Similarly, for the DeepLabv3, the ResNet50 CNN was used to extract the feature map. An Atrous Spatial Pyramid Pooling (ASPP) module is then applied on top of the feature map, which performs (a) a 1 × 1 convolution, (b) three 3 × 3 convolutions with different dilation rates (the dilation rate adapts the field of view of the convolution), and (c) an image-pooling step to include global information. The results are then concatenated, convolved with a 1 × 1 kernel, and upsampled to obtain the two outputs for the plate and the food segmentation.
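A corresponding PyTorch sketch of the ASPP step is shown below. The dilation rates (6, 12, 18) and the 256-channel width are common DeepLabv3 defaults and are our assumptions; they are not values reported in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: a 1x1 convolution, three 3x3 atrous
    convolutions with different dilation rates, and a global image-pooling
    branch, concatenated and fused with a 1x1 convolution."""

    def __init__(self, in_channels=2048, out_channels=256, rates=(6, 12, 18)):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.atrous = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, 3,
                      padding=r, dilation=r, bias=False)
            for r in rates
        ])
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
        )
        self.project = nn.Conv2d(out_channels * 5, out_channels, 1, bias=False)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [self.conv1x1(x)] + [conv(x) for conv in self.atrous]
        # Global image pooling, broadcast back to the feature-map size.
        feats.append(F.interpolate(self.image_pool(x), size=(h, w),
                                   mode="bilinear", align_corners=False))
        return self.project(torch.cat(feats, dim=1))
```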
Both networks were trained with the Adadelta optimiser and a batch size of 8 for up to 100 epochs. We also experimented with adding and removing the plate-segmentation module from the architecture.
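The stated training configuration could look as follows; the model and dataset interfaces (a model returning food and plate logits, a dataset yielding an image with its two GTseg masks) and the cross-entropy loss are placeholders assumed for illustration.

```python
import torch
from torch.utils.data import DataLoader


def train_segmentation(model, train_set, epochs=100, batch_size=8,
                       with_plate_head=True, device="cpu"):
    """Minimal training loop using the stated hyper-parameters:
    Adadelta optimiser, batch size 8, up to 100 epochs."""
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimiser = torch.optim.Adadelta(model.parameters())
    criterion = torch.nn.CrossEntropyLoss()

    model.to(device).train()
    for _ in range(epochs):
        for images, food_gt, plate_gt in loader:
            food_logits, plate_logits = model(images.to(device))
            loss = criterion(food_logits, food_gt.to(device))
            if with_plate_head:   # dropped when the plate module is removed
                loss = loss + criterion(plate_logits, plate_gt.to(device))
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return model
```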
From the 332 images in total, 292 were used for training the segmentation network and 40 images (20 before and 20 after consumption) for testing. For the segmentation network, we compared the PSPNet and DeepLabv3 architectures with ResNet50 as the backbone (with and without the module for plate segmentation) and the PSPNet with the simple encoder as the backbone network.

To evaluate the performance of the segmentation network, the following metrics were used: (a) the mean Intersection over Union (mIoU), which is the mean over the food categories of the intersection of the GTseg pixels and the predicted pixels divided by their union (1); (b) the accuracy of the segmentation, which is the intersection of the GTseg and the predicted pixels divided by the number of GTseg pixels for each food category (2); (c) the index (3) that represents the worst-performing food category; and (d) the index (4) that represents the average food-category performance, both computed from a segmentation S to a segmentation T (where S and T are the GTseg and the predicted segmentation masks). Each index is computed in both directions (from S to T and from T to S) to estimate the total minimum and total average indexes (5), which are the harmonic means of the directional minimum and average indexes, respectively.
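For the first two metrics, a small sketch of how the per-category IoU (1) and accuracy (2) can be computed from the GTseg and predicted masks is given below; the directional worst-case and average indexes (3)-(5) are not reproduced here, since their exact formulation is given by the referenced equations. Averaging only over the categories present in the ground truth is an assumption of this sketch.

```python
import numpy as np


def miou_and_accuracy(gt_mask, pred_mask, num_classes):
    """Mean IoU (1) and mean per-category accuracy (2) for two integer
    label maps of shape (H, W)."""
    ious, accs = [], []
    for c in range(num_classes):
        gt_c = gt_mask == c
        pred_c = pred_mask == c
        if not gt_c.any():
            continue  # skip categories absent from the ground truth
        intersection = np.logical_and(gt_c, pred_c).sum()
        union = np.logical_or(gt_c, pred_c).sum()
        ious.append(intersection / union)       # (1)
        accs.append(intersection / gt_c.sum())  # (2)
    return float(np.mean(ious)), float(np.mean(accs))
```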