Semantic segmentation refers to the classification of each pixel of an image by a deep learning model, which separates and labels the different types of objects in the picture. The input image, the ground truth, and the network output typically have the same size. The most representative semantic segmentation models include UNet, SegNet, FCN, PSPNet, and DeepLab [20,21,22,23].
DeepLab is a family of semantic segmentation models that perform well on public datasets such as PASCAL VOC; among them, DeepLabV3+ is currently the most effective variant. We conducted experimental comparisons of several semantic segmentation models on the kiwi leaf dataset, including DeepLabV1, DeepLabV2, DeepLabV3, and DeepLabV3+, and ultimately achieved a significant improvement in accuracy.
DeepLabV1 is an improvement of the VGG network: it fuses multi-level information by connecting convolutional layers after the max-pooling layers. DeepLabV2 mainly introduces atrous spatial pyramid pooling (ASPP) on the basis of DeepLabV1 to enhance the model's ability to recognize objects of the same category at different scales. On the basis of DeepLabV2, DeepLabV3 adds atrous convolutions at different rates to the back end of the model and introduces batch normalization into ASPP. DeepLabV3+ adjusts the structure of DeepLabV3 to form an encoder and decoder similar to U-Net, allowing the model to achieve better results at segmentation edges; a modified Xception backbone is then introduced to enhance the robustness of the model's classifier [24]. The DeepLabV3+ model structure is shown in Figure 3.
DeepLabV3+ network structure diagram, where "…" denotes omitted repeated layers.
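As an illustration of the ASPP component described above, the following is a minimal PyTorch sketch (not the authors' implementation): parallel atrous convolutions at several dilation rates, a global-pooling branch, and batch normalization in each branch, as introduced in DeepLabV3. Channel counts and rates are illustrative defaults.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Simplified Atrous Spatial Pyramid Pooling (ASPP) sketch.

    Parallel atrous convolutions at different rates capture context at
    multiple scales; BatchNorm follows each branch (added in DeepLabV3).
    """
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList()
        # 1x1 convolution branch
        self.branches.append(nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)))
        # 3x3 atrous branches at increasing dilation rates
        for r in rates:
            self.branches.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)))
        # image-level (global average pooling) branch
        self.pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # 1x1 projection after concatenating all branches
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [b(x) for b in self.branches]
        gp = F.interpolate(self.pool(x), size=(h, w),
                           mode="bilinear", align_corners=False)
        feats.append(gp)
        return self.project(torch.cat(feats, dim=1))
```

Because all atrous branches use `padding == dilation` with a 3×3 kernel, every branch preserves the spatial resolution of the input feature map, so the outputs can be concatenated directly.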
Axial–DeepLabV3+ introduces the Axial–Attention module into DeepLabV3+ to achieve a stronger attention mechanism while keeping the number of parameters within an acceptable range [25]. Therefore, in the experiment, we introduce this module into DeepLabV3+ to increase the model's attention to the injured area and to ensure the accuracy of the model's identification. The schematic diagram of the module is shown in Figure 4.
The Axial–Attention module.
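To make the factorization idea concrete, here is a minimal single-head sketch of axial attention (assumptions: no learned positional encodings and no multi-head split, both of which the original Axial-Attention uses). Full 2-D self-attention over an H×W map costs O((HW)²); attending along the height axis and then the width axis reduces this to O(HW(H+W)).

```python
import torch
import torch.nn as nn

class AxialAttention(nn.Module):
    """Simplified axial self-attention sketch: 2-D attention factorized
    into a height-axis pass followed by a width-axis pass."""
    def __init__(self, channels):
        super().__init__()
        # single 1x1 conv produces queries, keys, and values
        self.qkv = nn.Conv2d(channels, channels * 3, 1, bias=False)
        self.scale = channels ** -0.5

    def _attend(self, q, k, v):
        # q, k, v: (batch, length, channels); attention along `length`
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

    def forward(self, x):
        b, c, h, w = x.shape
        # height-axis pass: each column attends independently
        q, k, v = self.qkv(x).chunk(3, dim=1)
        qh = q.permute(0, 3, 2, 1).reshape(b * w, h, c)
        kh = k.permute(0, 3, 2, 1).reshape(b * w, h, c)
        vh = v.permute(0, 3, 2, 1).reshape(b * w, h, c)
        x = self._attend(qh, kh, vh).reshape(b, w, h, c).permute(0, 3, 2, 1)
        # width-axis pass: each row attends independently
        q, k, v = self.qkv(x).chunk(3, dim=1)
        qw = q.permute(0, 2, 3, 1).reshape(b * h, w, c)
        kw = k.permute(0, 2, 3, 1).reshape(b * h, w, c)
        vw = v.permute(0, 2, 3, 1).reshape(b * h, w, c)
        x = self._attend(qw, kw, vw).reshape(b, h, w, c).permute(0, 3, 1, 2)
        return x
```

Chaining the two passes lets every output position aggregate information from its entire row and column, approximating global attention at a fraction of the cost.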
We tried three backbone classifiers: Xception, which is proposed with DeepLabV3+, as well as MobileNet [26] and ResNet101, and compared the resulting models in terms of parameter count and accuracy in order to determine the optimal classifier for improving injury recognition accuracy.
The backbone selection process also serves as an accuracy comparison with direct image classification algorithms, reflecting the performance of the semantic segmentation model for classifying plant leaf injuries, while at the same time exploring the best method for plant leaf injury recognition and classification.
Choosing a suitable loss function helps improve the accuracy of the model. The experiment compares two loss functions: Focal Loss and Dice Loss.
(1) Focal Loss: In the classification process, the background class is often easy to classify, whereas distinguishing the different types of injuries is difficult, so the classification difficulty varies across samples [27], which makes Focal Loss suitable for optimization. When the number of negative samples is large, they account for most of the total loss, and most of them are easy to classify, so the optimization direction of the model deviates from what is expected. Focal Loss controls both the weight shared by positive and negative samples and the weights of easy-to-classify and difficult-to-classify samples. The optimized loss function is as follows, where $p_t$ represents the predicted probability of the true class, and $\gamma$ and $\alpha_t$ are two factors added to the standard cross-entropy loss function:

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
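A minimal binary focal loss sketch in PyTorch (the source does not give the authors' implementation; default values α = 0.25 and γ = 2 follow the original Focal Loss paper):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss sketch.

    (1 - p_t)**gamma down-weights easy samples; alpha balances the
    contribution of positive vs. negative samples.
    `logits` and `targets` have the same shape; targets are in {0, 1}.
    """
    p = torch.sigmoid(logits)
    # per-element cross entropy, computed stably from logits
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)       # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```

With γ = 0 and α = 0.5 this reduces (up to a constant) to ordinary cross entropy; increasing γ shrinks the loss contribution of confidently classified pixels such as the easy background class.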
(2) Dice Loss: There is a large gap between the proportions of background and injury pixels, which makes optimization through Dice Loss applicable [28]. Dice Loss can be defined as follows, where $|A \cap B|$ represents the number of common elements of A and B, and $|A|$ and $|B|$ represent the number of elements in each set:

$$Dice\ Loss = 1 - \frac{2\,|A \cap B|}{|A| + |B|}$$
It can increase the impact of leaf injury area on the loss function, thereby increasing the accuracy, robustness and applicability of the model.
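A soft (differentiable) Dice loss can be sketched as follows; this is an illustrative implementation, not the authors' code, and the smoothing term `eps` is an assumption added to avoid division by zero on empty masks:

```python
import torch

def dice_loss(probs, targets, eps=1.0):
    """Soft Dice loss sketch: 1 - 2|A∩B| / (|A| + |B|).

    `probs` holds predicted foreground probabilities, `targets` the
    binary ground-truth mask; both are (batch, ...) tensors.
    """
    probs = probs.flatten(1)
    targets = targets.flatten(1)
    inter = (probs * targets).sum(dim=1)              # soft |A ∩ B|
    dice = (2 * inter + eps) / (probs.sum(dim=1) + targets.sum(dim=1) + eps)
    return 1 - dice.mean()
```

Because the loss depends only on the overlap ratio rather than per-pixel counts, small injury regions contribute as strongly as the large background, which is why it suits the class-imbalanced leaf masks described above.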
If the learning rate is too large, the optimization oscillates back and forth as it converges toward the global optimum; the learning rate should therefore be continuously reduced as the epochs grow. Shrinking the step size of the gradient updates in this way yields more stable and accurate training results.
Commonly used learning rate decay strategies include exponential decay, natural exponential decay, cosine decay, etc. We applied noisy linear cosine decay, which is often used in reinforcement learning, to semantic segmentation model training to study its role in the field of computer vision [29].
Noisy linear cosine decay adds noise to the decay process on the basis of linear cosine decay, which increases the randomness and, to some extent, the chance of finding the optimal learning rate. It is also an improvement on cosine decay, and its calculation formula is as follows, where $\epsilon_t$ stands for the random noise factor, and $\alpha$ and $\beta$ stand for factors controlling the gradual decline of the learning rate. Equation (2) represents noisy linear decay. $lr_t$, $lr_0$, and $lr_{min}$ stand for the learning rate of the current epoch, the initial learning rate, and the minimum learning rate, respectively; $T$ stands for the maximum epoch:

$$lr_t = \max\!\left(lr_0 \left[\left(\alpha + \frac{T - t}{T} + \epsilon_t\right)\cdot \frac{1}{2}\left(1 + \cos\frac{\pi t}{T}\right) + \beta\right],\; lr_{min}\right) \tag{2}$$
Based on the above analysis, we propose a novel two-stage leaf disease recognition algorithm. The algorithm flow chart is shown in Figure 5.
Overall processing flow of the network.
The training of the model was completed using the Windows 10 operating system and the PyTorch framework. The CPU of the test equipment was an Intel® Core™ i9-10900K @ 3.70 GHz, the GPU was an RTX 5000 with 16 GB of memory, and the software environment was CUDA 10.1, cuDNN 7.6, and Python 3.7. All experiments were trained with default parameters.
This paper introduces Precision (P), Recall (R), and mean Average Precision (mAP) to evaluate the performance of the kiwi defect detection model. The expressions of P and R are as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
Here, TP (true positives), FP (false positives), and FN (false negatives) respectively denote positive samples classified correctly, negative samples classified incorrectly as positive, and positive samples classified incorrectly as negative.
AP is the average precision, which is the integral of P with respect to R, i.e., the area under the P–R curve; mAP is the mean average precision, obtained by averaging the AP values over all categories. They are defined as follows:

$$AP = \int_0^1 P(R)\, dR, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
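These metrics can be sketched directly from detection counts; the trapezoid-rule approximation of the area under the P–R curve below is illustrative (detection benchmarks typically use interpolated variants):

```python
def precision_recall(tp, fp, fn):
    """Precision P = TP/(TP+FP) and recall R = TP/(TP+FN)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(precisions, recalls):
    """AP sketch: area under the P-R curve via the trapezoid rule
    over (recall, precision) points sorted by recall."""
    pts = sorted(zip(recalls, precisions))
    ap = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        ap += (r1 - r0) * (p0 + p1) / 2
    return ap

def mean_average_precision(aps):
    """mAP: the mean of the per-category AP values."""
    return sum(aps) / len(aps)
```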