3.6. Components of the YOLOv4 Models

In this study, the authors compared the performance of YOLOv4, YOLOv4-tiny and YOLOv4-CSP in detecting pear fruits. Table 3 shows the differences among these models in terms of their architectural components. This section discusses in more detail how the elements of each model contribute to its respective characteristics.

Table 3. Comparison of YOLOv4, YOLOv4-CSP and YOLOv4-tiny in terms of their architectural elements.

Cross-Stage Partial (CSP) connection is a technique for reducing computational complexity that originates from CSPNet [22]. "CSP-izing" a network divides the feature map of the base layer into two parts and then merges them through a transition → concatenation → transition sequence (see Figure 4). CSP-ization improves accuracy and reduces inference time by truncating the gradient flow [4,22]. It also makes the model scalable. For these reasons, CSP connections were incorporated into the backbones of the YOLOv4 models. CSPDarknet53 was chosen as the YOLOv4 backbone despite having lower accuracy in image classification than CSPResNeXt50 [3]. The following paragraphs explain why.

Figure 4. Cross-Stage Partial Connection Block in YOLOv4-CSP.
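To make the split → transition → concatenation → transition structure concrete, the following is a minimal sketch of a CSP-style block in PyTorch. The channel split ratio, the inner computation block and the layer widths are illustrative assumptions, not the exact CSPDarknet53 configuration.

```python
# Minimal sketch of a CSP-style block (assumed layer sizes, not CSPDarknet53).
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        # Split the base-layer feature map into two parts via 1x1 convolutions:
        # part 1 bypasses the heavy computation, part 2 goes through it.
        self.part1_conv = nn.Conv2d(channels, half, kernel_size=1, bias=False)
        self.part2_conv = nn.Conv2d(channels, half, kernel_size=1, bias=False)
        # Placeholder for the residual/dense computation applied to part 2.
        self.blocks = nn.Sequential(
            nn.Conv2d(half, half, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(half),
            nn.ReLU(inplace=True),
        )
        self.transition1 = nn.Conv2d(half, half, kernel_size=1, bias=False)          # transition
        self.transition2 = nn.Conv2d(channels, channels, kernel_size=1, bias=False)  # transition after concat

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p1 = self.part1_conv(x)                                   # shortcut part
        p2 = self.transition1(self.blocks(self.part2_conv(x)))    # transition
        out = torch.cat([p1, p2], dim=1)                          # concatenation
        return self.transition2(out)                              # final transition

if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 32)
    print(CSPBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```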

Although CSPResNeXt50 performs better in image classification, this advantage does not carry over to object detection. CSP-ization of Darknet53 leads to higher accuracy in object detection for the following reasons [4]:

A larger input network size, which enables the detection of more small-sized objects.

More 3 × 3 convolutional layers, which provide a larger receptive field to cover the increased input network size.

A larger number of parameters, which gives greater capacity to detect multiple objects of different sizes in a single image.

Other than CSP-ization, several techniques were used to improve the performance of CSPDarknet53 without increasing the computational requirements: (1) data augmentation techniques such as CutMix [39] and Mosaic [3], (2) DropBlock [40] as a regularization method and (3) class label smoothing [3] (a minimal sketch is given below). In addition, the following techniques were used to remove the need for expensive GPUs during training: (1) Mish [24] as the activation function (further explained in Section 3.6.4) and (2) multi-input weighted residual connections [41].
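As an example of one of these techniques, the sketch below shows how class label smoothing can be implemented. The helper name smooth_labels and the smoothing factor of 0.1 are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch of class label smoothing (assumed smoothing factor of 0.1).
import torch
import torch.nn.functional as F

def smooth_labels(targets: torch.Tensor, num_classes: int, eps: float = 0.1) -> torch.Tensor:
    """Convert hard class indices into smoothed one-hot targets."""
    one_hot = F.one_hot(targets, num_classes).float()
    return one_hot * (1.0 - eps) + eps / num_classes

if __name__ == "__main__":
    y = torch.tensor([0, 2])       # hard labels for a 3-class problem
    print(smooth_labels(y, 3))     # true class -> ~0.933, others -> ~0.033
```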

For YOLOv4-tiny, the goal is to make the computations efficient and fast without sacrificing much accuracy. Thus, one-shot aggregation (OSA) (shown in Figure 5), which is derived from VoVNet [42], was implemented between the computation modules of YOLOv4-tiny's backbone, CSPOSANet, to lower the computational complexity. This reduces the model size and the number of parameters by removing an excess amount of duplicate gradient information. A Leaky Rectified Linear Unit was used as the activation function in CSPOSANet because of its faster convergence [23].

Figure 5. One-Shot Aggregation (OSA).
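The sketch below illustrates the OSA idea in PyTorch: every intermediate 3 × 3 convolution output is kept and aggregated only once at the end through concatenation followed by a 1 × 1 convolution. The depth and channel counts are illustrative assumptions rather than the actual CSPOSANet configuration.

```python
# Minimal sketch of a One-Shot Aggregation (OSA) module (assumed sizes).
import torch
import torch.nn as nn

class OSABlock(nn.Module):
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, num_convs: int = 3):
        super().__init__()
        self.convs = nn.ModuleList()
        ch = in_ch
        for _ in range(num_convs):
            self.convs.append(nn.Sequential(
                nn.Conv2d(ch, mid_ch, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(mid_ch),
                nn.LeakyReLU(0.1, inplace=True),   # Leaky ReLU, as used in CSPOSANet
            ))
            ch = mid_ch
        # Single aggregation of the input and all intermediate outputs.
        self.aggregate = nn.Conv2d(in_ch + num_convs * mid_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for conv in self.convs:
            x = conv(x)
            features.append(x)
        return self.aggregate(torch.cat(features, dim=1))  # one-shot aggregation

if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 32)
    print(OSABlock(64, 32, 128)(x).shape)  # torch.Size([1, 128, 32, 32])
```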

The Leaky Rectified Linear Unit (Leaky ReLU) is a modified version of ReLU; unlike ReLU, it allows a small nonzero gradient over its entire domain (Figure 6). Deep neural networks using Leaky ReLU were found to converge slightly faster than those using ReLU. Compared with its more recent counterparts Swish and Mish, Leaky ReLU is slightly less accurate but has lower standard deviations [24]. Leaky ReLU also performs better at IoU thresholds below 75% and with large objects, and it has a lower computational cost owing to its lower complexity [24].

Figure 6. Activation functions. (Left) Rectified Linear Unit (ReLU); (Center) Leaky ReLU; (Right) Mish.

Mish, on the other hand, is a smooth, continuous, self-regularized, nonmonotonic activation function that produces smoother loss landscapes, which makes optimization easier and generalization better. It has a wider minimum and can therefore reach a lower loss. Because of these benefits, neural networks implementing Mish achieve higher accuracy and lower standard deviations in object detection. Moreover, Mish retains the property of its predecessors (Swish and Leaky ReLU) of being unbounded above and bounded below. The former avoids saturation (which generally slows down training), whereas the latter results in stronger regularization effects (helping the model fit properly).
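For reference, the three activation functions discussed above can be written out explicitly as follows. These are the standard textbook definitions rather than code from the paper, and the Leaky ReLU slope of 0.1 is an assumed value.

```python
# Explicit definitions of ReLU, Leaky ReLU and Mish (standard formulas).
import torch
import torch.nn.functional as F

def relu(x: torch.Tensor) -> torch.Tensor:
    return torch.clamp(x, min=0.0)               # max(0, x); zero gradient for x < 0

def leaky_relu(x: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    return torch.where(x > 0, x, alpha * x)      # small nonzero slope for x < 0

def mish(x: torch.Tensor) -> torch.Tensor:
    # x * tanh(softplus(x)): smooth, nonmonotonic, unbounded above, bounded below
    return x * torch.tanh(F.softplus(x))

if __name__ == "__main__":
    x = torch.linspace(-3.0, 3.0, 7)
    print(relu(x), leaky_relu(x), mish(x), sep="\n")
```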

Thus, Leaky ReLU is more suitable when the goal is to maximize speed without sacrificing much accuracy, whereas Mish is the better option when accuracy must be maximized. Table 4 summarizes the activation function used in each YOLOv4 model and its corresponding effects.

Table 4. Summary of the activation function used in each model and the reason why it was chosen.

Path aggregation (Figure 7), originally proposed by Liu et al. [25], was used as the neck of YOLOv4 and YOLOv4-CSP in place of the FPN used in YOLOv3. This technique aggregates parameters from different backbone levels for the different detector levels through bottom-up path augmentation and adaptive feature pooling. Bottom-up path augmentation shortens the information path and enhances the feature pyramid by making fine-grained, localized information available to the top layers (the classifiers). Adaptive feature pooling, on the other hand, recovers the broken information path between each proposal and all feature levels (creating cleaner paths) and fuses the information from the different layers using an element-wise max operation. PANet thus ensures that important features are not lost, which is why it was chosen as the neck of YOLOv4 and YOLOv4-CSP.

Figure 7. Architecture of PANet, which inspired the path aggregation in YOLOv4's neck. (a) FPN backbone; (b) bottom-up path augmentation; (c) adaptive feature pooling; (d) box branch; (e) fully-connected fusion (concatenation is done instead of addition for YOLOv4).
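The sketch below illustrates the bottom-up path augmentation step on top of three FPN outputs, using concatenation-based fusion as in YOLOv4. The channel counts, strides and layer names are simplifying assumptions, not the exact YOLOv4 neck.

```python
# Simplified sketch of bottom-up path augmentation (assumed sizes and names).
import torch
import torch.nn as nn

class BottomUpAugmentation(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        # Stride-2 convolutions carry fine-grained low-level information upward.
        self.down1 = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
        self.down2 = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
        # 1x1 convolutions reduce channels after concatenation-based fusion.
        self.fuse1 = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.fuse2 = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, p3, p4, p5):
        n3 = p3                                                   # highest-resolution FPN output
        n4 = self.fuse1(torch.cat([self.down1(n3), p4], dim=1))   # fuse with next FPN level
        n5 = self.fuse2(torch.cat([self.down2(n4), p5], dim=1))
        return n3, n4, n5                                          # fed to the detection heads

if __name__ == "__main__":
    p3 = torch.randn(1, 256, 52, 52)
    p4 = torch.randn(1, 256, 26, 26)
    p5 = torch.randn(1, 256, 13, 13)
    for n in BottomUpAugmentation()(p3, p4, p5):
        print(n.shape)
```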

Spatial Pyramid Pooling (SPP) is another feature of YOLOv4 and YOLOv4-CSP; it eliminates the need for a fixed-size input image, making these models more robust and practical. SPP is added on top of the last convolutional layer of YOLOv4 and YOLOv4-CSP. It pools the features and generates fixed-length outputs, which are then fed into the classifier layer (Figure 8). In the original formulation, the pooling is done by spatially dividing the feature maps into d × d equal blocks, where d can be {1, 2, 3, …}; these different scales of division form the spatial pyramid. Max pooling is then applied at each level of division, and the results are concatenated into a 1D vector. SPP works similarly in the YOLOv4 models, except that padding keeps the output feature map the same size as the input feature map.

Figure 8. Spatial Pyramid Pooling in YOLOv4.
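The sketch below shows an SPP block in the YOLOv4 style: stride-1 max pooling with several kernel sizes, each padded so that the output keeps the input spatial size, followed by concatenation along the channel axis. The kernel sizes (5, 9, 13) follow a common YOLOv4 configuration and should be treated as an assumption here.

```python
# Minimal sketch of a YOLOv4-style SPP block (assumed kernel sizes 5, 9, 13).
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        # Stride-1 max pooling with padding k // 2 preserves the spatial size.
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the input with pooled versions at multiple receptive-field scales.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

if __name__ == "__main__":
    x = torch.randn(1, 512, 13, 13)
    print(SPPBlock()(x).shape)  # torch.Size([1, 2048, 13, 13])
```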
