4.1. YOLOv5

Wojciech Lindenheim-Locher
Adam Świtoński
Tomasz Krzeszowski
Grzegorz Paleta
Piotr Hasiec
Henryk Josiński
Marcin Paszkuta
Konrad Wojciechowski
Jakub Rosner

The high performance of deep neural networks in image recognition stems from the availability of huge datasets and extensive computational resources. Their ability to generalize comes from multilayer architectures that combine convolutional and fully connected layers. The neuron weights are updated by gradient descent, with the gradients computed by the backpropagation algorithm.
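As a minimal sketch of the weight update described above, the following toy example trains a two-weight network with hand-written backpropagation. The network shape, the `tanh` activation, the squared-error loss, the learning rate, and all function names are assumptions chosen for illustration; they are not taken from the article or from YOLOv5.

```python
import math


def forward(x, w1, w2):
    """Tiny two-layer network: hidden unit h = tanh(w1 * x), output y = w2 * h."""
    h = math.tanh(w1 * x)
    y = w2 * h
    return h, y


def train_step(x, target, w1, w2, lr=0.1):
    """One gradient descent step; gradients come from the chain rule (backpropagation)."""
    h, y = forward(x, w1, w2)
    loss = 0.5 * (y - target) ** 2

    # Backward pass: propagate the error from the output towards the input.
    dy = y - target                    # dLoss/dy
    grad_w2 = dy * h                   # dLoss/dw2
    dh = dy * w2                       # dLoss/dh
    grad_w1 = dh * (1.0 - h * h) * x   # dLoss/dw1, using d tanh(u)/du = 1 - tanh(u)^2

    # Gradient descent update: move each weight against its gradient.
    return w1 - lr * grad_w1, w2 - lr * grad_w2, loss


def train(x=1.0, target=0.5, w1=0.5, w2=0.5, steps=50):
    """Repeat the update and record the loss before each step."""
    losses = []
    for _ in range(steps):
        w1, w2, loss = train_step(x, target, w1, w2)
        losses.append(loss)
    return w1, w2, losses


if __name__ == "__main__":
    _, _, history = train()
    print(f"first loss: {history[0]:.5f}, last loss: {history[-1]:.5f}")
```

Real detectors apply the same idea to millions of weights, with automatic differentiation in place of the hand-derived gradients shown here.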

One of the most popular algorithms using convolutional neural networks for object detection in images and videos is YOLO (You Only Look Once). Among other applications, it has been used in face mask recognition [26], object detection on drone-captured scenarios [27], and heavy goods vehicle detection [28]. What sets it apart from most other solutions is its speed. It is a single-shot algorithm, meaning that it predicts all objects in an image or video frame in a single pass. This makes it well suited to real-time object detection on video, where speed is critical.
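Because a single pass yields candidate boxes for the whole frame at once, single-shot detectors typically follow it with a light filtering step. The article does not describe this step, so the sketch below is an assumption about typical post-processing: confidence thresholding followed by greedy non-maximum suppression. The function names, the `(x1, y1, x2, y2)` box layout, and the threshold values are illustrative only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(preds, conf_thres=0.25, iou_thres=0.45):
    """Keep confident boxes and drop those overlapping a higher-scoring kept box.

    preds is a list of (box, score) pairs, as one forward pass might produce.
    """
    # Discard low-confidence candidates, then visit the rest from best to worst.
    candidates = sorted(
        (p for p in preds if p[1] >= conf_thres),
        key=lambda p: p[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(box, kept_box) <= iou_thres for kept_box, _ in kept):
            kept.append((box, score))
    return kept


if __name__ == "__main__":
    raw = [
        ((0, 0, 10, 10), 0.9),    # strong detection
        ((1, 1, 11, 11), 0.8),    # near-duplicate of the first box
        ((20, 20, 30, 30), 0.7),  # separate object
        ((0, 0, 5, 5), 0.1),      # below the confidence threshold
    ]
    for box, score in filter_detections(raw):
        print(box, score)
```

In this example the near-duplicate and the low-confidence box are discarded, leaving one box per object.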

The YOLOv5 version, which uses PyTorch instead of the Darknet framework, was selected. The network structure contains three main components: a backbone, a neck, and a head, as shown in Figure 8. The backbone uses the new CSP-Darknet53 architecture. It is built from C3 layers, a simplified version of the CSP Bottleneck layer obtained by removing one of its four main convolutions. To reduce the number of parameters, the gradient flow is truncated. CSP networks preserve DenseNet’s feature reuse while reducing the redundant gradient information that would otherwise occur, which helps to increase the inference speed [29].

The neck combines a modified version of PANet (Path Aggregation Network), built with C3 layers, and the SPPF (Spatial Pyramid Pooling Fast) block [30]. SPPF is an improved, faster version of the popular SPP with an increased flow of information, which makes it easier to locate pixels correctly.

The head is the same as the one used in YOLOv3 and YOLOv4. It contains three convolution layers that predict the locations of the bounding boxes and calculate the scores. In our case, the head was adapted by transfer learning: training starts from weights pre-trained on the COCO dataset. Transfer learning may result in a less precise network, because the pre-trained weights were originally adapted to a different detection problem. However, in most cases, it gives satisfactory results with fewer training samples and a lower computational cost, and it reduces the probability of overfitting.
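The relationship between SPP and SPPF mentioned above can be illustrated with a small sketch. In the public YOLOv5 code, SPP runs max pooling with 5, 9, and 13 pixel windows in parallel, while SPPF chains three 5 pixel poolings and reuses each intermediate result. The pure-Python code below checks that both produce the same pooled maps on a toy grid. It covers only the pooling stage and omits the surrounding convolutions and the channel concatenation; the function names and the test grid are assumptions for illustration.

```python
import random


def max_pool(grid, k):
    """Stride-1 max pooling with a k x k window.

    Windows are clipped at the borders, which matches padding with -inf.
    """
    r = k // 2
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(
                grid[i][j]
                for i in range(max(0, y - r), min(rows, y + r + 1))
                for j in range(max(0, x - r), min(cols, x + r + 1))
            )
            for x in range(cols)
        ]
        for y in range(rows)
    ]


def spp_pools(grid):
    """SPP-style pooling: three independent windows of growing size."""
    return [max_pool(grid, k) for k in (5, 9, 13)]


def sppf_pools(grid):
    """SPPF-style pooling: one small window applied three times in sequence."""
    p1 = max_pool(grid, 5)
    p2 = max_pool(p1, 5)   # covers the same area as a single 9 x 9 window
    p3 = max_pool(p2, 5)   # covers the same area as a single 13 x 13 window
    return [p1, p2, p3]


if __name__ == "__main__":
    random.seed(0)
    feature_map = [[random.random() for _ in range(16)] for _ in range(16)]
    print(spp_pools(feature_map) == sppf_pools(feature_map))
```

The chained form visits 25 cells per output value at each stage instead of 81 or 169, which suggests why the sequential variant is the faster of the two.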

Figure 8. Simplified architecture of the YOLOv5.
