
The GA module allows for the direct training of sparse and shape-variable anchors. In general, anchors can be represented by four parameters (x, y, w, h), which correspond to the coordinates of the anchor’s center point as well as its width and height. The probability distribution of anchors can be decomposed into two conditional probability distributions, described in Eq 1.
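Eq 1 is referenced but not reproduced in this excerpt; in the standard guided-anchoring formulation, the factorization takes the form:

```latex
p(x, y, w, h \mid I) = p(x, y \mid I)\, p(w, h \mid x, y, I)
```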

p(x, y|I) represents the probability distribution of anchor center points given the feature map, and p(w, h|x, y, I) represents the probability distribution of anchor width and height given the feature map and a center point. Following Eq 1, the GA module is designed with two sub-networks: the location prediction sub-network NL and the shape prediction sub-network NS. The structure of the GA module is illustrated in Fig 5.

The objective of the location prediction branch in GA is to predict which regions should be selected as anchor center points. In the location prediction sub-network, the entire feature map is divided into object center regions, peripheral regions, and ignored regions. The region corresponding to the center of a ground truth box on the feature map is marked as an object center region and treated as a positive sample during training; the remaining regions are labeled as ignored or negative samples according to their distance from the center. Through location prediction, only a small subset of positions is selected as candidate anchor centers, significantly reducing the number of anchors.
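The region labeling above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the Chebyshev-distance criterion, and the radius thresholds `r_center` and `r_ignore` are assumptions made for clarity.

```python
import numpy as np

def location_targets(feat_h, feat_w, gt_centers, stride,
                     r_center=1, r_ignore=3):
    """Label each feature-map cell as positive (1), ignored (-1),
    or negative (0) by its distance to the nearest projected
    ground-truth box center.

    gt_centers: list of (cx, cy) box centers in image coordinates.
    """
    labels = np.zeros((feat_h, feat_w), dtype=np.int8)  # default: negative
    ys, xs = np.mgrid[0:feat_h, 0:feat_w]
    for cx, cy in gt_centers:
        # project the ground-truth center onto the feature map
        fx, fy = cx / stride, cy / stride
        d = np.maximum(np.abs(xs - fx), np.abs(ys - fy))  # Chebyshev distance
        labels[(d <= r_ignore) & (labels == 0)] = -1      # peripheral: ignore
        labels[d <= r_center] = 1                          # center region: positive
    return labels
```

Only cells labeled 1 go on to serve as candidate anchor centers, which is what makes the resulting anchor set sparse.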

The purpose of the shape prediction branch is to predict the optimal width and height of an anchor given its center point, since the sizes of defects vary considerably. IoU is used as the supervision signal to learn the width and height values; since IoU is differentiable, the network can be trained to maximize it. The matching between anchors and ground truth boxes is defined by Eq 2:
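Eq 2 is not reproduced in this excerpt; in guided anchoring, anchor-to-ground-truth matching is typically defined through the variable IoU (vIoU), which maximizes the ordinary IoU over the anchor's free width and height:

```latex
\mathrm{vIoU}(a_{wh}, \mathrm{gt}) = \max_{w > 0,\, h > 0} \mathrm{IoU}(a_{wh}, \mathrm{gt})
```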

awh represents an anchor with width w and height h at the given center point, and gt represents the ground truth box. Because w and h span a large numerical range, learning them directly is difficult, so they are transformed using Eq 3:
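Eq 3 is not shown in this excerpt; the standard guided-anchoring transform maps the network outputs dw and dh to w and h through an exponential:

```latex
w = \sigma \cdot s \cdot e^{dw}, \qquad h = \sigma \cdot s \cdot e^{dh}
```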

σ is an empirical scale factor (set to 8), and s is the stride. The shape prediction branch outputs dw and dh, which are then mapped to w and h.
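As a concrete illustration of this mapping, here is a minimal sketch; the function name and default are hypothetical, with σ = 8 taken from the text.

```python
import math

def decode_shape(dw, dh, stride, sigma=8.0):
    """Map the shape branch outputs (dw, dh) to an anchor's
    width and height via w = sigma * stride * exp(dw)."""
    w = sigma * stride * math.exp(dw)
    h = sigma * stride * math.exp(dh)
    return w, h
```

At dw = dh = 0 the anchor defaults to sigma * stride on a side; the exponential keeps w and h strictly positive while letting small network outputs cover a large range of sizes.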

In the RPN, anchors are uniformly distributed across the entire image in a sliding-window manner, and every position uses features with the same receptive field. In GA, however, anchor sizes are not fixed, and features with different receptive fields are needed to accommodate the different anchor sizes. To achieve this, feature adaptation is applied, given by Eq 4:
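Eq 4 is not reproduced in this excerpt; in guided anchoring the feature adaptation typically takes the form below, where fi is the feature at position i and (wi, hi) is the predicted anchor shape there:

```latex
f'_i = N_T(f_i, w_i, h_i)
```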

NT represents a 3 × 3 deformable convolution, and i indexes the anchor positions. The shape information generated by the shape prediction sub-network is incorporated into the original feature map through this deformable convolution. Finally, we employ a multi-task loss for end-to-end training, with the loss function given in Eq 5:
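Eq 5 is not shown in this excerpt; the guided-anchoring multi-task loss is commonly written as follows, where λ1 and λ2 are balancing weights:

```latex
L = \lambda_1 L_{loc} + \lambda_2 L_{shape} + L_{cls} + L_{reg}
```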

Lcls represents the classification loss, Lreg represents the regression loss, Lloc represents the location prediction loss, and Lshape represents the shape prediction loss.
