R-FCN is a typical two-stage object detection method. In the first stage, a Region Proposal Network (RPN) generates candidate RoIs. In the second stage, R-FCN uses position-sensitive score maps to aggregate features from different positions of each RoI, so that the network resolves the dilemma between translation invariance in classification and translation variance in object detection. At the same time, all learnable weight layers are convolutional and are computed over the whole image. The entire network is therefore fully convolutional, which significantly improves efficiency.
The overall architecture of metallic stent strut detection based on R-FCN is shown in Figure 4. After features are extracted through a series of convolutions in ResNet-50, a Region Proposal Network (RPN) uses a small sliding window and anchor boxes to generate candidate regions on the whole feature map. For each category (metallic stent strut and background), the feature map of the entire image is convolved into a bank of 3×3 position-sensitive score maps. By RoI pooling over the 9 position-sensitive scores and voting, the category probability of each RoI is obtained; the four localization parameters representing the offsets from the anchor boxes are obtained by voting in the same way. After training, R-FCN outputs the adjusted positions and scores of the metallic stent strut RoIs as the "R-FCN output." Any RoI whose category score falls below the score threshold is removed, giving the "Threshold output." The remaining bounding boxes still overlap heavily, so non-maximum suppression (NMS) is applied: wherever the IoU between boxes exceeds a threshold, only the box with the highest score is kept. The surviving bounding boxes form the final "Detection result."
Figure 4. Architecture of metallic stent strut detection based on R-FCN.
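The thresholding and NMS steps can be summarized in a short sketch. This is a minimal NumPy illustration rather than the authors' implementation; the function names and the default threshold values (score_thresh, iou_thresh) are placeholder assumptions, not the values used in this work.

```python
import numpy as np

def nms(boxes, scores, iou_thresh):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and drop remaining boxes whose IoU with it exceeds iou_thresh."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]                  # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the kept box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]        # discard heavy overlaps
    return np.array(keep, dtype=int)

def postprocess(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """'R-FCN output' -> score threshold -> NMS -> 'Detection result'."""
    mask = scores >= score_thresh                   # "Threshold output"
    boxes, scores = boxes[mask], scores[mask]
    keep = nms(boxes, scores, iou_thresh)
    return boxes[keep], scores[keep]
```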
RPN uses a fully convolutional network to output a set of rectangular region proposals over the entire feature map in one pass. A small window slides over the feature map, and each region it covers serves as input. With k (k = 9) anchor boxes as regression references, each sliding-window position outputs 4k coordinate regressions (tx, ty, tw, th) and 2k classification scores estimating the probability that each proposal is an object or not, as sketched below.
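A minimal PyTorch sketch of such an RPN head follows, assuming the sliding window is realized as a 3×3 convolution over a 1024-channel ResNet-50 feature map; the intermediate channel count (512) and the class/module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sliding-window RPN head: a 3x3 conv realizes the sliding window,
    then two sibling 1x1 convs emit 2k objectness scores and 4k box
    regressions (tx, ty, tw, th) at every feature-map location."""
    def __init__(self, in_channels=1024, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)  # object vs. not, per anchor
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)  # box offsets, per anchor

    def forward(self, feat):
        h = torch.relu(self.conv(feat))
        return self.cls(h), self.reg(h)

# e.g. a 1024-channel feature map from the ResNet-50 backbone
feat = torch.randn(1, 1024, 38, 50)
cls_scores, box_deltas = RPNHead()(feat)  # shapes (1, 18, 38, 50) and (1, 36, 38, 50)
```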
The RPN loss function consists of two parts, the log classification loss and the smooth L1 regression loss:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{\text{cls}}} \sum_i L_{\text{cls}}(p_i, p_i^*) + \lambda \frac{1}{N_{\text{reg}}} \sum_i p_i^* L_{\text{reg}}(t_i, t_i^*),$$
where the smooth L1 loss is defined by

$$\text{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1, \\ |x| - 0.5 & \text{otherwise.} \end{cases}$$
Here {p_i} and {t_i} are the outputs of the classification layer and the regression layer for anchor i. During training, each anchor is assigned a label p_i* according to its IoU with the ground-truth boxes: positive anchors are labeled 1 and negative anchors 0. t_i* is the vector of ground-truth box coordinates associated with a positive anchor.
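A sketch of this loss in PyTorch, assuming the anchors have already been sampled and labeled; the normalization by N_cls and N_reg is approximated here by taking means, and the function name and signature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def rpn_loss(p, t, p_star, t_star, lam=1.0):
    """p: (N, 2) classification logits; t: (N, 4) box regressions;
    p_star: (N,) anchor labels (1 positive, 0 negative, -1 ignored);
    t_star: (N, 4) ground-truth regression targets."""
    sampled = p_star >= 0
    # log classification loss over the sampled anchors
    l_cls = F.cross_entropy(p[sampled], p_star[sampled])
    # smooth L1 regression loss over positive anchors only (the p_i* factor)
    pos = p_star == 1
    l_reg = F.smooth_l1_loss(t[pos], t_star[pos]) if pos.any() else t.sum() * 0
    return l_cls + lam * l_reg
```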
RPN relies only on a single-scale image and feature map and uses filters of a single size, so the region proposals it generates are translation invariant. Because the features are shared, handling objects at multiple scales incurs no additional cost.
The key innovation of R-FCN is the position-sensitive score map. Both object classification and localization use 3×3 banks of score maps. Take the position-sensitive score maps for stent strut classification as an example: the 9 score maps correspond to the features of nine relative positions of the strut. Each RoI is divided into 3×3 bins, and position-sensitive RoI pooling operates only over the corresponding bin of each score map:

$$r_c(i, j \mid \Theta) = \frac{1}{n} \sum_{(x, y) \in \text{bin}(i, j)} z_{i,j,c}(x + x_0,\, y + y_0 \mid \Theta),$$

where z_{i,j,c} is the score map assigned to bin (i, j) for category c, (x_0, y_0) is the top-left corner of the RoI, n is the number of pixels in the bin, and Θ denotes the network parameters.
The nine pooled responses vote on the RoI by averaging; the classification probability of the RoI is then output by the softmax function.
Bounding box regression is similar, except that the output after voting is the 4-d vector (tx, ty, tw, th). A sketch of the pooling and voting follows.
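A minimal single-RoI PyTorch sketch of position-sensitive RoI pooling and voting, assuming the RoI coordinates are already in feature-map units; the function name and tensor layout are assumptions for illustration.

```python
import torch

def ps_roi_pool_vote(score_maps, roi, k=3):
    """score_maps: (k*k, C, H, W) position-sensitive maps, one per bin;
    roi: (x0, y0, x1, y1) in feature-map coordinates.
    Pools bin (i, j) of the RoI only from its own score map, then
    averages the k*k responses to produce the per-class vote."""
    x0, y0, x1, y1 = roi
    bw, bh = (x1 - x0) / k, (y1 - y0) / k             # bin width and height
    responses = []
    for i in range(k):                                 # bin row
        for j in range(k):                             # bin column
            ya, yb = int(y0 + i * bh), int(y0 + (i + 1) * bh)
            xa, xb = int(x0 + j * bw), int(x0 + (j + 1) * bw)
            bin_feats = score_maps[i * k + j, :, ya:max(yb, ya + 1), xa:max(xb, xa + 1)]
            responses.append(bin_feats.mean(dim=(1, 2)))  # average pool the bin
    return torch.stack(responses).mean(dim=0)          # (C,) averaged vote

# classification: C = 2 (strut, background); softmax of the vote gives probabilities
score_maps = torch.randn(9, 2, 64, 64)
vote = ps_roi_pool_vote(score_maps, (10, 12, 40, 45))
probs = torch.softmax(vote, dim=0)
```

For bounding-box regression the same pooling is applied to a 4-channel bank of 3×3 score maps, and the averaged vote is read directly as (tx, ty, tw, th). For the batched case, torchvision provides this operation as torchvision.ops.ps_roi_pool.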
The loss function for each RoI includes a cross-entropy loss for classification and a regression loss on the location of positive samples:

$$L(s, t_{x,y,w,h}) = L_{\text{cls}}(s_{c^*}) + \lambda\, [c^* > 0]\, L_{\text{reg}}(t, t^*).$$
The regression loss is the same as in RPN. c* denotes the label of the RoI, and the indicator [c* > 0] equals 1 if the label is positive and 0 otherwise.
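The per-RoI loss can be sketched in PyTorch as follows; this is an illustrative single-RoI version under the assumption that class 0 is the background, with hypothetical function and argument names.

```python
import torch
import torch.nn.functional as F

def rfcn_roi_loss(cls_logits, box_pred, c_star, t_star, lam=1.0):
    """cls_logits: (C+1,) class scores for one RoI (class 0 = background);
    box_pred: (4,) predicted (tx, ty, tw, th); c_star: scalar long label;
    t_star: (4,) ground-truth regression target."""
    # cross-entropy classification loss: -log softmax score of the true class
    l_cls = F.cross_entropy(cls_logits.unsqueeze(0), c_star.view(1))
    # the [c* > 0] indicator: regress only when the RoI is a positive sample
    l_reg = F.smooth_l1_loss(box_pred, t_star) if c_star.item() > 0 else box_pred.sum() * 0
    return l_cls + lam * l_reg

# e.g. a positive RoI labeled as a stent strut (class 1)
loss = rfcn_roi_loss(torch.randn(2), torch.randn(4),
                     torch.tensor(1), torch.randn(4))
```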