CARAFE performs upsampling through feature recombination: each output value is the dot product between a predicted upsampling kernel and the corresponding neighborhood of pixels in the input feature map. A basic upsampling structure with a small receptive field ignores useful surrounding information, so the receptive field needs to be enlarged. CARAFE, by contrast, aggregates information over a large receptive field during recombination and guides the recombination process with the input features themselves. At the same time, the CARAFE operator is compact, which meets the lightweight requirement. Specifically, the input feature map is used to predict a unique upsampling kernel for each output position, and features are then recombined according to these predicted kernels. CARAFE delivers significant performance improvements across various tasks while introducing only minimal additional parameters and computational overhead.
CARAFE consists of two primary modules: the upsampling kernel prediction module and the feature recombination module, as depicted in Figure 4. Assuming an upsampling multiplier σ and an input feature map with dimensions C × H × W, the process begins by predicting the upsampling kernels through the upsampling kernel prediction module. Subsequently, the feature recombination module is employed to complete the upsampling procedure, resulting in an output feature map with dimensions C × σH × σW.
The overall framework of CARAFE. CARAFE is composed of two key components, i.e., the kernel prediction module and the content-aware reassembly module. A feature map with size C × H × W is upsampled by a factor of σ (=2) in this figure.
Given an input feature map of shape C × H × W, the initial step is channel compression, reducing the channel number to C_m using a 1 × 1 convolution. The primary objective of this compression is to alleviate the computational burden on subsequent steps. Next come content encoding and upsampling kernel prediction, assuming an upsampling kernel size of k_up × k_up. It is worth noting that a larger upsampling kernel offers a broader receptive field, but it also entails a higher computational cost. Since each position in the output feature map requires its own upsampling kernel, the predicted kernels must have the shape σH × σW × k_up². Concretely, after compressing the input feature map, a convolutional layer with σ²k_up² output channels predicts the upsampling kernels: the number of input channels is C_m, and the number of output channels is σ²k_up². The channel dimension is then expanded across the spatial dimensions, yielding upsampling kernels with the shape σH × σW × k_up².
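The kernel prediction step can be sketched in NumPy as follows. This is a minimal illustration, not the original implementation: the function name and the bias-free weights are hypothetical, and the content encoder is simplified to a 1 × 1 convolution (CARAFE itself uses a larger k_encoder × k_encoder convolution). Following the CARAFE design, each predicted k_up × k_up kernel is normalized with a softmax so its weights sum to one.

```python
import numpy as np

def predict_kernels(x, w_compress, w_encode, sigma=2, k_up=5):
    """Sketch of CARAFE's kernel prediction module (hypothetical helper).

    x          : (C, H, W) input feature map
    w_compress : (C_m, C) weights of the 1x1 channel-compression conv
    w_encode   : (sigma^2 * k_up^2, C_m) weights of the content encoder,
                 simplified here to a 1x1 conv
    returns    : (k_up^2, sigma*H, sigma*W) softmax-normalized kernels
    """
    C, H, W = x.shape
    # Channel compression: C -> C_m (a 1x1 conv is a matmul over channels)
    xm = np.einsum('mc,chw->mhw', w_compress, x)
    # Predict sigma^2 * k_up^2 channels at each input location
    k = np.einsum('oc,chw->ohw', w_encode, xm)
    # Expand the channel dim across space (pixel shuffle):
    # (sigma^2*k_up^2, H, W) -> (k_up^2, sigma*H, sigma*W)
    k = k.reshape(k_up * k_up, sigma, sigma, H, W)
    k = k.transpose(0, 3, 1, 4, 2).reshape(k_up * k_up, sigma * H, sigma * W)
    # Softmax over each k_up x k_up kernel so its weights sum to 1
    k = np.exp(k - k.max(axis=0, keepdims=True))
    return k / k.sum(axis=0, keepdims=True)
```

The softmax normalization keeps the recombination a weighted average of input features, which stabilizes the magnitude of the upsampled map.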
Each location in the output feature map is mapped back to the corresponding location in the input feature map. The region centered on that location, of size k_up × k_up, is combined with the predicted upsampling kernel specific to that point via a dot product, producing the output value. It is worth noting that different channels at the same location share the same upsampling kernel.
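The recombination step described above can be sketched as follows. This is a naive loop-based illustration under stated assumptions (the function name is hypothetical, zero padding is used at the borders, and `kernels` is assumed normalized as in the prediction module); real implementations vectorize this with a dedicated CUDA kernel.

```python
import numpy as np

def reassemble(x, kernels, sigma=2, k_up=5):
    """Sketch of CARAFE's feature recombination (hypothetical helper).

    x       : (C, H, W) input feature map
    kernels : (k_up^2, sigma*H, sigma*W) per-location upsampling kernels
    returns : (C, sigma*H, sigma*W) upsampled feature map
    """
    C, H, W = x.shape
    r = k_up // 2
    # Zero-pad so every k_up x k_up neighborhood is defined (assumption)
    xp = np.pad(x, ((0, 0), (r, r), (r, r)))
    out = np.zeros((C, sigma * H, sigma * W))
    for i in range(sigma * H):
        for j in range(sigma * W):
            # Map the output location back to the input feature map
            ci, cj = i // sigma, j // sigma
            patch = xp[:, ci:ci + k_up, cj:cj + k_up]
            w = kernels[:, i, j].reshape(k_up, k_up)
            # Dot product of the neighborhood with the predicted kernel;
            # all channels share the same kernel at this location
            out[:, i, j] = (patch * w).sum(axis=(1, 2))
    return out
```

With normalized kernels, each output value is a convex combination of the k_up × k_up input neighborhood, so a constant feature map stays constant away from the zero-padded borders.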