Masked autoencoders are inspired by natural language processing techniques, where models are pre-trained by predicting missing words in a sentence52. In the imaging domain, starting from a large unlabeled image dataset, masked autoencoders pre-train deep learning models by dividing the images into patches, masking some of the patches, and training the model to reconstruct the original unmasked images. He et al.12 show that masked autoencoders outperform state-of-the-art contrastive methods when pre-training vision transformer models. However, applying masked autoencoders to convolutional models (CNNs) showed only moderate success, and contrastive methods remained superior8,26. This can be attributed to architectural differences between the two model types: while transformer models have a variable input size and can simply drop masked patches, CNNs have a fixed input size and must set masked patches to a fixed value. As evaluated by Tian et al.26, the sliding window kernels of CNNs overlap between masked and non-masked patches, which results in a loss of mask patterns after several convolutions. They hypothesize that this explains the moderate success of masked autoencoders for CNNs and address the challenge by using sparse convolutions53. The resulting model skips all masked positions, preventing vanishing mask patterns and ensuring a consistent masking ratio. They use these findings for their self-supervised pre-training approach "SparK". When pre-training a ResNet5042 model with ImageNet27 data, SparK outperforms all state-of-the-art contrastive methods26. Their approach is the first successful adaptation of masked autoencoders to CNNs.
We pre-train our model with CT slices by applying the self-supervised learning method SparK. We use the original PyTorch54 implementation from the publication. Details can be found in the Supplementary Information. In the following, we explain the SparK method.
SparK. Starting with a dataset of unlabeled images, each image is divided into non-overlapping square patches, and each patch is masked independently with a given probability. This probability is a hyperparameter called the "mask ratio". The images are converted to sparse images by sparsely gathering all unmasked patches. As shown in Fig. 3, the SparK model consists of an encoder, which can be any convolutional model, and a decoder. The encoder is transformed to perform submanifold sparse convolutions53, which only compute an output when the center of the sliding window kernel is covered by a non-empty element. This causes the encoder to skip all masked positions. The decoder is built in a U-Net55 design with three blocks of upsampling layers, receiving feature maps from the encoder at four different resolutions. This is referred to as "hierarchical" encoding and decoding. The empty parts of the encoder feature maps are filled with learnable mask embeddings M before being passed to the decoder, yielding dense feature maps. This is called "densifying". SparK further adds a projection layer between the encoder and decoder at every resolution in case their network widths differ. To reconstruct the image, a head module with two more upsampling layers is applied after the decoder to reach the original resolution. The model is trained with an L2 loss between the predicted and original images, computed only on masked positions. After pre-training, only the encoder is used for the downstream tasks. Its sparse convolutions can be applied directly to non-masked images without modification, since they reduce to normal convolutions when the input contains no masked patches.
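To make the masking step and the masked reconstruction loss concrete, the following is a minimal NumPy sketch of the idea described above: patch-wise random masking with a given mask ratio, and an L2 loss computed only on masked positions. All function names and shapes are illustrative assumptions; this is not the authors' PyTorch implementation, and the sparse encoder/decoder are omitted (a constant "prediction" stands in for the model output).

```python
import numpy as np

def random_patch_mask(img_size, patch_size, mask_ratio, rng):
    """Boolean mask over the patch grid: True = patch is masked (hidden)."""
    n = img_size // patch_size
    return rng.random((n, n)) < mask_ratio

def to_pixel_mask(patch_mask, patch_size):
    """Upsample a patch-level mask to pixel resolution."""
    return patch_mask.repeat(patch_size, axis=0).repeat(patch_size, axis=1)

def apply_mask(image, patch_mask, patch_size):
    """Zero out masked patches (a dense stand-in for the sparse masked input)."""
    return image * ~to_pixel_mask(patch_mask, patch_size)

def masked_l2_loss(pred, target, patch_mask, patch_size):
    """L2 reconstruction loss computed only on masked positions."""
    m = to_pixel_mask(patch_mask, patch_size)
    return np.mean((pred[m] - target[m]) ** 2)

# Toy usage: one 32x32 "CT slice", 8x8 patches, mask ratio 0.6.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
mask = random_patch_mask(32, 8, 0.6, rng)   # 4x4 patch grid
x_masked = apply_mask(x, mask, 8)
# A real model would reconstruct x from x_masked; we fake a prediction here.
pred = np.zeros_like(x)
loss = masked_l2_loss(pred, x, mask, 8)
```

Note that unmasked patches contribute nothing to the loss, so the model is never rewarded for trivially copying visible pixels.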
Illustration of the masked autoencoder method SparK for self-supervised pre-training. The input image is divided into patches that are randomly masked and sparsely gathered into a sparse masked image. This masked image is processed by a U-Net-shaped encoder-decoder model that is trained to reconstruct the original image. The encoder performs sparse convolutions that only compute an output when the center of the sliding window kernel is covered by a non-masked patch.