Masked autoencoders are inspired by natural language processing techniques, where models are pre-trained by predicting missing words in a sentence52. In the imaging domain, starting from a large unlabeled image dataset, masked autoencoders pre-train deep learning models by dividing the images into patches, masking some of the patches, and training the model to reconstruct the original unmasked images. He et al.12 show that masked autoencoders outperform state-of-the-art contrastive methods when pre-training vision transformer models. However, applying masked autoencoders to convolutional models (CNNs) showed only moderate success, and contrastive methods remained superior8,26. This can be attributed to architectural differences between the two model types: while transformer models have a variable input size and can simply drop masked patches, CNNs have a fixed input size and must set masked patches to a fixed value. As evaluated by Tian et al.26, the sliding window kernels of CNNs overlap between masked and non-masked patches, which results in a loss of mask patterns after several convolutions. They hypothesize that this explains the moderate success of masked autoencoders for CNNs and address the challenge by using sparse convolutions53. The resulting model skips all masked positions, preventing vanishing mask patterns and ensuring a consistent masking ratio. They use these findings for their self-supervised pre-training approach "SparK". When pre-training a ResNet5042 model with ImageNet27 data, SparK outperforms all state-of-the-art contrastive methods26. Their approach is the first successful adaptation of masked autoencoders to CNNs.
We pre-train our model with CT slices by applying the self-supervised learning method SparK. We use the original PyTorch54 implementation from the publication. Details can be found in the Supplementary Information. In the following, we explain the SparK method.
SparK. Starting with a dataset of unlabeled images, each image is divided into non-overlapping square patches, and each patch is masked independently with a given probability. This probability is a hyperparameter called the "mask ratio". The images are converted to sparse images by sparsely gathering all unmasked patches. As shown in Fig. 3, the SparK model consists of an encoder, which can be any convolutional model, and a decoder. The encoder is transformed to perform submanifold sparse convolutions53, which only compute an output when the center of the sliding window kernel is covered by a non-empty element. This causes the encoder to skip all masked positions. The decoder is built in a U-Net55 design with three blocks of upsampling layers, receiving feature maps from the encoder at four different resolutions. This is referred to as "hierarchical" encoding and decoding. The empty parts of the encoder feature maps are filled with learnable mask embeddings M before being passed to the decoder, yielding dense feature maps. This is called "densifying". SparK further adds a projection layer between the encoder and decoder at every resolution in case their network widths differ. To reconstruct the image, a head module with two more upsampling layers is applied after the decoder to reach the original resolution. The model is trained with an L2 loss between the predicted and original images, computed only on masked positions. After pre-training, only the encoder is used for the downstream tasks. Its sparse convolutions can be applied directly to non-masked images without modification, since they reduce to normal convolutions when the input contains no masked patches.
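To make the masking step and the masked reconstruction loss concrete, the following is a minimal NumPy sketch of the idea described above: patch-wise random masking with a given mask ratio, and an L2 loss computed only on masked positions. All function names and shapes are illustrative assumptions; this is not the authors' PyTorch implementation, and the sparse encoder/decoder are omitted (a constant "prediction" stands in for the model output).

```python
import numpy as np

def random_patch_mask(img_size, patch_size, mask_ratio, rng):
    """Boolean mask over the patch grid: True = patch is masked (hidden)."""
    n = img_size // patch_size
    return rng.random((n, n)) < mask_ratio

def to_pixel_mask(patch_mask, patch_size):
    """Upsample a patch-level mask to pixel resolution."""
    return patch_mask.repeat(patch_size, axis=0).repeat(patch_size, axis=1)

def apply_mask(image, patch_mask, patch_size):
    """Zero out masked patches (a dense stand-in for the sparse masked input)."""
    return image * ~to_pixel_mask(patch_mask, patch_size)

def masked_l2_loss(pred, target, patch_mask, patch_size):
    """L2 reconstruction loss computed only on masked positions."""
    m = to_pixel_mask(patch_mask, patch_size)
    return np.mean((pred[m] - target[m]) ** 2)

# Toy usage: one 32x32 "CT slice", 8x8 patches, mask ratio 0.6.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
mask = random_patch_mask(32, 8, 0.6, rng)   # 4x4 patch grid
x_masked = apply_mask(x, mask, 8)
# A real model would reconstruct x from x_masked; we fake a prediction here.
pred = np.zeros_like(x)
loss = masked_l2_loss(pred, x, mask, 8)
```

Note that unmasked patches contribute nothing to the loss, so the model is never rewarded for trivially copying visible pixels.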
Illustration of the masked autoencoder method SparK for self-supervised pre-training. The input image is divided into patches that are randomly masked and sparsely gathered into a sparse masked image. This masked image is processed by a U-Net-shaped encoder-decoder model that is trained to reconstruct the original image. The encoder performs sparse convolutions that only compute an output when the center of the sliding window kernel is covered by a non-masked patch.