From a data-driven perspective, the disparities between digital histopathology image stainings caused by changes in preparation methods and scanners shift the distribution of the generated images. Training machine learning models with images created from only a subset of preparation methods, scanners, or centers yields vulnerable models that might fail to correctly classify near out-of-distribution samples with staining changes. Discriminative learning methods for classification, such as neural networks or support vector machines, perform well when training and test data are drawn from the same distribution. Therefore, when a model is trained on a set from one distribution and then tested on another, its performance is hindered by the dissimilarity between the training and test distributions. This is a well-known problem in machine learning called domain adaptation (DA), an active research area for which many novel frameworks have been proposed (Crammer et al., 2006; Ben-David et al., 2007; Tzeng et al., 2017).
In Ben-David et al. (2007), the authors provided a framework to analyze the contribution of domain adaptation to the generalization of models by learning features that account for the domain disparity between the training and test set distributions. The theory of DA suggests that a good representation for cross-domain transfer is one from which the algorithm cannot learn to identify the domain of origin of the input observation. This led the authors of Ganin et al. (2016) to propose a concrete implementation of this idea in the context of deep neural network models. The objective of domain adversarial neural network (DANN) models is to learn features that do not take into account the domain of the training samples. The domain adversarial features combine discriminativeness and domain invariance in the same representation. To build such representations, two classifiers are involved: (i) the label classifier, which predicts the main task classes (e.g., mitosis/non-mitosis) at training and testing time, and (ii) the domain classifier, which discriminates between the several domains. Both classifiers can be tied to the same set of features, giving two outputs with separate loss functions: one measures the error in classifying the class of the sample correctly, and the second measures the error in classifying the origin of the sample (i.e., the domain).
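As an illustration of this two-headed structure, consider the following minimal sketch, assuming PyTorch; the class name TwoHeadNet, the backbone argument, and the layer sizes are illustrative choices of ours, not the architecture used in the experiments:

```python
import torch.nn as nn

class TwoHeadNet(nn.Module):
    """Two classifiers tied to one shared feature representation (theta_f)."""

    def __init__(self, backbone: nn.Module, feat_dim: int, n_domains: int):
        super().__init__()
        self.features = backbone                            # shared representation (theta_f)
        self.task_head = nn.Linear(feat_dim, 2)             # mitosis / non-mitosis (theta_y)
        self.domain_head = nn.Linear(feat_dim, n_domains)   # center of origin (theta_d)

    def forward(self, x):
        f = self.features(x)
        return self.task_head(f), self.domain_head(f)

# Two outputs, two separate losses computed on the same features:
task_loss_fn = nn.CrossEntropyLoss()    # error on the class of the sample
domain_loss_fn = nn.CrossEntropyLoss()  # error on the origin (domain) of the sample
```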
The optimization objective of a DANN model is to find a saddle point of the task classifier parameters $\theta_y$, the domain classifier parameters $\theta_d$, and the domain-invariant feature parameters $\theta_f$. The loss function for the $N = n + n'$ training samples is:

$$E(\theta_f, \theta_y, \theta_d) = \frac{1}{n}\sum_{i=1}^{n}\mathcal{L}_y^i(\theta_f, \theta_y) - \lambda\left(\frac{1}{n}\sum_{i=1}^{n}\mathcal{L}_d^i(\theta_f, \theta_d) + \frac{1}{n'}\sum_{i=n+1}^{N}\mathcal{L}_d^i(\theta_f, \theta_d)\right) \tag{3}$$
where $\mathcal{L}_y$ and $\mathcal{L}_d$ are the task and domain loss functions, respectively. Both the $n$ and $n'$ samples are drawn from the dataset, with the difference that the $n$ samples are training samples for which the task label is available, whereas for the $n'$ samples only the domain is known. It is worth noting that the $n'$ samples cover the different domains present in training but can also include samples from the test set, from which only the domain label is needed to adapt the shared feature representation. In Ganin et al. (2016) the authors find that such saddle points can be found using the following iterative stochastic gradient updates:

$$\theta_f \leftarrow \theta_f - \mu\left(\frac{\partial \mathcal{L}_y^i}{\partial \theta_f} - \lambda\frac{\partial \mathcal{L}_d^i}{\partial \theta_f}\right) \tag{4}$$

$$\theta_y \leftarrow \theta_y - \mu\frac{\partial \mathcal{L}_y^i}{\partial \theta_y} \tag{5}$$

$$\theta_d \leftarrow \theta_d - \mu\lambda\frac{\partial \mathcal{L}_d^i}{\partial \theta_d} \tag{6}$$
where $\mu$ is the base learning rate for the task classifier and $\lambda$ is the domain learning rate multiplier, which allows the domain-dependent features to be learned at a different pace from the task-related ones. Equations (5) and (6) optimize the parameters of the task and domain loss functions, respectively, as in a classical multi-output DCNN scenario. With respect to the shared features, the loss of the task classifier is minimized while the loss of the domain classifier is maximized. This is done in Equation (4), which is optimized in an adversarial manner: the feature parameters move along the negative gradient of the task loss, minimizing it, and along the positive gradient of the domain loss, maximizing the domain classifier loss.
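In practice, the opposing gradient directions of Equations (4)–(6) are typically implemented with a gradient reversal layer, as in Ganin et al. (2016). The sketch below is a minimal version of such a layer, assuming PyTorch; the names GradReverse and grad_reverse are illustrative. The layer is the identity in the forward pass and multiplies the incoming gradient by $-\lambda$ in the backward pass, so a standard optimizer step on the sum of the task and domain losses descends on the task loss, trains the domain classifier on its own loss, and pushes the shared features toward maximizing the domain loss:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses and scales gradients by -lambda."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The shared features receive the reversed, lambda-scaled gradient of
        # the domain loss; no gradient is needed for the scalar lam.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```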
A simplified architecture scheme is shown in Figure 7, where the task and domain classifiers sit on top of a shared representation that maximizes the classification performance in separating mitotic from non-mitotic images while minimizing the ability to recover the center (domain) in which the image was generated. If the model converges, the representation inferred for a test image avoids including any center-specific information. The original center of the test image can remain unknown, since this label is not needed and we are only interested in obtaining the mitosis/non-mitosis probability.
Figure 7. Domain adversarial scheme: a domain-balanced batch of images is passed as input to the network, which has two types of outputs: the task classification output and the domain classification output. The shared representation $\theta_f$ is optimal for the task classification and unable to discriminate between the $n$ domains.
The main characteristic of DANN models is that they do not require external models to perform staining normalization or color augmentation; instead, the task labels and the scanner or domain indicators are used to drive the training process and solve the task without depending explicitly on such models. This feature is also a potential drawback if one is interested in a staining quantification scenario, because DANN models do not output an explicit normalized image; the normalization is done in the feature space instead, which groups samples of the same class together regardless of the domain from which they were generated. Our experiments are inspired by the work of Lafarge et al. (2017), where the authors build a DCNN with a reverse gradient layer that aims to learn, in an adversarial manner, the mitotic probability of patches and their domain.
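To make this training procedure concrete, the sketch below performs one step on a domain-balanced batch in which every sample carries a domain label but only a subset carries a task label. It assumes PyTorch and reuses the hypothetical TwoHeadNet and grad_reverse defined in the earlier sketches; all names are illustrative:

```python
import torch
import torch.nn.functional as F

def dann_step(model, optimizer, images, task_labels, domain_labels,
              has_task_label, lam):
    """One adversarial training step on a batch drawn evenly from every domain.

    task_labels are used only where has_task_label is True (the n labeled
    samples); domain_labels are available for all N = n + n' samples.
    """
    optimizer.zero_grad()
    feats = model.features(images)
    task_logits = model.task_head(feats)
    # The gradient reversal sits in front of the domain head.
    domain_logits = model.domain_head(grad_reverse(feats, lam))

    task_loss = F.cross_entropy(task_logits[has_task_label],
                                task_labels[has_task_label])
    domain_loss = F.cross_entropy(domain_logits, domain_labels)

    # One backward pass: the reversal layer supplies the minus sign of
    # Equation (4), so the two losses can simply be added.
    (task_loss + domain_loss).backward()
    optimizer.step()
    return task_loss.item(), domain_loss.item()
```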
It was noted by Engilberge et al. (2017) that, in the context of DCNNs, models are sensitive to color changes and that, in the most common architectures, the color-sensitive units are located in the first layers of the network. This suggests that the stain-dependent information should be removed from the first layers of the architecture, which led us to place the gradient reversal before the fully connected layers of the network, as shown in the experimental setup in section 3.
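A possible wiring for this placement, assuming PyTorch and the grad_reverse sketch above, is shown below; the small convolutional backbone is illustrative only and not the architecture of section 3. The reversal is applied to the pooled convolutional features, so the adversarial signal reaches the early, color-sensitive layers, while each fully connected head remains specific to its own output:

```python
import torch.nn as nn

class DANNPatchClassifier(nn.Module):
    def __init__(self, n_domains, lam=1.0):
        super().__init__()
        self.lam = lam
        self.conv = nn.Sequential(                      # early, color-sensitive layers
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.task_fc = nn.Linear(64, 2)                 # mitosis / non-mitosis
        self.domain_fc = nn.Linear(64, n_domains)       # center of origin

    def forward(self, x):
        f = self.conv(x)
        # Gradient reversal inserted here: before the fully connected layers.
        return self.task_fc(f), self.domain_fc(grad_reverse(f, self.lam))
```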