Evaluation of Classification Performance

Hazem Abdelmotaal, Ahmed A. Abdou, Ahmed F. Omar, Dalia Mohamed El-Sebaity, Khaled Abdelazeem

Deep neural networks trained with a combination of real and synthesized images have potential advantages over networks trained with real images alone, including a larger quantity of data, better-diversified datasets, and reduced overfitting.

To gauge the performance gains obtained by employing pix2pix cGAN-based image augmentation, we benchmarked the images synthesized by the employed algorithm using the VGG-16 network. VGG-16 (also called OxfordNet) is a convolutional neural network 16 layers deep, named after the Visual Geometry Group at Oxford, which developed it. It was used to win the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2014.28 VGG-16 is widely employed in medical image classification tasks. The VGG-16 DCNN pretrained with ImageNet weights was used and customized for image classification. After modifying the input tensor shape so that the model accepts 512 × 512 input images, the last classifying layers of the model were truncated and replaced by a flatten layer, two fully connected layers (Dense 1 and Dense 2, 64 nodes each) separated by a dropout layer, and a final fully connected layer (Dense 3, 3 nodes) with softmax activation adapted to output the three image classes. The model architecture is shown in Figure 3.

Model hyperparameters were fine-tuned manually, searching for the best values of momentum, dropout rate, and learning rate for the various input data instances. The first momentum term (β1) ranged between 0.6 and 0.9, with the default second momentum term (β2) of 0.999. The learning rate was reduced by a factor of 0.2 whenever the validation loss stopped improving for three epochs, with a lower bound of 0.001, to prevent the loss from increasing at high learning rates and the subsequent drop in accuracy. The dropout rate ranged between 0.1 and 0.3.

Figure 3. The proposed custom VGG-16 model architecture used for 512 × 512 pixel image classification (created by Hazem Abdelmotaal). Conv2D = convolution layer + ReLU activation; Dense 1, 2 = fully connected layers + ReLU activation; Dense 3 = fully connected layer + Softmax activation.
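
For concreteness, a minimal Keras sketch of the customized classifier described above, assuming the Keras 2.3.1 / TensorFlow 2.0.0 stack cited later in this protocol; the dropout rate and β1 shown are single points from the search ranges given above, not final reported values.

from tensorflow.keras.applications import VGG16
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# VGG-16 convolutional base with ImageNet weights; the original
# classification head is dropped and the input reshaped to 512 x 512.
base = VGG16(weights="imagenet", include_top=False, input_shape=(512, 512, 3))

x = Flatten()(base.output)
x = Dense(64, activation="relu", name="dense_1")(x)  # Dense 1 (64 nodes)
x = Dropout(0.2)(x)                                  # rate searched in [0.1, 0.3]
x = Dense(64, activation="relu", name="dense_2")(x)  # Dense 2 (64 nodes)
outputs = Dense(3, activation="softmax", name="dense_3")(x)  # three classes

model = Model(base.input, outputs)
model.compile(optimizer=Adam(beta_1=0.9, beta_2=0.999),  # beta_1 searched in [0.6, 0.9]
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Learning-rate schedule described above: reduce by a factor of 0.2 when
# validation loss stalls for three epochs, bounded below at 0.001.
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.2,
                              patience=3, min_lr=0.001)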

We trained the aforementioned classifier using six different combinations of the original training set and synthesized 4-map refractive display images (Fig. 4).

Balanced original dataset (BO): A balanced version of the original training image samples, where the number of 4-map refractive display image samples per class was capped at the number of available original training image samples in the least represented class (K).

Imbalanced original dataset (IO): All available original training image samples were used.

Imbalanced original dataset with traditional augmentation (IOA): All available original training image samples were augmented by artificially increasing their number with traditional image augmentation, using slight rotation, width-shift, height-shift, and scaling, without vertical or horizontal flips, to avoid unrealistic image deformation (a sketch of this augmentation follows the list below).

Imbalanced original dataset partly augmented with synthesized images (IOPS): A concatenation of all available original training images and augmented synthesized images in the E and K classes, enlarging the number of image samples per class to equal the number of available original training images in the most plentifully represented class (N).

Imbalanced original dataset fully augmented with synthesized images (IOFS): A concatenation of all available original training images and augmented synthesized images in all classes, enlarging the number of image samples per class to twice the number of original training images in class N.

Balanced synthesized dataset (BS): A balanced set of 3000 synthesized images per class (9000 images in total) was used for training, without any of the original training images.

As the pix2pix model outputs images at 512 × 512 resolution, no rescaling of the synthesized images was needed before combining them with the original images. During training, we used a 0.3 validation split, the fraction of the training data reserved for validation.29 The model sets this fraction apart, does not train on it, and evaluates the loss and any model metrics on it at the end of each epoch. Class weights were fed to the model, imposing a cost penalty on minority-class misclassification; these penalties make the model pay more attention to the minority classes and prevent bias toward the majority class. The number of training iterations (epochs) was set to 10. The model was implemented using the Keras 2.3.1 and TensorFlow 2.0.0 libraries.20,22 Each trained model was used to classify the test set only once, and the classification metrics were recorded.
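
As an illustration, the traditional augmentation (IOA) and the class-weighted, validation-split training described above could be wired together in Keras as follows. The augmentation magnitudes, the x_train/y_train arrays, and the scikit-learn class-weight helper are assumptions for this sketch, not values or tools reported in the protocol; reduce_lr is the callback from the architecture sketch above.

import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Traditional augmentation for the IOA dataset: slight rotation,
# width/height shifts, and scaling; flips are disabled to avoid
# unrealistic deformation of the refractive maps.
augmenter = ImageDataGenerator(rotation_range=10,
                               width_shift_range=0.05,
                               height_shift_range=0.05,
                               zoom_range=0.05,
                               horizontal_flip=False,
                               vertical_flip=False)
# Materialize one augmented copy of the training set (illustrative).
x_aug, y_aug = next(augmenter.flow(x_train, y_train, batch_size=len(x_train)))

# Class weights impose a cost penalty on minority-class misclassification.
labels = np.argmax(y_train, axis=1)
weights = compute_class_weight("balanced", classes=np.unique(labels), y=labels)

# 0.3 validation split, 10 epochs, learning-rate callback as above.
model.fit(x_train, y_train,
          validation_split=0.3,
          epochs=10,
          class_weight=dict(enumerate(weights)),
          callbacks=[reduce_lr])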

Figure 4. The number of samples per training dataset. BO: balanced original dataset; BS: balanced synthesized dataset; IO: imbalanced original dataset; IOA: imbalanced original dataset with traditional augmentation; IOFS: imbalanced original dataset fully augmented with synthesized images; IOPS: imbalanced original dataset partly augmented with synthesized images; E: early keratoconus images; K: keratoconus images; N: normal cornea images.
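
To make the dataset recipes concrete, a small NumPy sketch of how, for example, the BO and IOPS training sets could be assembled; the per-class arrays orig_E, orig_K, orig_N, synth_E, and synth_K are hypothetical names for the original and synthesized images of each class.

import numpy as np

k = len(orig_K)  # size of the least represented class (K)

# BO: undersample every class to the size of class K.
bo = np.concatenate([orig_E[:k], orig_K[:k], orig_N[:k]])

# IOPS: top up E and K with synthesized images until each class
# matches the most plentifully represented class (N).
n = len(orig_N)
iops = np.concatenate([orig_E, synth_E[:n - len(orig_E)],
                       orig_K, synth_K[:n - len(orig_K)],
                       orig_N])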

The VGG-16 classifiers’ performance on the test set was analyzed based on model accuracy, precision, recall, F1 score, and receiver operating characteristic (ROC) curve analysis.30
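
A minimal sketch of how these test-set metrics could be computed; scikit-learn is not named in the protocol and is assumed here for illustration, as are the x_test and y_test arrays (one-hot ground truth).

import numpy as np
from sklearn.metrics import classification_report, roc_auc_score

# Each trained model scores the test set exactly once.
probs = model.predict(x_test)                 # softmax outputs
y_true = np.argmax(y_test, axis=1)
y_pred = np.argmax(probs, axis=1)

# Per-class precision, recall, and F1, plus overall accuracy.
print(classification_report(y_true, y_pred, target_names=["E", "K", "N"]))

# Multi-class ROC AUC (one-vs-rest) from the predicted probabilities.
print("ROC AUC:", roc_auc_score(y_true, probs, multi_class="ovr"))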
