First approach: Self-supervised neural network

Procedure

In our initial attempt, we designed a self-supervised neural network that learns to approximately convert an RGB image into an x-ray image. Figure 6 depicts a high-level abstraction of this proposed approach. Explicitly, our approach was based on the following principles:

1. The function $f_x(\cdot): y_k \mapsto x_k$ maps the visual image associated with detail $k$ onto the corresponding x-ray image.

2. The function $f_x$ is implemented using a CNN.

3. The function is learned by minimizing $\|x - (f_x(y_1) + f_x(y_2))\|_F^2$ (2), so that conceptually, the mapping $f_x(\cdot): y_k \mapsto x_k$ converts an RGB image into a corresponding x-ray in such a way that the linear superposition of the generated x-ray images corresponds to the available mixed x-ray.

4. The input corresponds to patches taken from $y_1$ and $y_2$, and the self-supervision is achieved through optimizing $f_x$ with respect to the counterpart patch from $x$.
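The cost in Eq. 2 can be illustrated with a minimal NumPy sketch. The function `toy_f_x` below is a hypothetical stand-in for the CNN (it simply averages the RGB channels) and is not the network used in the article; only the structure of the loss follows the text.

```python
import numpy as np

def mixture_loss(f_x, y1, y2, x):
    """Squared Frobenius norm of x - (f_x(y1) + f_x(y2)), as in Eq. 2."""
    residual = x - (f_x(y1) + f_x(y2))
    return float(np.sum(residual ** 2))

def toy_f_x(y):
    # Hypothetical stand-in for the CNN: average the three RGB channels,
    # mapping a 64 x 64 x 3 patch to a 64 x 64 x 1 patch.
    return y.mean(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
y1 = rng.random((64, 64, 3))  # synthetic RGB patch for detail 1
y2 = rng.random((64, 64, 3))  # synthetic RGB patch for detail 2

# A mixed x-ray that is exactly the sum of the two mapped patches gives zero loss.
x_consistent = toy_f_x(y1) + toy_f_x(y2)
loss_zero = mixture_loss(toy_f_x, y1, y2, x_consistent)

# Offsetting every pixel by 0.1 adds 0.1**2 per pixel to the loss.
loss_offset = mixture_loss(toy_f_x, y1, y2, x_consistent + 0.1)
```

Note that the loss only constrains the sum of the two outputs, so the separation itself has to come from the fact that the same mapping is applied to both RGB inputs.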

The original images $y_1$, $y_2$, and $x$ were divided into 64 × 64 patches with an overlap of 52 pixels, resulting in roughly 966 and 3168 patch triplets for details 1 and 2, respectively. That is, the input data were organized as $N$ RGB patch pairs $(y_1^j, y_2^j) \in \mathbb{R}^{64 \times 64 \times 3} \times \mathbb{R}^{64 \times 64 \times 3}$ with the corresponding target patches $x^j \in \mathbb{R}^{64 \times 64 \times 1}$.

We then constructed a seven-layer CNN with batch normalization and rectified linear unit (ReLU) activation layers between the convolution layers. The structure of the proposed network was inspired by that of pix2pix, an accepted design for image-to-image translation using conditional adversarial networks (38). Since its release, the pix2pix network model has attracted the attention of many internet users, including artists (39). In our case, because of the lack of training data, we were unable to perform supervised adversarial training. Hence, we used only the “generator” network, and after experimenting with various structures, we observed that using only the encoder part of the generator provided the best reconstruction for x-ray images. Furthermore, our model deliberately overfitted the data, as we were training and testing with the same dataset (i.e., self-supervised learning). Therefore, we avoided using any sort of regularizer in the network structure.
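The patch extraction described above can be sketched as follows. An overlap of 52 pixels between 64 × 64 patches implies a stride of 64 − 52 = 12 pixels. The image size used here (192 × 256) and the choice to drop border pixels that do not fit a full patch are assumptions made for illustration; the article does not state either.

```python
import numpy as np

PATCH = 64
OVERLAP = 52
STRIDE = PATCH - OVERLAP  # 12 pixels between consecutive patch origins

def patch_starts(length, patch=PATCH, stride=STRIDE):
    """Top-left coordinates of all full patches along one axis."""
    return list(range(0, length - patch + 1, stride))

def extract_patches(image, patch=PATCH, stride=STRIDE):
    """Stack all overlapping patches of an H x W x C image."""
    height, width = image.shape[:2]
    return np.stack([
        image[r:r + patch, c:c + patch]
        for r in patch_starts(height, patch, stride)
        for c in patch_starts(width, patch, stride)
    ])

# Hypothetical image size, chosen only to keep the example small.
y1_image = np.zeros((192, 256, 3))
y1_patches = extract_patches(y1_image)

rows = len(patch_starts(192))  # patch origins along the height
cols = len(patch_starts(256))  # patch origins along the width
```

Applying the same function to $y_1$, $y_2$, and $x$ with identical coordinates yields aligned patch triplets.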

For each of the seven convolutional layers (denoted by $l_1, l_2, \ldots, l_7$), we performed convolution with masks $\{M_{k,i}\}_{k=1}^{N_i}$, where the size of each mask was $5 \times 5 \times N_{i-1}$. Accordingly, the output of each of these layers is $N_i$ patches of size 64 × 64. We used $N_0 = 3$, as the input layer comprises RGB color patches; for $i = 1, 2, 3$ we used $N_i = 128$, and for $i = 4, 5, 6$ we used $N_i = 256$; lastly, in the final layer providing the reconstructed x-ray, $N_7 = 1$ to obtain a single 64 × 64 patch as the final outcome (see the network architecture in Fig. 6). Explicitly, given an input patch $p \in \mathbb{R}^{64 \times 64 \times 3}$, the output of the layers is defined as $l_{i,k} = l_{i-1} * M_{k,i} + c_{i,k}$ (3), where $l_i \in \mathbb{R}^{64 \times 64 \times N_i}$ comes from stacking the $l_{i,k}$ after batch normalization and activation, $l_0 = p$, and $c_{i,k}$ is a scalar bias parameter.
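As a rough sanity check on the size of this architecture, the number of convolution weights and biases can be counted directly from the channel widths $N_0, \ldots, N_7$ and the 5 × 5 mask size given above. This sketch counts only the masks $M_{k,i}$ and biases $c_{i,k}$; batch normalization parameters are left out, since the article does not specify them.

```python
KERNEL = 5
# Channel widths N_0 .. N_7 as stated in the text.
CHANNELS = [3, 128, 128, 128, 256, 256, 256, 1]

def conv_param_count(n_in, n_out, kernel=KERNEL):
    """Weights of n_out masks of size kernel x kernel x n_in, plus one bias each."""
    return kernel * kernel * n_in * n_out + n_out

# One entry per convolutional layer l_1 .. l_7.
layer_params = [
    conv_param_count(CHANNELS[i - 1], CHANNELS[i])
    for i in range(1, len(CHANNELS))
]
total_params = sum(layer_params)

for i, count in enumerate(layer_params, start=1):
    print(f"l{i}: {CHANNELS[i - 1]:>3} -> {CHANNELS[i]:>3} channels, {count} parameters")
print(f"total: {total_params}")
```

Under these assumptions the network has about 4.9 million convolution parameters, most of them in the 256-channel layers.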

The learning process of the neural network aims at finding the best-fitting entries of $\{M_{k,i}\}_{k=1}^{N_i}$, as well as $c_{i,k}$. These parameters were optimized with respect to the cost function of Eq. 2, starting from a random initialization and performing 300 iterations of stochastic gradient descent. A schematic drawing of the CNN architecture is shown in Fig. 6.
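The optimization loop can be illustrated on a toy problem. In the sketch below, the CNN is replaced by a per-pixel linear map from RGB to grayscale (a weight vector `w` of length 3), the data are synthetic, and the loss is the mean rather than the sum of squared residuals, which rescales Eq. 2 by a constant. The learning rate and batch size are likewise assumptions; only the random initialization and the 300 stochastic gradient descent iterations come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 200 patch triplets of size 8 x 8.
n, size = 200, 8
y1 = rng.random((n, size, size, 3))
y2 = rng.random((n, size, size, 3))
w_true = np.array([0.5, 0.3, 0.2])   # ground-truth linear map, for the toy only
x = (y1 + y2) @ w_true               # mixed target, shape (n, 8, 8)

def loss_and_grad(w, y1_b, y2_b, x_b):
    """Mean squared residual of x - (f(y1) + f(y2)) for f(y) = y @ w."""
    s = y1_b + y2_b                  # f is linear, so f(y1) + f(y2) = (y1 + y2) @ w
    residual = x_b - s @ w
    loss = np.mean(residual ** 2)
    grad = -2.0 * np.mean(residual[..., None] * s, axis=(0, 1, 2))
    return loss, grad

w = rng.normal(scale=0.1, size=3)    # random initialization
lr, batch = 0.05, 16                 # assumed hyperparameters
initial_loss, _ = loss_and_grad(w, y1, y2, x)

for step in range(300):              # 300 SGD iterations, as in the text
    idx = rng.choice(n, size=batch, replace=False)
    _, grad = loss_and_grad(w, y1[idx], y2[idx], x[idx])
    w -= lr * grad

final_loss, _ = loss_and_grad(w, y1, y2, x)
```

Because training and evaluation use the same data, a decreasing loss here reflects fitting that dataset, which matches the deliberately overfitted setting described earlier.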

As a result of the network’s design, the resolution of the output images was the same as that of the input images. As can be seen in Figs. 3 and 4 (column B), this process gave a seemingly clean reconstruction of $x_1$ and a substantially worse reconstruction of $x_2$. However, even this result improved upon other techniques designed to deal with the same problem (see Fig. 5). To check how faithful the reconstruction was to the mixed x-ray, we measured the MSE of the difference between the original mixed x-ray image and the sum of the two reconstructed separate x-ray images. The reconstruction MSE achieved by this approach was 0.0094 and 0.0053 (for grayscale values ranging between 0 and 1) when applied to details 1 and 2, respectively. The corresponding reconstruction mean absolute errors were 0.0464 and 0.0297.
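The two consistency measures reported above can be computed as in the following sketch. The arrays are small synthetic placeholders, not data from the article, and the function name `reconstruction_errors` is introduced here for illustration.

```python
import numpy as np

def reconstruction_errors(x_mixed, x1_hat, x2_hat):
    """MSE and mean absolute error between the mixed x-ray and the sum of reconstructions."""
    residual = x_mixed - (x1_hat + x2_hat)
    mse = float(np.mean(residual ** 2))
    mae = float(np.mean(np.abs(residual)))
    return mse, mae

# Synthetic 4 x 4 example with grayscale values in [0, 1].
x_mixed = np.full((4, 4), 0.5)
x1_hat = np.full((4, 4), 0.25)
x2_hat = np.full((4, 4), 0.125)

# Every pixel has a residual of 0.125.
mse, mae = reconstruction_errors(x_mixed, x1_hat, x2_hat)
```

Both quantities measure only how well the two reconstructions add up to the mixed image; they do not by themselves indicate how well each individual x-ray was separated.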
