To evaluate the visual appearance of the synthetically generated images, we conducted a survey with five radiologists. Their radiological experience ranged from less than one year to over 30 years, with a median of 3 years.
Using a custom browser-based presentation tool, two image patches were presented next to each other to the participants. We measured the classification accuracy and the decision time for each participant. The participants received only minimal instructions and were not informed about the time measurements. For each experiment we used 40 patches from the in-house data. In the first experiment, 20 synthetic images contained a lesion that had been inserted into a patch of previously normal liver tissue, and the remaining 20 synthetic images were modified such that a real lesion was removed. Patches to be modified were selected manually, i.e., only patches were chosen that either contained a lesion in a position suitable for removal or provided a suitable spot for inserting a lesion. However, once the selection was fixed, the modification was applied without discarding patches that showed any artifacts or abnormalities, aiming for a patch selection as unbiased as possible for the first experiment. The data for the second experiment consisted of 40 randomly selected patches.
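The per-trial measurement can be illustrated with a minimal sketch. The actual study used a custom browser-based tool; the function names, the assumption that a rater gives one response per presented patch pair, and the timing mechanism shown here are illustrative only and not the authors' implementation.

```python
# Hypothetical sketch of per-trial data collection (illustrative only; the
# study used a custom browser-based presentation tool not shown here).
import time

def run_trial(patch_pair, get_answer):
    """Show one pair of patches; return the rater's response and decision time."""
    start = time.perf_counter()           # timing starts when the pair is shown
    response = get_answer(patch_pair)     # rater's classification for this trial
    decision_time = time.perf_counter() - start
    return response, decision_time

def accuracy(responses, truth):
    """Fraction of trials where the rater's response matches the ground truth."""
    return sum(r == t for r, t in zip(responses, truth)) / len(truth)
```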
To assess inter-rater reliability, we computed Fleiss' kappa. We modeled the experiments as binomial processes to determine whether rater decisions were above chance level at a significance level of p < 0.05. Further, by combining the answers via majority voting we obtained an ensemble rater that was also tested against chance level. To assess differences in reaction times we used two-sided t-tests, deliberately without correction for multiple comparisons (Appendix A).
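A minimal sketch of this analysis is given below, assuming binary rater decisions stored per patch and per rater; the variable names, data layout, placeholder data, and the choice of a one-sided alternative for the test against chance are assumptions, not the authors' code.

```python
# Sketch of the described statistics (hypothetical data layout, not the authors' code).
# Assumes `answers` has shape (n_patches, n_raters) with binary decisions,
# `truth` holds ground-truth labels, and `times_real` / `times_synth` hold
# decision times for the two conditions being compared.
import numpy as np
from scipy.stats import binomtest, ttest_ind
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
n_patches, n_raters = 40, 5
answers = rng.integers(0, 2, size=(n_patches, n_raters))   # placeholder data
truth = rng.integers(0, 2, size=n_patches)                  # placeholder data
times_real = rng.normal(5.0, 1.0, size=100)                 # placeholder data
times_synth = rng.normal(5.5, 1.0, size=100)                # placeholder data

# Inter-rater reliability: Fleiss' kappa over per-patch category counts.
table, _ = aggregate_raters(answers)
kappa = fleiss_kappa(table)

# Per-rater binomial test against chance level (p = 0.5), alpha = 0.05.
for r in range(n_raters):
    correct = int((answers[:, r] == truth).sum())
    res = binomtest(correct, n_patches, p=0.5, alternative="greater")
    print(f"rater {r}: {correct}/{n_patches} correct, p = {res.pvalue:.3f}")

# Ensemble rater via majority voting, tested against chance in the same way.
majority = (answers.sum(axis=1) > n_raters / 2).astype(int)
ens_correct = int((majority == truth).sum())
ens_res = binomtest(ens_correct, n_patches, p=0.5, alternative="greater")

# Reaction-time comparison with a two-sided t-test, no multiplicity correction.
t_stat, p_val = ttest_ind(times_real, times_synth)
print(f"kappa = {kappa:.2f}, ensemble p = {ens_res.pvalue:.3f}, t-test p = {p_val:.3f}")
```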