Although multiple human annotations are the reference standard for ground truth generation, they are costly and limit dataset size (19). To perform labeling at scale, we used multiclass natural language processing (NLP) tools to classify local radiology reports into one or more of 14 categories: (a) 12 pathology classes; (b) a support device class; and (c) a “no finding” class, a catch-all category indicating the absence of any clinically relevant abnormality. Because findings on chest radiographs follow a “long-tailed” distribution, with few common findings and many uncommon ones, the “no finding” class was trained by the NLP tool developers to represent 53 additional findings beyond the 12 pathology classes (eg, osteopenia, aortic aneurysm) (20,21). If the “no finding” class was positive, the image was considered free of any abnormality, including the “long tail,” and was labeled normal; if the “no finding” class was negative, an abnormality was considered present, and the image was labeled abnormal.
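The rule above amounts to a simple mapping from the multiclass NLP output to a binary image label. The following is a minimal sketch of that mapping, not the authors' code; it assumes the labeler returns, for each of the 14 classes, a positive, negative, or missing mention, and the class names shown are illustrative.

```python
# Minimal sketch (not the authors' code) of the normal/abnormal rule described above.
# Assumes the NLP labeler returns a mapping from each of the 14 classes to
# 1.0 (positive), 0.0 (negative), or None (not mentioned); names are illustrative.

from typing import Dict, Optional

def to_binary_label(nlp_output: Dict[str, Optional[float]]) -> str:
    """Collapse the 14-class NLP output into 'normal' or 'abnormal'.

    If the catch-all 'No Finding' class is positive, the report (and image)
    is treated as free of any abnormality, including the long tail of
    uncommon findings; otherwise it is treated as abnormal.
    """
    if nlp_output.get("No Finding") == 1.0:
        return "normal"
    return "abnormal"

# Example: a report with a positive pleural effusion mention.
example = {"No Finding": None, "Pleural Effusion": 1.0, "Support Devices": 0.0}
print(to_binary_label(example))  # -> "abnormal"
```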
We validated the performance of two open-source NLP classifiers (the CheXpert [1] and CheXbert [22] NLP labelers) on radiology reports at our institution by comparing their predicted classifications with manual annotations. In contrast to the original studies, which used only the report’s summary or conclusion, our analysis included the entire report because report structure varies across institutions and radiologists. For this analysis, Trillium Health Partners radiology reports were manually classified by two independent investigators (M. Ahluwalia and J.S., both 3rd-year medical students), and conflicts were resolved through consensus. Reports that solely described a lack of interval change, referred only to findings in other scans, or could not be interpreted by one or more NLP or image classifiers were excluded. A total of 502 reports were included, similar in number to the CheXpert image test set of 500 (1).
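The report-level validation reduces to a per-class agreement computation between the NLP labeler output and the manual consensus labels. The snippet below is a hedged sketch of that comparison under stated assumptions: the toy data stand in for the real 502-report tables, the class list is truncated, and uncertain or blank mentions are assumed to have already been mapped to binary values.

```python
# Hedged sketch of the report-level validation: per-class agreement between the
# NLP labeler output and the manual consensus annotations. The toy data below
# stand in for the real 502-report tables; in practice each table would have one
# row per report and one binary (0/1) column per class.

import pandas as pd
from sklearn.metrics import cohen_kappa_score, f1_score

CLASSES = ["No Finding", "Pleural Effusion", "Support Devices"]  # truncated for illustration

manual = pd.DataFrame(      # consensus of the two annotators
    {"No Finding": [1, 0, 0, 1], "Pleural Effusion": [0, 1, 0, 0], "Support Devices": [0, 1, 1, 0]}
)
predicted = pd.DataFrame(   # eg, CheXbert labeler run on the full report text
    {"No Finding": [1, 0, 1, 1], "Pleural Effusion": [0, 1, 0, 0], "Support Devices": [0, 1, 1, 0]}
)

for cls in CLASSES:
    f1 = f1_score(manual[cls], predicted[cls])
    kappa = cohen_kappa_score(manual[cls], predicted[cls])
    print(f"{cls}: F1={f1:.3f}, kappa={kappa:.3f}")
```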
To measure the effect of NLP error on image classification performance, we compared the performance of each image classifier on the 502-image dataset against each type of report label (ie, manual, CheXpert, and CheXbert). Recognizing that radiology reports incorporating information from previous scans may bias NLP algorithm performance, we also determined image classifier performance on chest radiographs that were a patient’s first in the dataset (ie, with no comparison study). This analysis was restricted to emergency department and outpatient chest radiographs.
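Conceptually, this analysis scores the same image-classifier outputs against three different ground truths, overall and on the no-comparison subset. The sketch below illustrates one way to compute this; all column names and toy values are assumptions for illustration rather than the authors’ pipeline.

```python
# Hedged sketch of the label-sensitivity analysis: the same image-classifier
# abnormality scores are evaluated against each source of ground truth, overall
# and on radiographs with no prior study for comparison. Names and values are
# illustrative only.

import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.DataFrame({
    "prob_abnormal":  [0.92, 0.15, 0.67, 0.08, 0.81, 0.33],  # image classifier output
    "manual_label":   [1, 0, 1, 0, 1, 0],                    # 1 = abnormal, 0 = normal
    "chexpert_label": [1, 0, 1, 0, 1, 1],
    "chexbert_label": [1, 0, 0, 0, 1, 0],
    "has_comparison": [True, False, True, False, False, True],
})

for source in ["manual_label", "chexpert_label", "chexbert_label"]:
    auc_all = roc_auc_score(df[source], df["prob_abnormal"])
    first = df[~df["has_comparison"]]  # patient's first radiograph in the dataset
    auc_first = roc_auc_score(first[source], first["prob_abnormal"])
    print(f"{source}: AUC(all)={auc_all:.3f}, AUC(no comparison)={auc_first:.3f}")
```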