We developed tbiExtractor, which extends pyConTextNLP [28] to create a framework for extracting TBI common data elements from radiology reports [3]. tbiExtractor inputs a non-contrast head CT radiology report and outputs a structured summary containing 27 common data elements with their respective annotations. For example, subdural hemorrhage (common data element) is PRESENT (annotation). Code and data files to implement tbiExtractor, along with a Jupyter notebook tutorial, are available at https://github.com/margaretmahan/tbiExtractor.
Based on NegEx [32], a regular expression algorithm for negation detection (e.g., no evidence of intracranial pathology), the ConText [33,34] algorithm captures the contextual features surrounding a clinical condition by relying on trigger terms and termination clues. A more extensible version of the ConText algorithm, pyConTextNLP [28], was implemented in Python and offers added flexibility for user-defined contextual features and indexed events (e.g., specific clinical conditions) [35].
As a lexicon-based method, pyConTextNLP inputs tab-separated files for lexical targets (indexed events) and lexical modifiers (contextual features). It then converts these into itemData, which contains a literal, category, regular expression, and rule (the latter two are optional). The literal, belonging to a category (e.g., ABSENT), is the lexical phrase (e.g., is negative) in the text. The regular expression allows for variant text phrases (e.g., was negative) giving rise to the same literal and is generated from the literal if not provided. Further, the rule provides context to the span of the literal (e.g., backward).
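The four itemData fields described above can be pictured with a small stand-in structure. This is a minimal sketch only: the class name LexicalItem and its pattern method are hypothetical and are not part of the pyConTextNLP API, and deriving the default regular expression by escaping the literal is an assumption about how "generated from the literal" might work.

```python
import re
from dataclasses import dataclass

@dataclass
class LexicalItem:
    """Hypothetical stand-in for one itemData entry (not the pyConTextNLP class)."""
    literal: str      # lexical phrase, e.g. "is negative"
    category: str     # what the literal refers to, e.g. "ABSENT"
    regex: str = ""   # optional; derived from the literal when empty
    rule: str = ""    # optional; e.g. "forward", "backward", "bidirectional"

    def pattern(self):
        # Assumption: with no regular expression, match the literal itself.
        source = self.regex or re.escape(self.literal)
        return re.compile(source, re.IGNORECASE)

# With a regular expression, variant phrases map to the same literal.
item = LexicalItem("is negative", "ABSENT", regex=r"(is|was)\snegative", rule="backward")
# Without one, only the literal phrase is matched.
plain = LexicalItem("is negative", "ABSENT")
```

In this sketch, item matches both "is negative" and "was negative", whereas plain matches only the literal phrase.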
For text data, pyConTextNLP marks the text with lexical modifiers and lexical targets according to their representative itemData. The pyConTextNLP algorithm outputs these markups as a directed graph built with NetworkX [29]. Nodes in the graph represent the concepts (i.e., lexical modifiers and lexical targets) in the text, and edges represent the relationships between the concepts.
The following three subsections will describe the details used for extending pyConTextNLP.
Lexical modifiers were adapted from a pyConTextNLP application to CT pulmonary angiography reports [35]. Modifications in deriving the final lexical modifiers are as follows:
The literal is a lexical phrase (e.g., was not excluded). Literals were added and removed during the training stage.
The category is what the literal refers to (e.g., INDETERMINATE). Each literal was assigned a category before the initialization stage and updated during the training stage. The categories used for this study are PRESENT, SUSPECTED, INDETERMINATE, NOT SPECIFIED, ABSENT, NORMAL, and ABNORMAL. Henceforth, the term "annotation" will be used when referencing the category to maintain consistency between annotators and algorithm vocabulary.
The regular expression is used to find variant text phrases (or patterns) for the same literal (e.g., the regular expression: (was|were)\snot\sexcluded, would find sentences with "was not excluded" and "were not excluded"). Regular expressions were added and updated during the training stage.
The rule dictates the span of the literal (e.g., backward). Each literal was assigned a rule before the initialization stage and updated during the training stage. The rules used for this study are forward, backward, and bidirectional.
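The interplay of the regular expression and the rule can be illustrated as follows. This is a sketch under stated assumptions: the dictionary layout and the function modifier_scope are hypothetical, and the reading of the rules (forward covers text after the match, backward covers text before it, bidirectional covers the whole sentence) is an interpretation, not a description of pyConTextNLP internals.

```python
import re

# One lexical modifier, using the example values given in the text.
MODIFIER = {
    "literal": "was not excluded",
    "category": "INDETERMINATE",
    "regex": r"(was|were)\snot\sexcluded",
    "rule": "backward",
}

def modifier_scope(sentence, modifier):
    """Return the (start, end) character span the modifier may act on, or None."""
    match = re.search(modifier["regex"], sentence, re.IGNORECASE)
    if match is None:
        return None
    if modifier["rule"] == "forward":
        return (match.end(), len(sentence))   # text after the modifier
    if modifier["rule"] == "backward":
        return (0, match.start())             # text before the modifier
    return (0, len(sentence))                 # bidirectional

sentence = "Small subdural hemorrhages were not excluded."
scope = modifier_scope(sentence, MODIFIER)
covered = sentence[scope[0]:scope[1]]  # the text a backward rule reaches
```

Here the variant phrase "were not excluded" is matched by the regular expression, and the backward rule places the preceding mention of subdural hemorrhages within the modifier's reach.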
Lexical targets were adapted from the common data elements in radiologic imaging of TBI [3]. These included pertinent clinical findings in the acute phase of TBI across all severities. By utilizing an array of specific pathologic features (e.g., subarachnoid hemorrhage, subdural hemorrhage, epidural hemorrhage, and intraparenchymal hemorrhage) our framework allows TBI researchers to dynamically categorize subjects and evaluate the significance of pathological patterns and their impact on cerebral tissues. In deriving the lexical targets, the literal represents a clinical condition relevant to TBI on a non-contrast head CT scan (e.g., microhemorrhage) and the category, in this study, is the same (e.g., MICROHEMORRHAGE). The regular expression for each literal (e.g., microhemorrhage(s)?) was added and updated during the training stage.
Two examples (Figs 2 and 3) are provided for a detailed explanation of how lexical modifiers and lexical targets are applied during the algorithm process.
To implement tbiExtractor, each cleaned radiology report was converted to a spaCy [26] container and subsequently partitioned into sentences. Using pyConTextNLP [28], each sentence was marked with lexical modifiers and lexical targets according to their representative itemData. Following the markup, concepts that are a subset of another concept, within the same concept type, are pruned (span pruning). For example, if the text contained the phrase “findings do not appear significantly changed”, the lexical modifier “not” would be pruned and the lexical modifier “do not appear significantly changed” would be retained. Then, for the marked lexical targets, the lexical modifiers are applied. Lexical modifiers that are not linked to a lexical target are dropped (modifier pruning). When multiple lexical modifiers are linked to the same lexical target in the same sentence, the lexical modifier nearest to the target, measured in characters, is chosen (distance pruning). For example, if the text contained the phrase “multifocal subarachnoid hemorrhage as described above most notably in the right sylvian fissure”, the lexical modifier “multifocal” would be selected via distance pruning over the lexical modifier “in the”, since it is closer in character distance to the lexical target, “subarachnoid hemorrhage”. Span and modifier pruning are part of the pyConTextNLP implementation. Distance pruning was added as part of tbiExtractor.
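Span pruning and distance pruning can be sketched on the two example phrases. The helper names (span, prune_spans, gap, nearest_modifier) and the dictionary representation of a concept are hypothetical, and measuring distance as the character gap between spans is an assumption about how "nearest by character length" is computed; the sketch does not reproduce the pyConTextNLP or tbiExtractor source.

```python
def span(text, phrase, kind):
    """Locate the first occurrence of `phrase` and record its character span."""
    start = text.index(phrase)
    return {"text": phrase, "type": kind, "start": start, "end": start + len(phrase)}

def prune_spans(concepts):
    """Span pruning: drop a concept nested inside a longer concept of the same type."""
    kept = []
    for c in concepts:
        nested = any(
            o is not c
            and o["type"] == c["type"]
            and o["start"] <= c["start"] and c["end"] <= o["end"]
            and (o["end"] - o["start"]) > (c["end"] - c["start"])
            for o in concepts
        )
        if not nested:
            kept.append(c)
    return kept

def gap(a, b):
    """Number of characters separating two spans (0 if they touch or overlap)."""
    return max(a["start"] - b["end"], b["start"] - a["end"], 0)

def nearest_modifier(target, modifiers):
    """Distance pruning: keep the modifier closest to the target."""
    return min(modifiers, key=lambda m: gap(m, target))

s1 = "findings do not appear significantly changed"
pruned = prune_spans([
    span(s1, "not", "modifier"),
    span(s1, "do not appear significantly changed", "modifier"),
])

s2 = ("multifocal subarachnoid hemorrhage as described above "
      "most notably in the right sylvian fissure")
target = span(s2, "subarachnoid hemorrhage", "target")
chosen = nearest_modifier(target, [span(s2, "multifocal", "modifier"),
                                   span(s2, "in the", "modifier")])
```

In the first phrase only the longer modifier survives span pruning; in the second, “multifocal” is one character away from the target and is therefore preferred over “in the”.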
At this stage of processing, each sentence in the radiology report will be marked with lexical targets and linked lexical modifiers. There will be one lexical modifier assigned to one lexical target.
A radiology report may contain duplicate lexical targets, if the same target is identified in multiple sentences, or may leave lexical targets unmentioned. To handle both cases, tbiExtractor employs decision rules. First, for each radiology report, omitted lexical targets are added with a default annotation: NORMAL for the gray-white matter differentiation and cistern lexical targets, and ABSENT for the remaining 25 lexical targets (omitted targets). Second, if duplicate lexical targets are identified, the majority vote is selected (duplicate targets). For example, if a lexical target appears in the radiology report three times and the lexical modifiers for two occurrences have an annotation of ABSENT and the other has an annotation of PRESENT, tbiExtractor will choose ABSENT. Similarly, if there are two lexical modifiers with an annotation of PRESENT, two with ABSENT, and one with SUSPECTED, tbiExtractor removes SUSPECTED based on the majority vote. However, the remaining annotations, PRESENT and ABSENT, are tied, so further decision rules are required because no majority exists.
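The omitted-target defaults and the majority vote can be sketched as below. The function names, the target name strings, and the input layout (a dictionary mapping each target to the annotations found for it across sentences) are assumptions made for illustration and are not taken from the tbiExtractor code.

```python
from collections import Counter

# Targets whose default annotation is NORMAL; all others default to ABSENT.
NORMAL_BY_DEFAULT = {"gray-white differentiation", "cistern"}

def default_annotation(target):
    return "NORMAL" if target in NORMAL_BY_DEFAULT else "ABSENT"

def majority_candidates(annotations):
    """Return the annotations tied for the highest count, in first-seen order."""
    counts = Counter(annotations)
    top = max(counts.values())
    return [a for a in counts if counts[a] == top]

def summarize(found, all_targets):
    """Map every target to its surviving candidate annotations."""
    summary = {}
    for target in all_targets:
        mentions = found.get(target, [])
        if mentions:
            summary[target] = majority_candidates(mentions)   # duplicate targets
        else:
            summary[target] = [default_annotation(target)]    # omitted targets
    return summary

found = {
    "subdural hemorrhage": ["ABSENT", "PRESENT", "ABSENT"],
    "contusion": ["PRESENT", "ABSENT", "SUSPECTED", "ABSENT", "PRESENT"],
}
targets = ["subdural hemorrhage", "contusion", "cistern", "midline shift"]
summary = summarize(found, targets)
```

A target left with more than one candidate, such as contusion in this example, is the tied case that the further decision rules must resolve.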
In the case where no majority exists, the first lexical modifier in the ordered annotation list is selected. If the lexical target is extraaxial fluid collection, hemorrhage not otherwise specified (NOS), or intracranial pathology, the ordered annotation list is: ABSENT, INDETERMINATE, SUSPECTED, PRESENT, NORMAL, ABNORMAL. For all other lexical targets, the ordered annotation list is: PRESENT, SUSPECTED, INDETERMINATE, ABSENT, ABNORMAL, NORMAL. Following this, annotations that are not in the set of annotations for that lexical target are replaced with their predetermined counterpart (e.g., if the lexical target cisterns has an annotation of ABSENT, the annotation is replaced with NORMAL). At this stage of processing, each lexical target has one annotation for the entire radiology report.
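The tie-breaking step can be sketched with the two ordered annotation lists. The function names are hypothetical; the COUNTERPARTS mapping contains only the single pair given as an example in the text, since the full set of predetermined counterparts is not listed here; and placing annotations that appear in neither list (such as NOT SPECIFIED) last is an assumption.

```python
ABSENT_FIRST_TARGETS = {
    "extraaxial fluid collection", "hemorrhage (NOS)", "intracranial pathology",
}
ABSENT_FIRST = ["ABSENT", "INDETERMINATE", "SUSPECTED", "PRESENT", "NORMAL", "ABNORMAL"]
PRESENT_FIRST = ["PRESENT", "SUSPECTED", "INDETERMINATE", "ABSENT", "ABNORMAL", "NORMAL"]

# Only the example pair from the text; the complete mapping is not given.
COUNTERPARTS = {("cistern", "ABSENT"): "NORMAL"}

def break_tie(target, candidates):
    """Select the candidate that comes first in the target's ordered list."""
    order = ABSENT_FIRST if target in ABSENT_FIRST_TARGETS else PRESENT_FIRST
    # Assumption: annotations missing from the list rank after all listed ones.
    rank = lambda a: order.index(a) if a in order else len(order)
    return min(candidates, key=rank)

def finalize(target, candidates):
    """Resolve ties, then swap an out-of-set annotation for its counterpart."""
    chosen = break_tie(target, candidates)
    return COUNTERPARTS.get((target, chosen), chosen)

tied_specific = finalize("subdural hemorrhage", ["ABSENT", "PRESENT"])
tied_general = finalize("hemorrhage (NOS)", ["PRESENT", "ABSENT"])
replaced = finalize("cistern", ["ABSENT"])
```

Under these lists, the same PRESENT/ABSENT tie resolves to PRESENT for a specific finding such as subdural hemorrhage and to ABSENT for hemorrhage (NOS).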
The annotations for three lexical targets can be altered based on the annotations of other lexical targets in the same radiology report. Thus, a second set of derived decision rules is applied by tbiExtractor (derived targets). First, if epidural hemorrhage, subdural hemorrhage, or subarachnoid hemorrhage is PRESENT or SUSPECTED, hemorrhage (NOS) is annotated ABSENT. Second, if epidural hemorrhage, subdural hemorrhage, or subarachnoid hemorrhage is PRESENT, extraaxial fluid collection is annotated PRESENT. If these lexical targets were annotated SUSPECTED, and extraaxial fluid collection was annotated ABSENT by default, then extraaxial fluid collection is annotated SUSPECTED. Third, if gray-white differentiation, cistern, hydrocephalus, pneumocephalus, extraaxial fluid collection, midline shift, mass effect, diffuse axonal injury, anoxic, herniation, aneurysm, contusion, brain swelling, ischemia, hemorrhage (NOS), or intraventricular hemorrhage is annotated PRESENT, SUSPECTED, or ABNORMAL, then intracranial pathology is annotated PRESENT.
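The derived decision rules can be sketched as a single pass over the report-level summary. The function name, the defaulted argument (the set of targets that received their annotation by default), and the target name strings are hypothetical. Two readings are assumptions: that the SUSPECTED rule fires when at least one of the three hemorrhage targets is SUSPECTED and none is PRESENT, and that the rules run in the order stated, so a derived extraaxial fluid collection annotation can in turn trigger intracranial pathology.

```python
EXTRAAXIAL_BLEEDS = ("epidural hemorrhage", "subdural hemorrhage",
                     "subarachnoid hemorrhage")
# Trigger targets as listed in the text.
PATHOLOGY_TRIGGERS = (
    "gray-white differentiation", "cistern", "hydrocephalus", "pneumocephalus",
    "extraaxial fluid collection", "midline shift", "mass effect",
    "diffuse axonal injury", "anoxic", "herniation", "aneurysm", "contusion",
    "brain swelling", "ischemia", "hemorrhage (NOS)", "intraventricular hemorrhage",
)

def apply_derived_rules(summary, defaulted=frozenset()):
    """Return a copy of `summary` with the three derived targets updated."""
    out = dict(summary)
    bleeds = [out.get(t) for t in EXTRAAXIAL_BLEEDS]

    # Rule 1: a specific extraaxial bleed overrides hemorrhage (NOS).
    if any(a in ("PRESENT", "SUSPECTED") for a in bleeds):
        out["hemorrhage (NOS)"] = "ABSENT"

    # Rule 2: extraaxial fluid collection follows the specific bleeds.
    fluid = "extraaxial fluid collection"
    if "PRESENT" in bleeds:
        out[fluid] = "PRESENT"
    elif "SUSPECTED" in bleeds and fluid in defaulted and out.get(fluid) == "ABSENT":
        out[fluid] = "SUSPECTED"

    # Rule 3: any positive trigger marks intracranial pathology PRESENT.
    if any(out.get(t) in ("PRESENT", "SUSPECTED", "ABNORMAL")
           for t in PATHOLOGY_TRIGGERS):
        out["intracranial pathology"] = "PRESENT"
    return out

summary = {
    "subdural hemorrhage": "SUSPECTED",
    "extraaxial fluid collection": "ABSENT",
    "hemorrhage (NOS)": "PRESENT",
    "intracranial pathology": "ABSENT",
}
derived = apply_derived_rules(summary, defaulted={"extraaxial fluid collection"})
```

In this example the suspected subdural hemorrhage clears hemorrhage (NOS), upgrades the defaulted extraaxial fluid collection to SUSPECTED, and thereby marks intracranial pathology as PRESENT.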
Omitted, duplicate, and derived targets were implemented as part of tbiExtractor. At the end of the above processing steps, each radiology report will have a list of 27 lexical targets, each with one annotation, which constitutes the structured summary output.