A rule-based NLP algorithm was developed and iteratively improved using the training dataset. The algorithm was implemented on an NLP system developed internally by KPSC and based on NLTK [18], pyConText/NegEx [19], and Stanford NLP [20]. The final NLP program was executed locally at each participating site. Results without protected health information were sent back to KPSC for analysis.
First, the clinical notes were pre-processed through section detection, sentence separation, and tokenization (i.e., segmenting text into linguistic units such as words and punctuation). Second, keywords were compiled based on published case definitions and ontologies [21] and enriched with the training data to capture additional linguistic variations such as abbreviations and misspellings (Appendix B). Third, pattern matching with these compiled terms was used to identify vaccination, S/S of local reaction, site(s) of vaccination and reaction, and cause. Negated terms and pre-existing conditions were identified and excluded. The site of reaction was compared with the vaccination site coded in the structured data, and the reaction was excluded if the sites did not match. S/S with causes other than Tdap, or clearly stated to be unrelated to vaccination, were excluded. To identify a possible relationship between the outcome (local reaction) and the cause (e.g., Tdap), spatial information (e.g., specific body location) and temporal information (e.g., onset time) were also captured. The identified evidence (Table 2) was combined and assigned an output level between 1 and 8, with smaller values indicating a higher probability of being a true positive case (Table 3). A sample clinical note with NLP-identified concepts and relationships is provided in Appendix C.
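The matching and negation steps described above can be illustrated with a minimal sketch. It is not the KPSC system: it uses only Python's standard `re` module rather than NLTK or pyConText, and the term lists, the three-token negation window, and the function name `find_symptoms` are hypothetical choices made for illustration (the study's actual keywords are listed in its Appendix B).

```python
import re

# Hypothetical keyword lists for illustration only; the study's actual
# terms are given in its Appendix B.
SYMPTOM_TERMS = ["redness", "swelling", "pain", "erythema"]
NEGATION_TRIGGERS = ["no", "denies", "without", "not"]

# Naive sentence splitter: break on whitespace that follows ., ! or ?
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
# Tokens are runs of word characters or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")


def find_symptoms(note, window=3):
    """Return (symptom, sentence) pairs for non-negated symptom mentions.

    A mention counts as negated if a negation trigger appears among the
    `window` tokens that precede it in the same sentence (an assumed,
    simplified stand-in for NegEx-style rules).
    """
    hits = []
    for sentence in SENTENCE_SPLIT.split(note.strip()):
        tokens = [t.lower() for t in TOKEN.findall(sentence)]
        for i, tok in enumerate(tokens):
            if tok not in SYMPTOM_TERMS:
                continue
            preceding = tokens[max(0, i - window):i]
            if any(p in NEGATION_TRIGGERS for p in preceding):
                continue  # negated mention, excluded
            hits.append((tok, sentence))
    return hits
```

Called on a made-up note such as "Reports redness and swelling at the injection site. Denies pain.", the function keeps the first two mentions and drops the negated one. A production system would also need section detection, pre-existing condition handling, and site matching, which this sketch omits.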
Table 2. Types of evidence identified by NLP from each clinical note.
Table 3. Levels of output from the NLP system for combination of evidence types.
For the Day 0 search, only output Levels 1–5 were treated as positive to increase specificity. The NLP algorithm was further modified to determine the temporal relationship between vaccination and S/S occurring on the same day. Because the vaccination data contained only the date, not the time, of vaccination [22], the timestamps of the notes were used to determine the sequence of events. Tdap was routinely administered at the end of a clinical encounter, which typically lasted 15–30 minutes. Therefore, we grouped notes within a 30-minute window and treated them as a single encounter. Vaccine-related local reactions were often documented in a follow-up encounter. Based on chart review, the vaccination had likely not yet been given when the notes of the first encounter on Day 0 were written, so S/S identified by NLP in that encounter were classified as pre-vaccination symptoms. S/S identified in later encounters were classified as post-vaccination symptoms.
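The encounter grouping can be sketched as follows. The text does not say how the 30-minute window was anchored, so this sketch assumes it is measured from the first note of each encounter. The function names are hypothetical, and the sketch is one plausible reading rather than the study's implementation.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)


def group_into_encounters(timestamps, window=WINDOW):
    """Group note timestamps into encounters.

    Assumption: a note joins the current encounter if it falls within
    `window` of that encounter's first note; otherwise it starts a new
    encounter.
    """
    encounters = []
    for ts in sorted(timestamps):
        if encounters and ts - encounters[-1][0] <= window:
            encounters[-1].append(ts)
        else:
            encounters.append([ts])
    return encounters


def label_day0_notes(timestamps):
    """Label each Day 0 note as pre- or post-vaccination.

    Notes in the first encounter are labeled 'pre-vaccination'; notes in
    any later encounter are labeled 'post-vaccination'.
    """
    labels = {}
    for index, encounter in enumerate(group_into_encounters(timestamps)):
        label = "pre-vaccination" if index == 0 else "post-vaccination"
        for ts in encounter:
            labels[ts] = label
    return labels
```

With made-up note times of 09:00, 09:20, and 14:05 on the same day, the first two notes form one encounter labeled pre-vaccination and the third is labeled post-vaccination. Anchoring the window to the previous note instead of the first would chain long runs of closely spaced notes into one encounter, so the choice matters for busy visits.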
For the broad search, which was not restricted by diagnosis codes, the S/S identified in Level 8 were often not caused by the vaccine. Therefore, only output Levels 1–7 were considered true positive cases.
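The two cut-offs stated in the text (Levels 1–5 for the Day 0 search, Levels 1–7 for the broad search) amount to a per-search threshold on the output level. A minimal sketch follows, with a hypothetical function name and hypothetical search labels:

```python
# Highest output level still counted as positive for each search type,
# taken from the cut-offs stated in the text. The keys "day0" and
# "broad" are hypothetical labels chosen for this sketch.
MAX_POSITIVE_LEVEL = {"day0": 5, "broad": 7}


def is_positive(level, search):
    """Return True if an NLP output level counts as positive for a search."""
    if not 1 <= level <= 8:
        raise ValueError("output level must be between 1 and 8")
    return level <= MAX_POSITIVE_LEVEL[search]
```

For example, a Level 6 result would count as positive in the broad search but not in the Day 0 search.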