As part of the national cancer surveillance mandate, the SEER cancer registries collect data on patient demographics, primary tumor site, tumor morphology, stage at diagnosis, and first course of treatment. Tumor site and morphology are captured in the form of six key data elements—site, subsite, laterality, histology, behavior, and grade. These data elements are considered essential for SEER to provide an annual report on cancer incidence.
Our full dataset consists of 546,806 cancer pathology reports obtained from the Louisiana and Kentucky SEER cancer registries. Data were used under a protocol approved by the Department of Energy Central IRB. For our study, we use the original pathology reports, which did not go through de-identification; this study qualified for a waiver of subject consent according to 10 CFR 745.117(c).
Our dataset covers cancer cases of all types from Louisiana residents spanning the years 2004-2018 and Kentucky residents spanning the years 2009-2018. Each pathology report is associated with a unique tumor ID that indicates the specific patient and tumor for the report; each tumor ID may be associated with one or more pathology reports. For example, a patient may have an initial test to check for cancer at a particular site, secondary tests of neighboring organs to see if the cancer has spread, and a follow-up test to monitor how the cancer has developed.
Each unique tumor ID is tagged with aggregate ground truth labels for six key data elements—site, subsite, laterality, histology, behavior, and grade. These ground truth labels were manually annotated by a human expert with access to all data relevant to each tumor ID; this includes radiology reports and other clinical notes not available in our dataset. The SEER cancer registries require that each individual cancer pathology report be labeled with the aggregate tags belonging to its associated tumor ID. Therefore, all pathology reports associated with the same tumor ID have the same labels. Each pathology report is labeled with one of 70 possible sites, 314 possible subsites, 7 possible lateralities, 4 possible behaviors, 547 possible histologies, and 9 possible grades; a detailed breakdown of the number of instances per label is available in S1 Fig of our supporting information. A notable challenge captured by our dataset is that automated classifiers must assign the correct aggregate-level labels to every report in a tumor ID sequence, even though some reports are addenda that may not contain the information needed for all six data elements.
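The snippet below is a minimal sketch of how tumor-level ground truth labels can be propagated to every report in a tumor ID sequence. The column names (tumor_id, report_text, site, histology) and the example values are hypothetical placeholders for illustration, not the registries' actual field names or codes.

```python
import pandas as pd

# One row per pathology report; ground truth labels live in a separate tumor-level table.
reports = pd.DataFrame({
    "tumor_id": ["T1", "T1", "T2"],
    "report_text": ["initial biopsy ...", "addendum ...", "resection ..."],
})
tumor_labels = pd.DataFrame({
    "tumor_id": ["T1", "T2"],
    "site": ["C50", "C34"],
    "histology": ["8500", "8140"],
    # subsite, laterality, behavior, and grade would follow the same pattern
})

# Every report inherits the aggregate labels of its tumor ID, so all reports sharing
# a tumor ID carry identical labels, including addenda that may not mention every
# data element themselves.
labeled_reports = reports.merge(tumor_labels, on="tumor_id", how="left")
print(labeled_reports)
```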
A large number of cancer pathology reports in our dataset are associated with tumor IDs that have only a single pathology report; in other words, these pathology reports do not have any case-level context because there is only a single report in the sequence. Because these reports do not require case-level context for analysis, they are filtered out of our dataset. After filtering, our dataset consists of 431,433 pathology reports and 135,436 unique tumor IDs; on average, each tumor ID is associated with 3.2 pathology reports. A more detailed histogram of the number of reports per tumor ID is available in S2 Fig of our supporting information.
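A minimal sketch of the single-report filter described above follows; the DataFrame layout and column names are hypothetical and only illustrate the grouping logic.

```python
import pandas as pd

# One row per pathology report, keyed by tumor ID (hypothetical column names).
reports = pd.DataFrame({
    "tumor_id": ["T1", "T1", "T2", "T3"],
    "report_text": ["biopsy ...", "addendum ...", "single report ...", "resection ..."],
})

# Keep only tumor IDs with at least two associated reports, since single-report
# tumor IDs carry no case-level context.
counts = reports.groupby("tumor_id")["report_text"].transform("size")
filtered = reports[counts >= 2].copy()
print(f"{filtered['tumor_id'].nunique()} tumor IDs, {len(filtered)} reports retained")
```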
To simulate a production setting in which a model trained on older, existing reports must make predictions on new incoming data, we split our dataset into train, validation, and test sets based on report date. We first group pathology reports by tumor ID. If any tumor ID is associated with a report dated 2016 or later, all reports from that tumor ID are placed in our test set. On the remaining reports, we use 80:20 random splitting to create our train and validation sets, ensuring that all reports from the same tumor ID are placed in either the train set or the validation set, never split between the two. This yields a train set of 258,361 reports, a validation set of 64,906 reports, and a test set of 108,166 reports. Due to the long training time associated with deep learning models, cross validation is not used.
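The following is a minimal sketch of this split strategy. The 2016 cutoff and the 80:20 ratio follow the text; the column names (tumor_id, report_date), the random seed, and the use of scikit-learn's GroupShuffleSplit are assumptions for illustration rather than the authors' exact implementation.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_date_and_tumor(df, cutoff="2016-01-01", seed=0):
    """Date-based test split plus grouped 80:20 train/validation split."""
    df = df.copy()
    df["report_date"] = pd.to_datetime(df["report_date"])

    # Any tumor ID with at least one report dated on/after the cutoff goes
    # entirely to the test set.
    test_ids = df.loc[df["report_date"] >= cutoff, "tumor_id"].unique()
    test = df[df["tumor_id"].isin(test_ids)]
    remaining = df[~df["tumor_id"].isin(test_ids)]

    # 80:20 split on the remaining reports that keeps all reports of a
    # tumor ID together in either train or validation.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, val_idx = next(splitter.split(remaining, groups=remaining["tumor_id"]))
    return remaining.iloc[train_idx], remaining.iloc[val_idx], test
```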
We apply standard text preprocessing techniques, including lowercasing all text, replacing hexadecimal and Unicode character sequences, and replacing unique words appearing fewer than five times across the entire corpus with an “unknown_word” token. A more detailed description of our text cleaning process is available in our supporting information.
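Below is a minimal sketch of two of the cleaning steps named above: lowercasing and replacing rare words with an “unknown_word” token. The whitespace tokenization is a simplification, and the handling of hexadecimal and Unicode sequences is omitted; this is not the authors' exact cleaning pipeline.

```python
from collections import Counter

def preprocess(corpus, min_count=5):
    """Lowercase each document and replace words seen fewer than min_count times."""
    tokenized = [doc.lower().split() for doc in corpus]
    counts = Counter(tok for doc in tokenized for tok in doc)
    return [
        [tok if counts[tok] >= min_count else "unknown_word" for tok in doc]
        for doc in tokenized
    ]
```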