Preprocessing of SEER breast cancer data

Marcel da Câmara Ribeiro-Dantas
Honghao Li
Vincent Cabeli
Louise Dupuis
Franck Simon
Liza Hettal
Anne-Sophie Hamy
Hervé Isambert

SEER contains 407,791 breast cancer records for the period 2010-2016 but only 396,179 distinct patients, because some patients have multiple primary breast tumors. For each patient, we selected the first primary breast tumor recorded in SEER and stored the total number of breast cancer primaries during the 2010-2016 period in the variable MoreThanOneBCPrimary. SynchroBilateral was engineered to identify patients who had tumors in both breasts diagnosed less than 180 days apart, while Contralateral identifies patients who had a subsequent tumor in the other breast diagnosed more than 180 days after the first primary breast tumor.

Some categorical variables (Ethnicity, TypeSurgeryPrimitiveSite, Surgery, OtherSurgery, OtherMetastasisAtDiagnosis, Insurance and Histology) had some of their categories merged, either because the categories had the same general meaning or because they were too rare, i.e. recorded in fewer than 0.1% of patients, excluding those with missing data for the variable considered. These rare categories were merged and renamed ‘Other’.

BreastReconstruction was engineered from TypeSurgeryPrimitiveSite: SEER surgery codes in the ranges 43-49, 53-59, 63-69, and 73-75 were assigned ‘Yes’, while all other surgery codes were assigned ‘No’. Radiotherapy was created from Radiation sequence with surgery, which has far less missing data (0.05%) than the original Radiation variable (49%). TumorSize merges two distinct variables that contained tumor sizes for the years 2004-2015 and 2016+, respectively. Likewise, the largely missing 2016 information for the MetastasisAtDiagnosis variable was recovered from the site-specific metastasis variables (i.e. BoneMetastasisAtDiagnosis, LungMetastasisAtDiagnosis, LiverMetastasisAtDiagnosis, BrainMetastasisAtDiagnosis, OtherMetastasisAtDiagnosis).
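As an illustration only, the first-primary selection, rare-category merging, and BreastReconstruction rules described above could be sketched in Python as follows. This is not the script provided as Data S1. The record layout (a list of dicts), the keys `patient_id`, `diagnosis_date` and `n_primaries`, the function names, and the choice to pass missing values through unchanged are all assumptions made for this sketch.

```python
from collections import Counter

# SEER surgery code ranges that the text maps to BreastReconstruction = 'Yes'.
RECONSTRUCTION_RANGES = [(43, 49), (53, 59), (63, 69), (73, 75)]


def first_primary_per_patient(records):
    """Keep each patient's earliest record and attach their number of primaries.

    `records` is a list of dicts with the hypothetical keys 'patient_id' and
    'diagnosis_date' (any sortable value, e.g. an ISO date string).
    """
    counts = Counter(r["patient_id"] for r in records)
    first = {}
    for r in sorted(records, key=lambda r: r["diagnosis_date"]):
        # setdefault keeps only the first (earliest) record seen per patient.
        first.setdefault(r["patient_id"], r)
    return [dict(r, n_primaries=counts[pid]) for pid, r in first.items()]


def merge_rare_categories(values, threshold=0.001):
    """Relabel categories below `threshold` of non-missing values as 'Other'.

    Missing values are represented by None; they are excluded from the
    denominator and left untouched in the output.
    """
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    rare = {c for c, n in counts.items() if n / len(observed) < threshold}
    return ["Other" if v in rare else v for v in values]


def breast_reconstruction(surgery_code):
    """Return 'Yes' for codes inside the reconstruction ranges, else 'No'.

    Returning None for a missing code is an assumption of this sketch.
    """
    if surgery_code is None:
        return None
    in_range = any(low <= surgery_code <= high
                   for low, high in RECONSTRUCTION_RANGES)
    return "Yes" if in_range else "No"
```

Computing the rare-category frequencies over non-missing values only mirrors the "excluding those with missing data" condition in the text. A variable with many missing entries therefore does not have its categories pushed below the 0.1% threshold artificially.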
MedianFamIncome and MedianHouseHoldIncome are each the average of the corresponding continuous variable over the periods 2007-2011, 2008-2012, 2009-2013, 2010-2014, 2011-2015, and 2012-2016. The script implementing these preprocessing steps is provided as Data S1.
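The income averaging can be illustrated with a minimal Python sketch. It is not taken from Data S1. The function name, the dict-based input keyed by period label, and the decision to skip periods with no value are assumptions, since the text does not state how missing periods were handled.

```python
from statistics import mean

# The six overlapping five-year periods listed in the text.
PERIODS = ["2007-2011", "2008-2012", "2009-2013",
           "2010-2014", "2011-2015", "2012-2016"]


def average_income(income_by_period):
    """Average an income variable over the six periods.

    `income_by_period` maps a period label to a value, or to None when the
    value is missing. Skipping missing periods is an assumption of this sketch.
    """
    values = [income_by_period.get(p) for p in PERIODS]
    values = [v for v in values if v is not None]
    return mean(values) if values else None
```

The same function would be applied once to the median family income values and once to the median household income values to obtain MedianFamIncome and MedianHouseHoldIncome.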
