Figure 2 shows that all missing values in the data occur in two variables: Age and Sex.
Variable outlier detection chart. Only the variables Sex and Age have missing values.
The missing values in Sex and Age occur together: each variable has 837,877 missing entries.
Records containing missing values account for less than 1% of the total. The missing data appear only in Sex and Age, and they appear in pairs. We investigated the cause by tracing the records back to their state before data desensitization and found that Sex and Age are missing for passengers who purchased tickets using passports and other identification certificates. Because we cannot accurately estimate Sex and Age for these passengers, imputing the values is not appropriate and may bias the results. We therefore deleted the incomplete records directly.
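The check-and-delete step described above can be sketched in a few lines. This is a minimal illustration on toy data, not the code used in this protocol: the record layout, the `id` field, the use of `None` for a missing value, and the function name `summarize_and_drop` are all assumptions made for the example.

```python
# Toy records; None marks a missing value. The layout and the "id" field
# are hypothetical and stand in for the real passenger table.
records = [
    {"id": 1, "Sex": "F", "Age": 34},
    {"id": 2, "Sex": None, "Age": None},
    {"id": 3, "Sex": "M", "Age": 51},
    {"id": 4, "Sex": "M", "Age": 27},
]


def summarize_and_drop(rows, cols=("Sex", "Age")):
    """Count missing values per column, check that they are paired,
    and return only the complete rows (listwise deletion)."""
    # Number of missing entries in each column.
    n_missing = {c: sum(r[c] is None for r in rows) for c in cols}
    # Missingness is "paired" if, in every row, the columns are either
    # all missing or all present.
    paired = all(len({r[c] is None for c in cols}) == 1 for r in rows)
    # Keep only rows with no missing value in any of the columns.
    complete = [r for r in rows if all(r[c] is not None for c in cols)]
    # Fraction of rows removed; guard against an empty input.
    share_dropped = (len(rows) - len(complete)) / len(rows) if rows else 0.0
    return n_missing, paired, share_dropped, complete


n_missing, paired, share_dropped, complete = summarize_and_drop(records)
```

On the real data, one would expect `paired` to be true and `share_dropped` to be below 0.01 before deleting the incomplete records; if either check failed, deletion would deserve a second look.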