Our overall computing framework is Spark. The bus ticketing data contain ten variables. Taking Shandong Province as an example, the original data are shown in Table 1.

Taking Shandong Province as an example to display the dataset; only the first 10 rows are shown.

Table 2 shows the meanings of the ten variables:

Variable description. The dataset includes 10 variables.

SeatType is almost irrelevant to the problem we want to investigate, so we delete this variable. We also delete the station IDs and keep the station names: our data come from multiple bus systems, and because the coding standards of different systems are not uniform, the station name serves as the station's identity.
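The column-dropping step above can be sketched as follows. This is a minimal illustration in Python/pandas (the authors' pipeline runs on Spark); the ID column names `StartStationID` and `ReachStationID` are hypothetical, since the article does not name them.

```python
import pandas as pd

# Toy records mimicking the ticketing data; all values are illustrative,
# and the *ID column names are assumptions, not from the article.
df = pd.DataFrame({
    "SeatType": ["A", "B"],
    "StartStationID": [1001, 2002],        # hypothetical ID column
    "StartStationName": ["Jinan Station", "Qingdao Station"],
    "ReachStationID": [3003, 4004],        # hypothetical ID column
    "ReachStationName": ["Yantai Station", "Weihai Station"],
})

# Drop SeatType and the system-specific station IDs; keep the station
# names, which serve as a uniform station identity across bus systems.
df = df.drop(columns=["SeatType", "StartStationID", "ReachStationID"])
print(list(df.columns))
```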

It can be seen from Figure 2 that the missing values in the data all occur in two variables: Age and Sex.

Variable outlier detection chart. Only the variables Sex and Age have missing values.

The variables Sex and Age have missing values, and their missingness co-occurs: each has exactly 837,877 missing entries, always in the same records.

Records containing missing values account for less than 1% of the total. The missing data appear only in Sex and Age, and always in pairs. We carefully investigated the cause, tracing back to before data desensitization: the missing Sex and Age of some passengers result from tickets purchased with passports and other identification documents. Since we cannot accurately estimate these passengers' Sex and Age, imputation is inappropriate and could bias the results. Therefore, we delete the incomplete records directly.
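The paired-missingness check and the deletion step can be sketched as below, again in Python/pandas as a stand-in for the authors' actual pipeline; the toy values are illustrative.

```python
import pandas as pd

# Toy data: Sex and Age are missing together in the same records.
df = pd.DataFrame({
    "Sex": ["M", None, "F", None],
    "Age": [34, None, 28, None],
    "StartStationName": ["A", "B", "C", "D"],
})

# Verify that Sex and Age are missing in exactly the same rows
# (paired missingness, as observed in the article).
paired = bool((df["Sex"].isna() == df["Age"].isna()).all())

# Imputation would be unreliable here, so drop incomplete records.
clean = df.dropna(subset=["Sex", "Age"])
```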

Use StartStationName and ReachStationName to match provinces and cities.

StartStationName corresponds to the outflow, and ReachStationName corresponds to the inflow. We find that outflow and inflow are roughly balanced in the overall dataset, so we use StartStationName to match location information and take the outflow as the main research object.
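The outflow/inflow comparison can be sketched as a pair of grouped counts. A minimal Python/pandas illustration with toy station names follows; the authors perform the equivalent on the full dataset.

```python
import pandas as pd

# Toy trips: each row is one ticket from a start station to a
# destination station (names are illustrative).
df = pd.DataFrame({
    "StartStationName": ["Jinan", "Jinan", "Qingdao"],
    "ReachStationName": ["Qingdao", "Qingdao", "Jinan"],
})

# Outflow = count of departures per station; inflow = count of arrivals.
outflow = df["StartStationName"].value_counts().rename("outflow")
inflow = df["ReachStationName"].value_counts().rename("inflow")

# Side-by-side table to check how balanced the two flows are.
flows = pd.concat([outflow, inflow], axis=1).fillna(0).astype(int)
```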

We use string extraction to get the city from the station name.

Note this feature: most records contain a specific city and province. Only a few records contain no clear location information (for example, "South Station"), and we delete them directly.
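A city-extraction step of this kind might look like the sketch below. The article does not give the exact matching rule, so this is an assumption-laden illustration: station names are assumed to follow a "<City> Station" pattern, and `KNOWN_CITIES` is a hypothetical (here tiny) list standing in for a full list of cities.

```python
import re
import pandas as pd

# Hypothetical city list; a real run would use a complete list of
# Chinese prefecture-level cities.
KNOWN_CITIES = {"Jinan", "Qingdao", "Yantai"}

def extract_city(station_name):
    """Return the city extracted from a station name, or None when the
    name carries no clear location (e.g. 'South Station')."""
    m = re.match(r"^(\w+)\s+Station$", station_name)
    if m and m.group(1) in KNOWN_CITIES:
        return m.group(1)
    return None

names = pd.Series(["Jinan Station", "South Station", "Qingdao Station"])
cities = names.map(extract_city)
# Records without clear location information are dropped directly.
kept = cities.dropna()
```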

Then, in order to obtain accurate location information, we use the Baidu Map API to get the latitude and longitude of each station.

We use the R package "baidumap" to connect to the Baidu Map API, obtaining a key through the Baidu Map open platform, and extract the latitude and longitude of each region. Both the data processing and the API calls that retrieve regional coordinates are implemented with parallel computing, using packages such as "foreach", "parallel", and "future.apply".
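The parallel geocoding pattern can be sketched as follows. The authors use R ("baidumap" with "foreach"/"parallel"/"future.apply"); this Python sketch uses a thread pool instead, and a placeholder lookup table stands in for the real Baidu Map API call (the coordinates shown are illustrative, not real data).

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder standing in for a real geocoding service; the actual
# pipeline queries the Baidu Map API with an API key.
FAKE_GEOCODER = {
    "Jinan": (36.65, 117.12),     # illustrative coordinates
    "Qingdao": (36.07, 120.38),   # illustrative coordinates
}

def geocode(city):
    # A real implementation would issue an HTTP request here.
    return city, FAKE_GEOCODER.get(city)

cities = ["Jinan", "Qingdao"]

# I/O-bound API calls parallelize naturally across a thread pool,
# mirroring the parallel retrieval described in the article.
with ThreadPoolExecutor(max_workers=4) as pool:
    coords = dict(pool.map(geocode, cities))
```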

Our location matching work can be clearly shown in Figure 3:

Location matching process. The main steps are the extraction of cities and the acquisition of latitude and longitude.


