Data pre-processing and feature extraction

Wonju Seo, You-Bin Lee, Seunghyun Lee, Sang-Man Jin, Sung-Min Park

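Each CGM time series was represented as a sequence; the $i$th CGM time series is given by:

$$\mathrm{CGM}_{i,:} = \{\mathrm{CGM}_{i,t=1}, \mathrm{CGM}_{i,t=2}, \ldots, \mathrm{CGM}_{i,t=N_i}\} \tag{1}$$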

where $N_i$ is the length of $\mathrm{CGM}_{i,:}$. For each time series, missing CGM data points were interpolated by the spline method [20] only if fewer than three consecutive CGM data points were missing. Missing CGM data points were reported when the device failed its calibration process [32]. The CGM measurement is taken every 5 min, so $\mathrm{CGM}_{i,t=n}$ denotes the CGM data point at the $(5 \times n)$th minute of the $i$th CGM time series.

In our study, we took CGM data points after meal announcements, and each such CGM data point is represented as in Eq. 2:

$$\mathrm{CGM}_{i,j,t} = \mathrm{CGM}_{i,\,\mathrm{meal}_{i,j}+t}, \quad t \in \{1, \ldots, W\} \tag{2}$$

where $\mathrm{meal}_{i,j}$ is the time of the $j$th meal announcement of the $i$th CGM time series, and $W$ is the postprandial period.
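The gap-filling rule and the postprandial indexing of Eq. 2 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, missing points are assumed to be stored as `None`, the meal time is assumed to be given as a sample index, and linear interpolation is used as a plain stand-in for the spline method cited in the text.

```python
from typing import List, Optional, Sequence

MAX_GAP = 2  # "fewer than three" consecutive missing points may be filled


def fill_short_gaps(cgm: Sequence[Optional[float]],
                    max_gap: int = MAX_GAP) -> List[Optional[float]]:
    """Fill runs of at most `max_gap` missing points (None).

    Linear interpolation is an assumption made for brevity; the study
    used a spline. Longer runs, and runs touching either end of the
    series, are left as None.
    """
    filled = list(cgm)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < n and filled[i] is None:
            i += 1
        end = i  # index of the first valid point after the run, or n
        run = end - start
        if run <= max_gap and start > 0 and end < n:
            left, right = filled[start - 1], filled[end]
            for k in range(start, end):
                frac = (k - start + 1) / (run + 1)
                filled[k] = left + frac * (right - left)
    return filled


def postprandial_window(cgm: Sequence[Optional[float]],
                        meal_index: int,
                        w: int) -> Optional[List[Optional[float]]]:
    """Return [CGM_{i,j,0}, ..., CGM_{i,j,W}] for one meal announcement.

    Index 0 is the sample at the meal announcement itself; index t maps
    to position meal_index + t of the full series, as in Eq. 2.
    Returns None when the series ends before the window is complete.
    """
    end = meal_index + w
    if meal_index < 0 or end >= len(cgm):
        return None
    return list(cgm[meal_index:end + 1])
```

For example, `fill_short_gaps([5.0, None, None, 8.0])` fills both missing points, whereas a run of three `None` values is left untouched.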

We first analyzed the CGM trends of all selected patients’ data to identify meaningful features for postprandial hypoglycemia. A subset of patients experienced postprandial hypoglycemia when they had a small peak or no peak in CGM, probably because the meal was small or contained only a small portion of carbohydrate (Fig. 1a and b). Another group of patients experienced hypoglycemia when the CGM increased steeply and then dropped right after the peak; this reaction probably occurred when the patients ingested carbohydrates with a high glycemic index or when the pre-meal rapid-acting insulin was injected too late (Fig. 1c). Insulin injected before a preceding meal can also affect the glucose level after the current meal. In such cases, a decrease in CGM in spite of meal ingestion may have been caused by the insulin on board and was associated with future hypoglycemic episodes (Fig. 1d).

Fig. 1. Representative CGM time-series data showing different reactions of selected patients’ glucose levels after meals. Blue line: CGM time-series data points; red line and transparent red box: CGM data point <3.9 mmol/L (70 mg/dL); magenta filled circle: CGM data point at the meal; red filled circle: peak CGM data point after the meal; green filled circle: CGM data point at the time of prediction. Clinical explanations: (a) No peak in CGM could occur because the patient ate a small amount of carbohydrates in the meal. (b) A low peak after the meal followed by a rapid fall in glucose could occur because the patient ate a small amount of carbohydrates in the meal. (c) A steep peak followed by a rapid fall in glucose could occur when the patient ate foods rich in carbohydrates with a high glycemic index or injected rapid-acting insulin later than he or she should have. (d) A rapid fall with no peak after the meal could occur when the insulin injected before the previous meal is still active (insulin on board).

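We used the observations above to define features for predicting hypoglycemia near mealtime. The first feature is the ‘rate of increase in glucose’ (RIG), which is the rate of glucose increase from a meal to a peak:

$$\mathrm{RIG}_{i,j,t} = \frac{\mathrm{CGM}_{i,j,\mathrm{peak}_t} - \mathrm{CGM}_{i,j,0}}{\mathrm{TD}_{\text{meal-to-peak}}} \tag{3}$$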

where $\mathrm{CGM}_{i,j,\mathrm{peak}_t}$ is the highest CGM data point between the time of the $j$th meal announcement of the $i$th CGM time series and the prediction time $t$, $\mathrm{CGM}_{i,j,0}$ is the CGM data point at the $j$th meal announcement, and $\mathrm{TD}_{\text{meal-to-peak}}$ is the time difference between the meal announcement and the peak. The RIG is updated until the peak CGM data point is found after the meal announcement. If there is no peak CGM data point, the RIG is set to 0. By this definition, an RIG of zero implies that there is no increase in glucose after the meal.
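A minimal sketch of the RIG computation under stated assumptions: "no peak" is read here as "no sample up to the prediction time exceeds the value at the meal announcement", the time difference is expressed in minutes, and the function name `rate_of_increase` is hypothetical. The source does not specify the unit of the time difference, so the per-minute scaling is illustrative only.

```python
from typing import Sequence

SAMPLE_MIN = 5  # CGM sampling interval in minutes (from the text)


def rate_of_increase(window: Sequence[float], t: int) -> float:
    """RIG at prediction index t.

    window[0] is the CGM value at the meal announcement and window[k]
    the value k samples later. The peak is taken as the highest value
    in window[0..t] (first occurrence on ties).
    """
    if not 0 <= t < len(window):
        raise IndexError("prediction index outside the window")
    baseline = window[0]
    peak_idx = max(range(t + 1), key=lambda k: window[k])
    if window[peak_idx] <= baseline:
        return 0.0  # assumption: no rise above the meal value means no peak
    return (window[peak_idx] - baseline) / (peak_idx * SAMPLE_MIN)
```

With `window = [6.0, 7.0, 9.0, 8.0, 7.0]`, the peak of 9.0 is reached 10 min after the meal, so the RIG at `t = 4` is 3.0 / 10 = 0.3 mmol/L per minute; a window that only falls after the meal yields 0.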

Since the change in CGM data points is large before hypoglycemia occurs (Fig. 1), we defined the second feature, the glucose rate of change (GRC), as:

$$\mathrm{GRC}_{i,j,t} = \mathrm{CGM}_{i,j,t} - \mathrm{CGM}_{i,j,t-1} \tag{4}$$

where $\mathrm{CGM}_{i,j,t}$ is the CGM data point at the time of prediction from the $j$th meal announcement of the $i$th CGM time series, and $\mathrm{CGM}_{i,j,t-1}$ is the CGM data point immediately prior to the time of prediction. Since the GRC captures the near-instantaneous change in CGM data points around the time of prediction, it can be used to predict hypoglycemia [26, 33]. The third feature is the CGM data point at the time of prediction ($\mathrm{CGM}_{i,j,t}$) from the $j$th meal announcement of the $i$th CGM time series. To define labels, we took into account the presence of a hypoglycemia alert value [34, 35] at the 30-min prediction horizon, which is six 5-min samples ahead (i.e., $\mathrm{CGM}_{i,j,t+6}$). If $\mathrm{CGM}_{i,j,t+6} < 3.9$ mmol/L (70 mg/dL), we set $\mathrm{Label}_{i,j,t} = 1$. Otherwise, we set $\mathrm{Label}_{i,j,t} = 0$ (Fig. 2).
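The GRC and the labeling rule translate into a few lines. The sketch below assumes the GRC is the plain difference between consecutive samples (per 5-min step) and that glucose is stored in mmol/L; the function names are hypothetical.

```python
from typing import Sequence

HYPO_THRESHOLD = 3.9  # mmol/L (70 mg/dL), hypoglycemia alert value
HORIZON = 6           # six 5-min samples = 30-min prediction horizon


def glucose_rate_of_change(window: Sequence[float], t: int) -> float:
    """GRC at index t: difference between the current and previous sample."""
    if t < 1:
        raise IndexError("GRC needs a previous sample (t >= 1)")
    return window[t] - window[t - 1]


def hypoglycemia_label(window: Sequence[float], t: int,
                       horizon: int = HORIZON,
                       threshold: float = HYPO_THRESHOLD) -> int:
    """Return 1 if the sample `horizon` steps after t is below the threshold."""
    return 1 if window[t + horizon] < threshold else 0
```

Note that the comparison is strict, so a reading of exactly 3.9 mmol/L at the horizon is labeled 0.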

Fig. 2. The three features and the 30-min prediction horizon. Blue line: CGM time-series data points; red line: CGM data point <3.9 mmol/L (70 mg/dL); magenta filled circle: CGM data point at the meal; red filled circle: peak CGM data point after the meal; green filled circle: CGM data point at the time of prediction; black arrow: rate of increase in glucose (RIG); red arrow: glucose rate of change (GRC); transparent yellow box: observational window; transparent green box: the 30-min prediction horizon.

We obtained all available CGM data points between 5 min and 3.5 h after each meal announcement (i.e., from $\mathrm{CGM}_{i,j,1}$ to $\mathrm{CGM}_{i,j,42}$). The corresponding hypoglycemia alert values, which occur from 35 min to 4 h after the meal announcement, were included (i.e., from $\mathrm{Label}_{i,j,1}$ to $\mathrm{Label}_{i,j,42}$). Although postprandial hypoglycemia can occur later than 4 h after a meal, we chose the window of 35 min to 4 h because extending the window beyond this period decreases the prediction accuracy of the algorithm. Well-established algorithms already exist for predicting fasting or nocturnal hypoglycemia [25, 36], so the clinical need for a dedicated postprandial algorithm is greatest during the first 4 h after each meal, a period that is typically difficult to cover with existing nocturnal hypoglycemia prediction algorithms, which were developed for gradual changes in blood glucose levels.

The data processing and feature extraction were performed in the following steps. First, from the $i$th CGM time series, the $j$th meal announcement was selected and the CGM data points from $\mathrm{CGM}_{i,j,1}$ to $\mathrm{CGM}_{i,j,42}$ were sampled. Second, from the sampled series, the features $\mathrm{CGM}_{i,j,t}$, $\mathrm{RIG}_{i,j,t}$, and $\mathrm{GRC}_{i,j,t}$ were extracted while increasing $t$ from 1 to 42. The label was obtained from the CGM data point at the 30-min prediction horizon (i.e., $\mathrm{CGM}_{i,j,t+6}$).
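The two steps can be sketched for a single meal announcement as a loop over the prediction index. This is an illustrative reading of the procedure, not the authors' implementation: `extract_samples` is a hypothetical name, the window is assumed to start at the meal announcement (index 0) and to contain no missing values, the RIG is scaled per minute, and the peak is tracked as a running maximum so that the RIG is updated as the loop advances.

```python
from typing import List, Sequence, Tuple

SAMPLE_MIN = 5        # minutes between CGM samples
T_MAX = 42            # last prediction index (3.5 h after the meal)
HORIZON = 6           # 30-min prediction horizon in samples
HYPO_THRESHOLD = 3.9  # mmol/L

Sample = Tuple[float, float, float, int]  # (CGM, RIG, GRC, Label)


def extract_samples(window: Sequence[float]) -> List[Sample]:
    """Build the 42 (CGM, RIG, GRC, Label) tuples for one meal.

    Requires indices 0..48, i.e. 4 h of data after the announcement,
    because the label for t = 42 looks at index 48.
    """
    if len(window) < T_MAX + HORIZON + 1:
        raise ValueError("window too short for t = 1..42 with a 30-min horizon")
    samples: List[Sample] = []
    peak_idx = 0  # running peak; index 0 means no rise above the meal value yet
    for t in range(1, T_MAX + 1):
        if window[t] > window[peak_idx]:
            peak_idx = t
        if peak_idx > 0:
            rig = (window[peak_idx] - window[0]) / (peak_idx * SAMPLE_MIN)
        else:
            rig = 0.0
        grc = window[t] - window[t - 1]
        label = 1 if window[t + HORIZON] < HYPO_THRESHOLD else 0
        samples.append((window[t], rig, grc, label))
    return samples
```

Each meal with a complete 4-h window therefore contributes exactly 42 samples to the data set.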

The first and second steps were repeated for the 107 CGM time series around mealtimes, yielding the sample set $D = \{(\mathrm{CGM}_{i,j,t}, \mathrm{RIG}_{i,j,t}, \mathrm{GRC}_{i,j,t}, \mathrm{Label}_{i,j,t})\}$ with $i = 1, \ldots, 107$, $j = 1, \ldots, M_i$, and $t = 1, \ldots, 42$, where $M_i$ is the total number of meal announcements of the $i$th CGM time series. Before training our models, each extracted feature was normalized with a MinMax scaler.
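Min-max normalization rescales each feature column to a fixed range. The sketch below assumes the target range is [0, 1], which the text does not state, and maps constant columns to 0; `min_max_scale` is a hypothetical helper written with the standard library only.

```python
from typing import List, Sequence


def min_max_scale(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each column to [0, 1] via (x - min) / (max - min).

    `rows` holds one feature vector per sample, e.g. (CGM, RIG, GRC).
    A column whose values are all equal is mapped to 0.0.
    """
    if not rows:
        return []
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]
```

For instance, `min_max_scale([[4.0, 0.0], [8.0, 0.3], [6.0, 0.15]])` maps the first row to the column minima (0), the second to the maxima (1), and the third to the midpoints.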
