2.3. Dataset Preparation

Nicholas LeBow, Bodo Rueckauer, Pengfei Sun, Meritxell Rovira, Cecilia Jiménez-Jorquera, Shih-Chii Liu, Josep Maria Margarit-Taulé

Several preprocessing steps were performed on the measurement data before it was used to train a classifier model. Incomplete measurement cycles in which not all beverages were recorded, as well as measurements of specific beverage samples that were much shorter than the others, were removed entirely to preserve statistical balance with respect to both the total recording time per beverage and the sequence of transitions between individual beverages. Any measurements lasting significantly longer than 5 minutes were truncated to that length.

The data from each recording session was filtered and normalized independently for several reasons. Not only were sensor offsets and sensitivities observed to change from one acquisition session to the next, but several sensors were replaced for the second (Na+, Cl, Ca2+) and third (H+, Cl) recording sessions due to performance degradation. Independent normalization additionally allowed us to compensate for biases due to changes in ambient conditions between sessions. We removed any data points with corrupted beverage readings from one or more sensors, e.g., during transfer from one beverage to the next, sensor cleaning in deionized water, or accidental contact with the reference electrodes.

The sensor readouts typically show large offsets corresponding to the various beverage types. The rate-based encoding scheme used when converting the trained classifier to a spiking network translates the constant offsets into dense spike trains, reducing the energy efficiency of the spiking model and masking the sparse dynamic signals which are more appropriate for neuromorphic processing. Encoding such high-magnitude offset components would also limit the range of signals that can be represented on low bit resolution systems commonly used in edge applications. Therefore, a configurable high-pass filter was used to attenuate level offsets in the input signals while emphasizing their dynamic components. Its transfer function was adapted to the MicroBeTa dataset by setting a pole at 0.5 mHz, a zero at the origin (i.e., at 0 Hz), and unity gain. In practice, values between 0.5 and 0.8 mHz were found to give accuracies higher than 90% in both ANNs and SNNs. Figure 3 illustrates the effects of the filter when applied to the dataset.
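The described filter (zero at 0 Hz, pole at 0.5 mHz, unity high-frequency gain) can be sketched as follows. This is a minimal illustration, not the authors' implementation; it assumes a 1 Hz sampling rate (one measurement per second, consistent with the 16 measurements = 16 s window used later) and discretizes the analog transfer function H(s) = s / (s + 2πf_pole) with the bilinear transform.

```python
import numpy as np
from scipy import signal

def highpass_filter(x, f_pole_hz=0.5e-3, fs=1.0):
    """First-order high-pass: zero at the origin (0 Hz), pole at
    f_pole_hz, unity gain at high frequencies.
    Continuous-time prototype: H(s) = s / (s + 2*pi*f_pole_hz).
    fs is the sampling rate in Hz (assumed 1 Hz here)."""
    b, a = signal.bilinear([1.0, 0.0], [1.0, 2 * np.pi * f_pole_hz], fs=fs)
    # Filter along the time axis; works for 1-D or [time, sensors] arrays.
    return signal.lfilter(b, a, x, axis=0)
```

With a pole this low, a constant sensor offset decays toward zero over a time scale of roughly 1/(2π·0.5 mHz) ≈ 5 min, while faster dynamic components pass through nearly unattenuated.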

Figure 3. Dataset preprocessing transforms when using a high-pass filter with a cutoff frequency of 0.5 mHz, shown for the first 1,500 measurements in the dataset for a single ISFET sensor channel (Cl). Note the nonlinear enhancement of near-zero signals by the quantile scaler. Signal traces are not continuous in time everywhere due to the removal of invalid measurements and edge discontinuities between labels.

Outliers were removed following the filtering operation, and the resulting signal was subsequently normalized. Both operations were performed according to the statistics of training data only. Outlier values were deleted by excluding all measurements in which at least one sensor channel contained a value further than four standard deviations from the mean of that channel. Each sensor channel was then normalized independently using quantile normalization (Bolstad et al., 2003), which transforms the data to a normal distribution before nonlinearly mapping it to a uniform distribution on [0, 1]. Quantile-normalized data was found to preserve a high correlation between the ANN and the converted SNN, because the initial mapping to a normal distribution prevents a large fraction of input values from being pushed close to zero. Figure 3 shows the effects of this normalization method on the data.
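The outlier removal and per-channel quantile normalization steps could be sketched as below. This is an illustrative sketch, assuming scikit-learn's QuantileTransformer as a stand-in for the quantile normalization of Bolstad et al. (2003); the function names are hypothetical, and, as in the text, all statistics are fitted on training data only.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

def remove_outliers(X, mean, std, n_sigma=4.0):
    """Drop any measurement (row) in which at least one sensor channel
    lies more than n_sigma standard deviations from the training mean."""
    keep = np.all(np.abs(X - mean) <= n_sigma * std, axis=1)
    return X[keep]

def fit_normalizer(X_train):
    """Fit outlier statistics and a per-channel quantile scaler that
    maps each channel to a uniform distribution on [0, 1]."""
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)
    X_clean = remove_outliers(X_train, mean, std)
    qt = QuantileTransformer(n_quantiles=min(1000, len(X_clean)),
                             output_distribution='uniform').fit(X_clean)
    return mean, std, qt
```

At inference time, the same mean, std, and fitted transformer would be applied unchanged to validation and test data.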

Following normalization, the corresponding time series from each recording session were concatenated to produce a single, piecewise-normalized series for each sensor, from which samples could be drawn for training and validating a classifier model. The data samples used in this work are fixed-length time windows containing the signal values from all sensors over a contiguous range of timestamps.

The length of the time window may be chosen arbitrarily to correspond to a time scale of interest. To preserve causality, the label that corresponds to a given time window is defined as the label assigned to the latest measurement in the window. In this context, a sample is an array with shape T × N, where T is the length of a time window that slides over the multivariate time series, and N is the number of sensors. Therefore, the i-th labeled sample (x_i, l_i) in the dataset comprises the sample x_i = ∪_{k=1}^{N} [C_k(t_{i−T}), …, C_k(t_i)] and the label l_i = L(t_i), with C_k denoting the time series of the k-th channel (sensor) and L the time series of labels. Note that the channel order is not relevant for the model architectures used in this work, and changing this order would not be expected to affect final performance.

Because labels were available for every measurement, overlapping time windows with a stride of one measurement were used to make the most of the available data; two consecutive samples therefore share all but the first and last values in each sensor time series. Time windows that contained measurements with multiple labels were discarded, as in these cases the beginning and end of the window contain measurements from different beverages with an implicit discontinuity between them. Figure 4 shows a diagram of the sampling scheme used in this work.
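The sampling scheme described above can be sketched as follows. This is a minimal illustration under the stated assumptions (stride of one measurement, single-label windows only, label taken from the latest measurement); the function name is hypothetical.

```python
import numpy as np

def make_samples(series, labels, T=16):
    """Slide a window of length T (stride 1) over a multivariate time
    series of shape [time, n_sensors]. Windows spanning more than one
    label are discarded; each kept sample has shape T x n_sensors and
    takes the label of its latest measurement."""
    samples, sample_labels = [], []
    for end in range(T, len(series) + 1):
        win_labels = labels[end - T:end]
        if np.all(win_labels == win_labels[-1]):  # single-label window only
            samples.append(series[end - T:end])
            sample_labels.append(win_labels[-1])  # label of latest measurement
    return np.stack(samples), np.array(sample_labels)
```

Note that a window straddling a label transition is simply skipped, so the number of discarded windows grows with T, as discussed below.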

Figure 4. Scheme used for sampling a multivariate time series with channels C1, …, Cn for the n = 9 sensors, and label L. Two consecutive, overlapping samples (time windows) and their corresponding labels are shown in green and blue, respectively. A third time window, shown in red, would be discarded because the measurements it contains span multiple labels.

Shorter time windows are preferable to longer ones for several reasons: Firstly, the number of time windows discarded due to spanning several labels increases with the window length, so a system using shorter time windows makes better use of a limited dataset. Secondly, shorter windows afford the trained system a shorter response time during online inference. Lastly, longer time windows require networks with a greater number of connections than would otherwise be necessary. The shortest sample length that suffices to capture the input signal's relevant dynamics should thus be preferred. A window length of T = 16 measurements was used throughout this work, corresponding to 16 s of sensor recordings.

Most of the available samples from all recording sessions were used for training the ANN models, with the last complete measurement cycle from the final session reserved for testing. Our motivation not to sample the dataset randomly was twofold: On the one hand, the use of overlapping time windows means that if the dataset were sampled randomly, a large fraction of the measurements in each test sample would be identical to measurements in several training samples; on the other hand, random sampling would allow the model to train on data recorded after the data used for testing, a condition that cannot occur in a real online deployment.

The recordings in the MicroBeTa dataset pose several potential challenges for a classification algorithm. In particular, three ISFET sensors had to be replaced before the second recording session, and two more sensors had to be replaced before the third session, as mentioned above. Furthermore, the first session contains significantly more data than either of the subsequent sessions after preprocessing, comprising 55% of the dataset as shown in Table 3. Because of the sequential sampling strategy and the difference in session lengths, using all sessions means that the network is trained primarily on data from the first session, while the test set contains measurements from the third. Nonetheless, the models were ultimately found to generalize well. Such good generalization could be favored by the small number of sensors replaced before the third session of recordings, and by the presence of full measurement cycles from that session in the training set.
