2.2. Data processing

Lee Sherlock, Brendan R. Martin, Sinah Behsangar, K. H. Mok

In one-dimensional 1H-NMR spectroscopy, signals are represented in the frequency domain, obtained by Fourier transformation of the acquired time-domain signal. Chemical shifts are reported in parts per million (ppm), with 0.0 ppm defined by the chemical shift reference compound. Data processing was performed prior to any analysis to ensure the integrity and reliability of the results.

For the Padayachee et al. (19) data, several pre-processing steps were applied to the 400-MHz spectra using the Varian/Agilent software. These steps involved zero-filling and multiplication by an exponential apodization function with 0.7 Hz line broadening before Fourier transformation. The spectra then underwent manual phasing, automatic baseline correction using polynomials or splines, and referencing to trimethylsilyl-2,2,3,3-tetradeuteropropionic acid (TSP) at 0.015 ppm. Finally, the spectra were normalized by the total area under the curve, excluding the water and TSP signals.
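
The apodization, zero-filling, Fourier transformation, and area-normalization steps can be sketched as follows. The FID, spectral width, and acquisition size below are synthetic illustrations, not values from the study; only the 0.7 Hz line broadening is taken from the text.

```python
import numpy as np

# Assumed acquisition parameters for illustration only
sw = 4800.0          # spectral width in Hz (assumed)
n = 4096             # acquired complex points (assumed)
t = np.arange(n) / sw

# Synthetic one-peak FID: a 100 Hz resonance decaying with T2* = 0.3 s
fid = np.exp(2j * np.pi * 100.0 * t) * np.exp(-t / 0.3)

# 1) Exponential apodization with 0.7 Hz line broadening (as in the text)
lb = 0.7
fid_apod = fid * np.exp(-np.pi * lb * t)

# 2) Zero-filling to twice the acquired size
fid_zf = np.concatenate([fid_apod, np.zeros(n, dtype=complex)])

# 3) Fourier transform to the frequency domain
spectrum = np.fft.fftshift(np.fft.fft(fid_zf)).real
freqs = np.fft.fftshift(np.fft.fftfreq(2 * n, d=1.0 / sw))

# 4) Normalization by total spectral area
spectrum_norm = spectrum / np.sum(np.abs(spectrum))
```

In a real workflow the water and TSP regions would be excluded from the normalization area, and the frequency axis would be converted to ppm against the TSP reference.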

Regarding the Rist et al. (22) data, both plasma and urine samples were subjected to untargeted 1D 1H NMR spectroscopy. Plasma samples were measured at 310 K on an AVANCE II 600 MHz NMR spectrometer equipped with a 1H-BBI probehead and a BACS sample changer, while urine samples were analyzed at 300 K on a Bruker 600 MHz spectrometer, either an AVANCE III equipped with a 1H,13C,15N-TCI inversely detected cryoprobe or an AVANCE II equipped with a 1H-BBI room-temperature probe. The plasma spectra were referenced to the ethylenediaminetetraacetic acid (EDTA) signal at 2.5809 ppm and bucketed graphically, ensuring that each bucket contained only one signal or group of signals and that no peaks were split between buckets. The urine spectra were resampled onto a uniform frequency axis and aligned using “correlation optimized warping.” Bucketing was then performed using in-house Python-based software, again assigning signals or groups of signals to individual buckets without splitting peaks between them. The resulting bucket tables were used for statistical analyses and machine learning algorithms.
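
The in-house bucketing software is not public, but the core operation (integrating the spectrum over pre-defined ppm ranges chosen so that no peak is split across a boundary) can be sketched as below. The ppm axis, peak positions, and bucket boundaries are illustrative assumptions.

```python
import numpy as np

# Synthetic spectrum on a descending ppm axis (illustrative only)
ppm = np.linspace(10.0, 0.0, 1000)
intensity = np.exp(-((ppm - 3.2) ** 2) / 0.01) + \
            0.5 * np.exp(-((ppm - 1.3) ** 2) / 0.02)

# Bucket boundaries (high ppm, low ppm), chosen so each bucket
# holds one signal and no peak is split between buckets
buckets = [(3.4, 3.0), (1.5, 1.1)]

dx = abs(ppm[1] - ppm[0])          # uniform point spacing
bucket_table = {}
for hi, lo in buckets:
    mask = (ppm <= hi) & (ppm >= lo)
    # approximate the integral as intensity sum times spacing
    bucket_table[(hi, lo)] = intensity[mask].sum() * dx
```

The resulting bucket table (one integrated value per region, per sample) is the form of data that feeds the statistical analyses described later.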

The pre-processed outputs from the studies by Rist et al. (22) and Padayachee et al. (19) were then investigated further. Metabolite identification was performed using Chenomx NMR Suite 8.1 (Chenomx, Edmonton, Canada) and the Human Metabolome Database (HMDB). A number of signals could not be identified by either approach; these appear in the results section and corresponding graphs as “Unknown – PPM”.

The data obtained from the study by Padayachee et al. (19) required further processing to reduce background noise and increase the overall resolution of the data. This was achieved by binning the data into sub-intervals of 0.01 ppm. The same approach could not be applied to the data from the study by Rist et al. (22), as that binning was conducted in-house and correlated with pre-defined metabolites. These differences in binning procedure and field strength may account for some of the variation in the results.
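
Binning into fixed 0.01-ppm intervals can be sketched as follows; the ppm axis and intensities are synthetic stand-ins for the Padayachee et al. spectra.

```python
import numpy as np

# Synthetic spectrum: 20,000 points across 0-10 ppm (illustrative)
ppm = np.linspace(0.0, 10.0, 20000)
intensity = np.random.default_rng(0).random(20000)

# Fixed 0.01-ppm bin edges: 1000 bins across 0-10 ppm
edges = np.linspace(0.0, 10.0, 1001)

# Map each spectral point to its bin, then sum intensities per bin
idx = np.digitize(ppm, edges) - 1
idx = np.clip(idx, 0, len(edges) - 2)   # fold the right edge into the last bin
binned = np.bincount(idx, weights=intensity, minlength=len(edges) - 1)
```

Summing (rather than averaging) within each bin preserves total spectral area, so bin values remain comparable to the area-normalized spectra.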

As is common practice in NMR, we removed the water signal and its corresponding ppm region, as water often accounts for the majority of peak intensity and can mask minor variations in the spectra. Because the two datasets differed in form, standardization was required: for the Rist et al. (22) data, negative values were set to zero and mean-centered scaling was applied, while for the Padayachee et al. (19) data, feature values were transformed to follow a uniform or normal distribution. This helped to stabilize the variance and minimize the effect of outliers, improving the performance of the predictive model. Scaling is important because it facilitates a fair comparison between different features.
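
The two standardization routes can be sketched as follows. The feature matrices are synthetic, and the uniform-distribution transform is implemented here as a simple rank-based quantile map (similar in spirit to scikit-learn's QuantileTransformer, which the text does not name explicitly).

```python
import numpy as np

rng = np.random.default_rng(1)
X_rist = rng.normal(size=(20, 5)) - 0.5   # stand-in for Rist et al. buckets
X_pada = rng.lognormal(size=(20, 5))      # stand-in for Padayachee et al. bins

# Rist et al. route: set negative values to zero, then mean-center per feature
X_rist = np.clip(X_rist, 0.0, None)
X_rist_centered = X_rist - X_rist.mean(axis=0)

# Padayachee et al. route: rank-based transform of each feature
# to a uniform distribution on [0, 1]
ranks = X_pada.argsort(axis=0).argsort(axis=0)
X_pada_uniform = ranks / (X_pada.shape[0] - 1)
```

Mapping to a uniform (or, via the normal quantile function, Gaussian) distribution caps the leverage of outliers, which is the variance-stabilizing effect described above.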

Finally, the dataset was divided into a test set comprising 33% of the data and a training set comprising the remaining 67%. This partitioning ensures an unbiased evaluation of the algorithm's performance. To determine the significance of different features in the dataset, the ANOVA F-test, a widely adopted statistical test, was employed for feature selection. To evaluate the algorithm comprehensively, 10-fold cross-validation was applied, a method commonly employed in machine learning to assess performance across multiple subsets of the dataset. The data were divided into 10 equal parts, and the algorithm was trained and evaluated 10 times, each time using a different combination of nine parts for training and one part for testing. This approach provides a more robust assessment of the algorithm's generalization capability and overall performance.
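
The evaluation pipeline above can be sketched as follows on synthetic data: a shuffled 33%/67% split, per-feature ANOVA F-scores computed by hand (two-group one-way ANOVA), and 10-fold cross-validation index generation. No particular classifier is assumed, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_features = 60, 8
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)   # two-class labels
X[y == 1, 0] += 2.0                      # make feature 0 informative

# 1) Shuffled 33% test / 67% train split
perm = rng.permutation(n_samples)
n_test = int(round(0.33 * n_samples))
test_idx, train_idx = perm[:n_test], perm[n_test:]

# 2) One-way ANOVA F-score for each feature
def anova_f(x, labels):
    groups = [x[labels == c] for c in np.unique(labels)]
    grand = x.mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_b, df_w = len(groups) - 1, len(x) - len(groups)
    return (ss_between / df_b) / (ss_within / df_w)

f_scores = np.array([anova_f(X[train_idx, j], y[train_idx])
                     for j in range(n_features)])
best_feature = int(np.argmax(f_scores))

# 3) 10-fold CV: each training sample lands in exactly one validation fold
folds = np.array_split(rng.permutation(len(train_idx)), 10)
```

Computing the F-scores on the training split only, as here, avoids leaking information from the test set into feature selection.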
