The workflow of the DiMe strategy applied in this work is illustrated in Figure 2a. Four pairs of metabolomics benchmark datasets were adopted to assess the performance of the DiMe strategy: experimental datasets (1) and (2) from MTBLS17 ESI+ (Haug et al., 2013), datasets (3) and (4) from MTBLS17 ESI- (Haug et al., 2013), datasets (5) and (6) from MTBLS19 ESI+ (Haug et al., 2013), and datasets (7) and (8) from MTBLS19 ESI- (Haug et al., 2013). For each experimental dataset, peak detection, retention time (RT) correction and peak alignment were first applied to the UHPLC/Q-TOF-MS raw data (in CDF format) using the xcmsSet, group and retcor functions in the XCMS package (Smith et al., 2006), with both fwhm and bw set to 10 (Li et al., 2016). The two datasets in each pair were then merged based on their m/z values with a tolerance of 0.05 ppm (Zhang et al., 2014). Specifically, the peaks common to both datasets within this tolerance were selected, and the two datasets were merged on these common peaks into a single larger dataset.
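The m/z-based merging step can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the study: the function names, the greedy closest-match rule and the samples-by-peaks list layout are all assumptions made for the example, and the tolerance is left as a parameter.

```python
def ppm_diff(mz_a, mz_b):
    """Relative m/z difference in parts per million, measured against mz_a."""
    return abs(mz_a - mz_b) / mz_a * 1e6


def match_common_peaks(mz_first, mz_second, tol_ppm):
    """Pair each peak of the first dataset with its closest unused peak
    of the second dataset that lies within tol_ppm (greedy, assumed rule).

    Returns a list of (index_in_first, index_in_second) pairs.
    """
    pairs = []
    used = set()
    for i, mz_a in enumerate(mz_first):
        best_j, best_d = None, None
        for j, mz_b in enumerate(mz_second):
            if j in used:
                continue
            d = ppm_diff(mz_a, mz_b)
            if d <= tol_ppm and (best_d is None or d < best_d):
                best_j, best_d = j, d
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


def merge_datasets(mz_first, samples_first, mz_second, samples_second, tol_ppm):
    """Keep only the common peaks and stack the samples of both datasets.

    samples_* are lists of rows (one row per sample, one column per peak).
    Returns the m/z values of the common peaks and the merged matrix.
    """
    pairs = match_common_peaks(mz_first, mz_second, tol_ppm)
    merged_mz = [mz_first[i] for i, _ in pairs]
    merged = [[row[i] for i, _ in pairs] for row in samples_first]
    merged += [[row[j] for _, j in pairs] for row in samples_second]
    return merged_mz, merged
```

Peaks without a partner inside the tolerance are dropped, so the merged matrix contains every sample of both datasets but only the shared peaks.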
Schematic representations of the workflows of the analytical strategies applied in this study. (a) the pipeline of direct merge; (b) the pipeline of results integration.
Datasets are commonly pretreated prior to biomarker identification in current metabolomics studies (De Livera et al., 2012; Zhu et al., 2018; Zuo et al., 2018). Herein, the merged dataset was pretreated by missing value imputation using the k-Nearest Neighbor (KNN) method, followed by data normalization using MS total useful signal (MSTUS). The KNN method imputes each missing value based on the K features most similar to the feature containing the missing value (Shah et al., 2017). Among the available imputation methods, the KNN algorithm was reported to be the most robust for analyzing MS-based metabolomic data (Di Guida et al., 2016). Assuming that the numbers of increased and decreased metabolic signals are roughly equal, MSTUS normalizes each sample using the total signal of the metabolites shared by all samples (Warrack et al., 2009). MSTUS has been reported as one of the best choices for overcoming sample variability in urinary metabolomics and has been used to identify diagnostic and prognostic biomarkers (Chen et al., 2013; Mathe et al., 2014). Therefore, the KNN algorithm and the MSTUS method were adopted in this study to impute the missing metabolite signals and to normalize the data matrix, respectively. After this preparation, the training, testing and independent test datasets were constructed by random sampling of the merged dataset. These three datasets were used to assess the identification precision and classification capacity of the DiMe strategy (described in the last section of “Materials and Methods”). Furthermore, another 10 datasets were generated by randomly sampling half of the merged dataset 10 times, and these were used to evaluate the robustness of the DiMe strategy (described in the last section of “Materials and Methods”).
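The two pretreatment steps can be illustrated with a small sketch. It is a simplified plain-Python version written under stated assumptions, not the implementation used in the study: missing values are encoded as None, feature similarity is a root-mean-square distance over the samples in which both features were observed, and the features "shared by all samples" are taken to be those with no missing value in the raw matrix. The function names are hypothetical.

```python
import math


def knn_impute(matrix, k):
    """Impute missing values (None) from the k most similar features.

    matrix: list of rows, one row per sample, one column per feature.
    Returns a new matrix; a value stays None if no neighbour is usable.
    """
    n_samples = len(matrix)
    n_features = len(matrix[0])
    columns = [[row[f] for row in matrix] for f in range(n_features)]

    def distance(col_a, col_b):
        # Root-mean-square difference over samples observed in both features.
        shared = [(a, b) for a, b in zip(col_a, col_b)
                  if a is not None and b is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))

    imputed = [list(row) for row in matrix]
    for f, col in enumerate(columns):
        for s in range(n_samples):
            if col[s] is not None:
                continue
            # Candidate neighbours: other features observed in sample s.
            candidates = [(distance(col, columns[g]), g)
                          for g in range(n_features)
                          if g != f and columns[g][s] is not None]
            candidates = [c for c in candidates if c[0] != math.inf]
            candidates.sort()
            nearest = candidates[:k]
            if nearest:
                imputed[s][f] = sum(columns[g][s] for _, g in nearest) / len(nearest)
    return imputed


def mstus_normalize(imputed, raw):
    """Divide each sample by its total signal over features observed in all samples.

    raw is the matrix before imputation and is only used to decide which
    features are shared by all samples. Intensities are assumed positive.
    """
    n_features = len(raw[0])
    shared = [f for f in range(n_features)
              if all(row[f] is not None for row in raw)]
    normalized = []
    for row in imputed:
        total = sum(row[f] for f in shared)
        normalized.append([None if v is None else v / total for v in row])
    return normalized
```

In practice, features are usually scaled before computing distances so that high-intensity peaks do not dominate the neighbour search; that step is omitted here for brevity.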
After the steps described above, PLSDA was used to identify the differential metabolic peaks between distinct sample groups within each merged dataset. Specifically, differential peaks were defined as those with VIP > 1 and p-value < 0.05 (Fan et al., 2016), and were subsequently annotated against the Human Metabolome Database (HMDB) (Wishart et al., 2013) with an m/z tolerance of 20 ppm (Peng and Li, 2013). The resulting annotated metabolites were the metabolic biomarkers finally identified. Overall, the workflow of the DiMe strategy applied in this study is illustrated in Figure 2a.
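The selection and annotation criteria can be expressed compactly. The sketch below assumes that the VIP scores and p-values have already been computed elsewhere, and that a reference list of (name, m/z) pairs has been exported from the database beforehand. The function names, the dictionary keys and the reference entries used in the usage example are hypothetical placeholders, not real database records.

```python
def select_differential_peaks(peaks, vip_cutoff=1.0, p_cutoff=0.05):
    """Keep peaks with VIP above vip_cutoff and p-value below p_cutoff.

    peaks: list of dicts with keys "mz", "vip" and "p_value" (assumed layout).
    """
    return [p for p in peaks
            if p["vip"] > vip_cutoff and p["p_value"] < p_cutoff]


def annotate_by_mz(peaks, reference, tol_ppm=20.0):
    """Match each peak to reference entries whose m/z lies within tol_ppm.

    reference: list of (name, mz) pairs.
    Returns (peak m/z, [matching names]) for every peak with at least one hit.
    """
    annotated = []
    for peak in peaks:
        hits = [name for name, ref_mz in reference
                if abs(peak["mz"] - ref_mz) / ref_mz * 1e6 <= tol_ppm]
        if hits:
            annotated.append((peak["mz"], hits))
    return annotated


# Usage with made-up values: only peaks that pass both cutoffs are annotated.
example_peaks = [
    {"mz": 200.002, "vip": 1.5, "p_value": 0.01},
    {"mz": 300.000, "vip": 0.8, "p_value": 0.001},
]
example_reference = [("reference_A", 200.0)]  # placeholder entry
selected = select_differential_peaks(example_peaks)
biomarkers = annotate_by_mz(selected, example_reference)
```

A peak can match several reference entries within the tolerance, so each annotated peak carries a list of candidate names rather than a single identity.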