Direct Data Merge (DiMe) Strategy Used in This Study Based on the m/z Values

Xuejiao Cui; Qingxia Yang; Bo Li; Jing Tang; Xiaoyu Zhang; Shuang Li; Fengcheng Li; Jie Hu; Yan Lou; Yunqing Qiu; Weiwei Xue; Feng Zhu

Concise Method

This method is extracted from research article: Front Pharmacol, Feb 2019

Assessing the Effectiveness of Direct Data Merging Strategy in Long-Term and Large-Scale Pharmacometabonomics

DOI: 10.3389/fphar.2019.00127

The workflow of the DiMe strategy applied in this work is illustrated in Figure 2a. Four pairs of metabolomics benchmark datasets were adopted to assess the performance of the DiMe strategy: experimental datasets (1) and (2) from MTBLS17 ESI+ (Haug et al., 2013), datasets (3) and (4) from MTBLS17 ESI- (Haug et al., 2013), datasets (5) and (6) from MTBLS19 ESI+ (Haug et al., 2013), and datasets (7) and (8) from MTBLS19 ESI- (Haug et al., 2013). For each experimental dataset, peak detection, retention time (RT) correction and peak alignment were first applied to the UHPLC/Q-TOF-MS raw data (in CDF format) using the xcmsSet, group and retcor functions in the XCMS package (Smith et al., 2006), with both fwhm and bw set to ten (Li et al., 2016). Then, the two datasets in each pair were merged based on their m/z values with a tolerance of 0.05 ppm (Zhang et al., 2014). Specifically, the peaks common to both datasets within this tolerance were selected, and the two datasets were merged into a single larger dataset on the basis of these common peaks.
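The merging step can be pictured with a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary layout (m/z mapped to per-sample intensities), the greedy one-to-one pairing and the 5 ppm tolerance in the example are all assumptions made for illustration, and the tolerance is left as a parameter.

```python
def ppm_diff(mz_a, mz_b):
    """Relative difference between two m/z values, in parts per million."""
    return abs(mz_a - mz_b) / mz_b * 1e6


def merge_by_mz(peaks_a, peaks_b, tol_ppm):
    """Pair peaks from two datasets whose m/z values agree within tol_ppm.

    peaks_a, peaks_b: dict mapping m/z -> list of intensities (one per sample).
    Returns a list of (mz_a, mz_b, intensities of A samples + B samples).
    Assumption: greedy one-to-one pairing on sorted m/z, not nearest-match.
    """
    mz_a = sorted(peaks_a)
    mz_b = sorted(peaks_b)
    merged = []
    i = j = 0
    while i < len(mz_a) and j < len(mz_b):
        a, b = mz_a[i], mz_b[j]
        if ppm_diff(a, b) <= tol_ppm:
            # Common peak: concatenate the samples of both datasets.
            merged.append((a, b, peaks_a[a] + peaks_b[b]))
            i += 1
            j += 1
        elif a < b:
            i += 1  # peak only in dataset A, dropped
        else:
            j += 1  # peak only in dataset B, dropped
    return merged


# Hypothetical toy data: two samples in dataset_1, one sample in dataset_2.
dataset_1 = {100.0000: [10, 12], 200.0000: [5, 6], 300.0000: [7, 8]}
dataset_2 = {100.0001: [11], 250.0000: [3], 300.0200: [9]}

merged = merge_by_mz(dataset_1, dataset_2, tol_ppm=5.0)  # illustrative tolerance
print(merged)
```

Only peaks found in both datasets survive the merge, so the merged table has more samples but at most as many peaks as the smaller dataset.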

Figure 2. Schematic representations of the workflows of the analytical strategies applied in this study. (a) The pipeline of direct merge; (b) the pipeline of results integration.

Prior to biomarker identification, datasets are commonly pretreated in current metabolomics studies (De Livera et al., 2012; Zhu et al., 2018; Zuo et al., 2018). Herein, the merged dataset was pretreated by missing value imputation using the k-Nearest Neighbor (KNN) method and data normalization using MSTUS. The KNN method imputes values based on the K features most similar to the feature with missing values (Shah et al., 2017). Among the available imputation methods, the KNN algorithm was reported as the most robust for analyzing MS-based metabolomic data (Di Guida et al., 2016). By assuming that the numbers of increased and decreased metabolic signals are roughly equivalent, MSTUS uses the total signal of the metabolites shared by all samples (Warrack et al., 2009). MSTUS has been referred to as one of the best choices for overcoming sample variability in urinary metabolomics and has been used to identify diagnostic and prognostic biomarkers (Chen et al., 2013; Mathe et al., 2014). Therefore, the KNN algorithm and the MSTUS method were adopted in this study to impute the missing metabolite signals and to transform/normalize the data matrix. After this preparation, the training, testing and independent test datasets were constructed by random sampling of the merged dataset. These three datasets were prepared for assessing the identification precision and classification capacity of the DiMe strategy (described in the last section of “Materials and Methods”). Furthermore, another 10 datasets were generated by randomly sampling half of the merged dataset 10 times, and these were used for evaluating the robustness of the DiMe strategy (described in the last section of “Materials and Methods”).
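The two pretreatment steps can be sketched in plain Python as follows. This is a simplified illustration and not the authors' code. The function names, the RMS distance between features, the zero fallback and the choice to define "shared" peaks from the matrix before imputation are all assumptions.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries using the k most similar features (columns).

    matrix: list of samples, each a list of peak intensities (None = missing).
    Assumption: similarity is the RMS difference over samples observed in
    both features; real KNN imputers usually scale the data first.
    """
    n_samples, n_feats = len(matrix), len(matrix[0])
    cols = [[row[j] for row in matrix] for j in range(n_feats)]

    def distance(u, v):
        pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
        if not pairs:
            return math.inf
        return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))

    filled = [row[:] for row in matrix]
    for j, col in enumerate(cols):
        if None not in col:
            continue
        neighbours = sorted((f for f in range(n_feats) if f != j),
                            key=lambda f: distance(col, cols[f]))[:k]
        for i in range(n_samples):
            if col[i] is None:
                donors = [cols[f][i] for f in neighbours if cols[f][i] is not None]
                # Assumption: fall back to 0.0 when no neighbour has a value.
                filled[i][j] = sum(donors) / len(donors) if donors else 0.0
    return filled


def mstus_normalize(filled, raw):
    """Divide each sample by its total signal over peaks detected in all samples."""
    shared = [j for j in range(len(raw[0])) if all(row[j] is not None for row in raw)]
    normalized = []
    for row in filled:
        useful_signal = sum(row[j] for j in shared)
        normalized.append([value / useful_signal for value in row])
    return normalized


# Hypothetical toy matrix: 3 samples x 3 peaks, one missing value.
raw = [
    [10.0, 20.0, 30.0],
    [20.0, None, 60.0],
    [30.0, 60.0, 90.0],
]
filled = knn_impute(raw, k=2)
normalized = mstus_normalize(filled, raw)
print(filled[1][1])
print(normalized)
```

In this toy matrix the three samples differ only by an overall dilution factor, so after normalization every sample has the same profile, which is the kind of sample-to-sample variability the normalization is meant to remove.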

After the steps above, PLSDA was used to identify the differential metabolic peaks between distinct sample groups within each merged dataset. Specifically, differential peaks were identified by VIP > 1 and p-value < 0.05 (Fan et al., 2016) and were subsequently annotated against the Human Metabolome Database (HMDB) (Wishart et al., 2013) with an m/z tolerance of 20 ppm (Peng and Li, 2013). The resulting annotated metabolites were the metabolic biomarkers finally identified. Overall, the workflow of the DiMe strategy applied in this study is illustrated in Figure 2a.
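As a rough illustration of the selection and annotation criteria, the sketch below applies the VIP and p-value cut-offs to precomputed statistics and then matches the surviving peaks against a reference list within a ppm tolerance. The PLSDA model itself is not reproduced. The peak values, the reference entries and their names are hypothetical placeholders rather than HMDB records, and comparing observed m/z directly with reference m/z assumes the reference values already account for the ion adduct.

```python
def select_differential(peaks, vip_cut=1.0, p_cut=0.05):
    """Keep peaks with VIP above vip_cut and p-value below p_cut."""
    return [p for p in peaks if p["vip"] > vip_cut and p["p"] < p_cut]


def annotate(mz, reference, tol_ppm=20.0):
    """Return names of reference entries whose m/z lies within tol_ppm of mz."""
    return [name for name, ref_mz in reference.items()
            if abs(mz - ref_mz) / ref_mz * 1e6 <= tol_ppm]


# Hypothetical peaks with precomputed VIP scores and p-values.
peaks = [
    {"mz": 180.0634, "vip": 1.8, "p": 0.010},   # passes both cut-offs
    {"mz": 250.1000, "vip": 0.7, "p": 0.001},   # fails VIP > 1
    {"mz": 300.2000, "vip": 1.3, "p": 0.200},   # fails p < 0.05
]

# Hypothetical reference list (placeholder names and m/z values).
reference = {
    "hypothetical_metabolite_A": 180.0650,
    "hypothetical_metabolite_B": 180.0900,
}

selected = select_differential(peaks)
annotations = {p["mz"]: annotate(p["mz"], reference) for p in selected}
print(annotations)
```

Both cut-offs must hold at once, so a peak with a very small p-value but a VIP below 1 is still discarded before annotation.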

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
