Chemometrics and multivariate data analysis

Vicky Caponigro; Anna L. Tornesello; Fabrizio Merciai; Danila La Gioia; Emanuela Salviati; Manuela G. Basilicata; Simona Musella; Francesco Izzo; Angelo S. Megna; Luigi Buonaguro; Eduardo Sommella; Franco M. Buonaguro; Maria L. Tornesello; Pietro Campiglia


Concise Method


This method is extracted from research article: J Transl Med, Dec 2023

Integrated plasma metabolomics and lipidomics profiling highlights distinctive signature of hepatocellular carcinoma in HCV patients

DOI: 10.1186/s12967-023-04801-4


The filtered data were processed and analysed in MATLAB R2022b (MathWorks Inc., Natick, MA, USA) using both custom-developed routines and standard MATLAB functions for multivariate data analysis. Each dataset was analysed separately, and low-level data fusion was also employed [22]. Further information on the low-level data fusion can be found in Additional file 1: Section S.2.1 Data Fusion.
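As an illustration of what low-level data fusion generally means, the sketch below concatenates two data blocks column-wise after giving each block the same total sum of squares. It is a generic Python sketch and not the authors' MATLAB routine: the function names, the toy values, and the choice of Frobenius-norm block scaling are assumptions made here for illustration; the procedure actually used is described in Additional file 1: Section S.2.1.

```python
import math

def block_scale(block):
    """Divide a block by its Frobenius norm so its total sum of squares is 1.
    (Assumed scaling choice; other block weightings are possible.)"""
    norm = math.sqrt(sum(v * v for row in block for v in row))
    return [[v / norm for v in row] for row in block]

def fuse_low_level(blocks):
    """Concatenate block-scaled blocks column-wise.
    Rows must be the same samples, in the same order, in every block."""
    n_samples = len(blocks[0])
    if any(len(b) != n_samples for b in blocks):
        raise ValueError("all blocks must contain the same samples")
    scaled = [block_scale(b) for b in blocks]
    return [[v for b in scaled for v in b[i]] for i in range(n_samples)]

# Hypothetical, already pre-processed toy blocks (3 samples each).
metab = [[0.5, -1.0], [-0.5, 0.0], [0.0, 1.0]]
lipid = [[1.0, 0.2, -0.3], [-1.0, 0.1, 0.4], [0.0, -0.3, -0.1]]

fused = fuse_low_level([metab, lipid])  # 3 samples x 5 variables
```

Scaling each block before concatenation keeps a block with many variables from dominating the fused model purely because of its size.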

The lipidomics and metabolomics datasets were pre-processed independently. Metabolomics data were normalized by the total ion sum in MATLAB, while lipidomics data were normalized against class-specific internal standards. Missing values and zeros were replaced with one-fifth of the minimum value of the corresponding molecule in the dataset. The values were then log10-transformed and autoscaled, i.e., each variable was centred by subtracting its mean and then divided by its standard deviation. These pre-processing steps were completed before any further chemometric modelling was performed.
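The pre-processing chain described above (total-sum normalization, replacement of zeros and missing values, log10 transformation, autoscaling) can be sketched in a few lines of Python. This is a minimal sketch and not the authors' MATLAB code: the function names and the toy matrix are hypothetical, missing values are represented as None, and the sample standard deviation (n - 1 denominator) is assumed for autoscaling.

```python
import math
from statistics import mean, stdev

def total_sum_normalize(rows):
    """Divide each sample (row) by the sum of its observed intensities."""
    out = []
    for row in rows:
        total = sum(v for v in row if v is not None)
        out.append([None if v is None else v / total for v in row])
    return out

def impute_fifth_of_min(rows):
    """Replace None and 0 by one-fifth of the column's smallest positive value."""
    out = [list(r) for r in rows]
    for j in range(len(rows[0])):
        fill = min(r[j] for r in rows if r[j] is not None and r[j] > 0) / 5.0
        for r in out:
            if r[j] is None or r[j] == 0:
                r[j] = fill
    return out

def log10_transform(rows):
    """Take the base-10 logarithm of every value."""
    return [[math.log10(v) for v in row] for row in rows]

def autoscale(rows):
    """Centre each column on its mean and divide by its standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]  # assumes no constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

# Hypothetical toy matrix: 4 samples x 3 features, one zero, one missing value.
X = [
    [100.0, 20.0, 5.0],
    [80.0, 0.0, 10.0],
    [120.0, 30.0, None],
    [90.0, 25.0, 8.0],
]
X_scaled = autoscale(log10_transform(impute_fifth_of_min(total_sum_normalize(X))))
```

After autoscaling, every column has zero mean and unit standard deviation, so variables measured on very different intensity scales contribute equally to the subsequent models.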

Exploratory analysis was conducted independently on each data block using Principal Component Analysis (PCA) after column autoscaling, and SUM-PCA was performed on the fused data blocks [22, 23]. PCA is an unsupervised algorithm that reduces the dimensionality of high-dimensional datasets by projecting them onto a small number of principal components (PCs) that capture most of the variance. To visualize initial class discrimination, 95% Hotelling T² confidence ellipses, calculated independently for each class, were added to the score plots. To ensure the reliability and comparability of the classification models, the Kennard-Stone (KS) algorithm was applied to define common training and test sets for all omics modalities. This algorithm divides the data into training and test sets based on the distances between samples; further information can be found in Additional file 1: Section S.2.2 Kennard-Stone algorithm. The KS algorithm was applied to the SUM-PCA super-scores (T_sup) of each class, i.e., SUM-PCA was calculated independently for each class set, and 70% of the samples of each class were assigned to the training set and the remaining 30% to the test set [24].
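The distance-based selection performed by the Kennard-Stone algorithm can be illustrated with the classic max-min rule: start from the two most distant samples, then repeatedly add the sample whose nearest already-selected neighbour is farthest away. The Python sketch below is illustrative only; the function names, the Euclidean distance, the rounding of the 70% fraction, and the toy score values are assumptions, and the authors' exact procedure is given in Additional file 1: Section S.2.2.

```python
import math

def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard-Stone max-min rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Seed with the two most distant points.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda ij: dist[ij[0]][ij[1]])
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select:
        # Pick the candidate farthest from its nearest already-selected point.
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

def split_class(scores, train_fraction=0.7):
    """Split one class into training and test indices using Kennard-Stone."""
    n_train = max(2, round(train_fraction * len(scores)))
    train = kennard_stone(scores, n_train)
    test = [k for k in range(len(scores)) if k not in train]
    return train, test

# Hypothetical score coordinates (two components) for the samples of one class.
scores = [[0.0, 0.1], [1.2, -0.3], [2.5, 0.8], [-1.0, 1.5], [0.4, 0.4],
          [3.1, -1.2], [-2.2, -0.7], [0.9, 2.0], [-0.5, -1.8], [1.7, 1.1]]
train_idx, test_idx = split_class(scores)
```

Running the split separately on each class, as described above, keeps the 70/30 proportion within every class while the training set spans the score space as evenly as possible.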

Two independent classification models, Partial Least Squares Discriminant Analysis (PLS-DA) [25] and Soft Independent Modelling of Class Analogy (SIMCA) [26, 27], were built for each modality. The optimal number of latent variables (LVs) and PCs was determined by leave-one-out cross-validation to minimize misclassification errors and maximize accuracy. Models were evaluated using confusion matrices. For a given class, True Positives (TP) are the samples of that class correctly assigned to it, True Negatives (TN) are the samples of other classes correctly not assigned to it, False Positives (FP) are the samples of other classes assigned to it, and False Negatives (FN) are the samples of that class not assigned to it. Sensitivity, specificity, and accuracy were calculated for each class from these values.
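The per-class figures of merit follow directly from the confusion matrix: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and accuracy = (TP + TN) / total. The Python sketch below shows the calculation on a made-up three-class matrix; the function name, class labels, and counts are hypothetical, and it assumes every sample is assigned to exactly one class (SIMCA can also leave samples unassigned, which this sketch does not handle).

```python
def per_class_metrics(confusion, labels):
    """Compute sensitivity, specificity and accuracy for each class.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]                         # class k assigned to k
        fn = sum(confusion[k]) - tp                  # class k assigned elsewhere
        fp = sum(row[k] for row in confusion) - tp   # other classes assigned to k
        tn = total - tp - fn - fp                    # other classes not assigned to k
        metrics[label] = {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / total,
        }
    return metrics

# Hypothetical confusion matrix for three classes (rows: true, columns: predicted).
labels = ["class_1", "class_2", "class_3"]
confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]
results = per_class_metrics(confusion, labels)
```

For class_1 in this toy matrix, TP = 8, FN = 2, FP = 2 and TN = 18, which gives a sensitivity of 0.8 and a specificity of 0.9.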

Additionally, loadings and Variable Importance in Projection (VIP) scores were analysed to identify relevant molecules and were compared with the p-values obtained from N-way analysis of variance (ANOVA). The correlation of these variables with age was also computed, and its significance was reported. For more details about the chemometrics approach, refer to Additional file 1: Section S.2.3 Classification algorithms.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
