Chemometrics and multivariate data analysis

Vicky Caponigro
Anna L. Tornesello
Fabrizio Merciai
Danila La Gioia
Emanuela Salviati
Manuela G. Basilicata
Simona Musella
Francesco Izzo
Angelo S. Megna
Luigi Buonaguro
Eduardo Sommella
Franco M. Buonaguro
Maria L. Tornesello
Pietro Campiglia

The filtered data were processed and analysed in MATLAB R2022b (MathWorks Inc., Natick, MA, USA) using both custom-developed routines and standard MATLAB functions for multivariate data analysis. Each dataset was analysed separately, and the datasets were also combined by low-level data fusion [22]. Further information on the low-level data fusion can be found in Additional file 1: Section S.2.1 Data Fusion.
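Low-level data fusion amounts to joining the data blocks variable-wise, so that every sample is described by the variables of all blocks. The Python sketch below illustrates the idea only; it is not the authors' MATLAB routine. The function name `fuse_blocks` and the block weighting (division by the square root of the number of variables, so that a large block does not dominate) are assumptions, and the scaling actually used is described in Additional file 1.

```python
import math


def fuse_blocks(blocks):
    """Concatenate data blocks column-wise (low-level fusion).

    Each block is a list of rows (samples x variables). All blocks must
    describe the same samples in the same row order, and are assumed to
    be pre-processed (e.g. autoscaled) already.
    """
    n_samples = len(blocks[0])
    if any(len(block) != n_samples for block in blocks):
        raise ValueError("all blocks must contain the same samples")

    fused = [[] for _ in range(n_samples)]
    for block in blocks:
        # Assumed block weighting: 1 / sqrt(number of variables in the block).
        weight = 1.0 / math.sqrt(len(block[0]))
        for i, row in enumerate(block):
            fused[i].extend(value * weight for value in row)
    return fused
```

Calling `fuse_blocks([lipid_block, metabolite_block])` with two hypothetical blocks returns one table with as many rows as samples and as many columns as the two blocks have variables in total.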

The lipidomics and metabolomics datasets were pre-processed independently. Metabolomics datasets were normalized by the total ion sum in MATLAB, while lipidomics data were normalized against class-specific internal standards. Missing values and zeros were replaced with one-fifth of the minimum value of the corresponding molecule in the dataset. The values were then log10-transformed and autoscaled, i.e., each variable was centered by subtracting its mean and then divided by its standard deviation. These pre-processing steps were completed before any further chemometric modelling was performed.
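The imputation, log-transformation and autoscaling steps can be sketched as follows. This is a minimal Python illustration, not the authors' MATLAB code. The function name `preprocess`, the use of `None` to mark missing values, and the use of the sample standard deviation (n - 1 denominator) are assumptions. Normalization is assumed to have been applied beforehand.

```python
import math
import statistics


def preprocess(matrix):
    """Impute, log10-transform and autoscale a samples x variables table.

    `matrix` is a list of rows. Missing values are given as None or 0.
    Each variable needs at least two distinct positive values, otherwise
    the minimum or the standard deviation is undefined.
    """
    n_vars = len(matrix[0])
    columns = []
    for j in range(n_vars):
        col = [row[j] for row in matrix]
        observed = [v for v in col if v is not None and v > 0]
        fill = min(observed) / 5.0  # one-fifth of the variable minimum
        col = [v if (v is not None and v > 0) else fill for v in col]
        col = [math.log10(v) for v in col]  # base-10 logarithm
        mean = statistics.mean(col)
        sd = statistics.stdev(col)  # sample SD (n - 1), an assumption
        columns.append([(v - mean) / sd for v in col])
    # Transpose back to samples x variables.
    return [list(row) for row in zip(*columns)]
```

After this step every variable has zero mean and unit standard deviation, so abundant and scarce molecules contribute on the same scale to the subsequent models.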

Exploratory analysis was conducted independently on each data block using Principal Component Analysis (PCA) after column autoscaling, and SUM-PCA was performed on the fused data blocks [22, 23]. PCA is an unsupervised algorithm used to analyse and reduce the dimensionality of high-dimensional datasets by projecting them onto a small number of principal components (PCs) that capture most of the variance. To visualize the initial discrimination between classes, 95% Hotelling's T² confidence ellipses, calculated independently for each class, were added to the score plots. To ensure the reliability and comparability of the classification models, the Kennard-Stone (KS) algorithm was applied to define common training and test sets for all omics modalities. This algorithm divides the data into training and test sets based on the distances between samples; further information can be found in Additional file 1: Section S.2.2 Kennard-Stone algorithm. The KS algorithm was applied to the SUM-PCA super scores (Tsup) of each class: SUM-PCA was calculated independently for each class, 70% of the samples in each class were assigned to the training set, and the remaining 30% formed the test set [24].
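The Kennard-Stone selection can be sketched as below, in its classical form with Euclidean distances. This is an illustrative Python version, not the authors' MATLAB routine. The function name `kennard_stone` and the argument `scores` (a hypothetical samples x components score table for one class) are assumptions. The sketch starts from the two most distant samples and then repeatedly adds the sample that lies farthest from its nearest already-selected neighbour.

```python
import math


def kennard_stone(scores, train_fraction=0.7):
    """Return (train_idx, test_idx) for the rows of `scores`.

    `scores` is a list of rows (at least two samples). The training set
    receives round(train_fraction * n) samples, the test set the rest.
    """
    n = len(scores)
    n_train = max(2, round(train_fraction * n))
    dist = [[math.dist(a, b) for b in scores] for a in scores]

    # Start with the two most distant samples.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [k for k in range(n) if k not in (first, second)]

    # Add the sample whose distance to its closest selected sample is largest.
    while len(selected) < n_train and remaining:
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining
```

Running the function once per class on that class's scores and pooling the resulting index lists gives a class-balanced 70/30 split that can be reused for every data block.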

Two independent classification models, Partial Least Squares Discriminant Analysis (PLS-DA) [25] and Soft Independent Modelling of Class Analogy (SIMCA), were built for each modality [26, 27]. The optimal number of latent variables (LVs) and PCs was determined by leave-one-out cross-validation to minimize misclassification errors and maximize accuracy. Models were evaluated using confusion matrices. For a given class, True Positives (TP) are the samples of that class that were correctly assigned to it, and True Negatives (TN) are the samples of other classes that were correctly not assigned to it. False Positives (FP) are the samples of other classes that were assigned to the class, and False Negatives (FN) are the samples of the class that were assigned to another class or not classified. Sensitivity, specificity, and accuracy were calculated for each class from these values.
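The per-class figures of merit follow from the confusion matrix with the standard definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and accuracy = (TP + TN) / total. The Python sketch below is illustrative only. The function name `class_metrics`, the row/column convention, and the class labels in the usage note are assumptions, not taken from the study.

```python
def class_metrics(confusion, labels):
    """Sensitivity, specificity and accuracy for each class.

    confusion[i][j] is the number of samples of true class i predicted as
    class j. An optional extra last column may hold unassigned samples,
    which then count as false negatives for their true class.
    Every class must have at least one sample.
    """
    total = sum(sum(row) for row in confusion)
    results = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]                         # class k, assigned to k
        fn = sum(confusion[k]) - tp                  # class k, not assigned to k
        fp = sum(row[k] for row in confusion) - tp   # other classes, assigned to k
        tn = total - tp - fn - fp                    # other classes, not assigned to k
        results[label] = {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / total,
        }
    return results
```

For a hypothetical two-class matrix `[[8, 2], [1, 9]]` with labels `["A", "B"]`, class A has TP = 8, FN = 2, FP = 1 and TN = 9, which gives a sensitivity of 0.80, a specificity of 0.90 and an accuracy of 0.85.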

Additionally, loadings and Variable Importance in Projection (VIP) scores were analysed to identify relevant molecules and were compared with the p-values obtained from N-way ANalysis Of VAriance (ANOVA). The correlation of these variables with age was also computed, and its significance was reported. For more details about the chemometrics approach, refer to Additional file 1: Section S.2.3 Classification algorithms.
