4.6. Statistical Analysis

Benjamin Buchard
Camille Teilhet
Natali Abeywickrama Samarakoon
Sylvie Massoulier
Juliette Joubert-Zakeyh
Corinne Blouin
Christelle Reynes
Robert Sabatier
Anne-Sophie Biesse-Martin
Marie-Paule Vasson
Armando Abergel
Aicha Demidem

A pre-screening step removed features (ppm locations) that carry no information for discrimination: technical artefacts, constant features, and redundant features. The latter two filters were applied independently for each comparison (Figure 5).
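The constant and redundant filters could be sketched as follows. This is an illustration only: the function name `prescreen`, the use of absolute Pearson correlation as the redundancy criterion, and the 0.95 cut-off are assumptions, since the text does not state how redundancy was measured. Removal of technical artefacts is not shown because it is not specified here.

```python
import numpy as np

def prescreen(X, corr_threshold=0.95):
    """Return the column indices kept after dropping constant and redundant features.

    X: (n_samples, n_features) array of intensities for the samples of one comparison.
    corr_threshold: hypothetical cut-off; the source does not give one.
    """
    X = np.asarray(X, dtype=float)
    # Constant filter: keep columns whose range (max - min) is non-zero.
    varying = np.flatnonzero(X.max(axis=0) - X.min(axis=0) > 0)
    # Redundancy filter: greedily keep a column only if its absolute correlation
    # with every column already kept is below the threshold.
    corr = np.atleast_2d(np.abs(np.corrcoef(X[:, varying], rowvar=False)))
    kept = []
    for j in range(len(varying)):
        if all(corr[j, k] < corr_threshold for k in kept):
            kept.append(j)
    return varying[np.array(kept, dtype=int)]
```

Because the filters are applied per comparison, `prescreen` would be called once on the samples of each pair of groups, so the retained ppm locations can differ between comparisons.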

Figure 5. Complete workflow of the discrimination process, HCC-F0F1 compared to HCC-F3F4. (A) Raw Nuclear Magnetic Resonance (NMR) aqueous spectra of HCC-F0F1: ≈ 4500 ion peaks. (B) After removal of technical artefacts, constant and redundant features: 1275 ion peaks. (C) Choice of the most discriminant metabolites in the aqueous phase using a Genetic Algorithm with Linear Discriminant Analysis: 5 ion peaks (minimum 45 selections); final solution: 3 identified discriminant metabolites.

A univariate analysis is unlikely to reveal the best synergistic subset of features, so a multivariable analysis combining several metabolites is more informative. Even after pre-screening, however, exhaustively testing all feature subsets was not feasible within a reasonable amount of time. We therefore used genetic algorithms (GAs) to select subsets. GAs are optimization algorithms based on the process of natural selection [50,51]; they provide approximate solutions to complex optimization problems. First, a population of candidate solutions is randomly generated. This population then evolves through the iterative application of mutation, cross-over, and selection.
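To give a sense of why exhaustive search is out of reach, the short sketch below counts the candidate subsets. It assumes the 1275 features reported in the workflow figure and the maximal subset size of 10 used in this section; the helper name `count_subsets` is ours.

```python
from math import comb

def count_subsets(n_features, max_size):
    """Number of non-empty feature subsets containing at most max_size features."""
    return sum(comb(n_features, k) for k in range(1, max_size + 1))

n_candidates = count_subsets(1275, 10)
print(f"{n_candidates:.2e} candidate subsets")
```

The count is on the order of 10^24, and each candidate would need its own cross-validated model fit, which motivates a heuristic search such as a GA.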

In our model, solutions were subsets of features. Mutation randomly altered each solution by adding, removing, or substituting a feature. Cross-over randomly combined the features of two solutions. Selection is the only operator that increases the quality of solutions across generations; it relies on a fitness function that quantifies solution quality. A Linear Discriminant Analysis (LDA) was applied to each solution [52], and its accuracy was estimated by two-fold cross-validation to avoid over-fitting. The fitness function was this accuracy penalized by the subset size, to favor parsimonious solutions; we set the maximal subset size to 10. The GA was run 10 times, and the solutions obtained in the last generations were evaluated by their average cross-validated LDA accuracy. To identify the most interesting features, we used the frequency with which each feature was selected in the final generations (Figure 5): the more often a feature is selected to survive across generations, the more likely it is to contribute to discrimination. The frequency threshold was set using “random” GAs without any learning step.
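The procedure described above can be illustrated with a compact sketch. It is not the authors' implementation. The population size, number of generations, linear penalty weight, truncation selection, fixed fold assignment per run, and all function names are assumptions made for illustration. The two-class LDA is written by hand with NumPy (pooled covariance, equal priors), class labels are assumed to be coded 0 and 1, and the number of features is assumed to exceed the maximal subset size.

```python
import numpy as np

MAX_SIZE = 10  # maximal subset size stated in the text

def lda_accuracy(Xtr, ytr, Xte, yte):
    """Held-out accuracy of a two-class LDA (pooled covariance, equal priors)."""
    m0, m1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    centred = np.vstack([Xtr[ytr == 0] - m0, Xtr[ytr == 1] - m1])
    Sw = centred.T @ centred / max(len(Xtr) - 2, 1)  # pooled within-class covariance
    w = np.linalg.pinv(Sw) @ (m1 - m0)               # discriminant direction
    pred = ((Xte - (m0 + m1) / 2) @ w > 0).astype(int)
    return float((pred == yte).mean())

def two_folds(y, rng):
    """Stratified random split of the sample indices into two halves."""
    a, b = [], []
    for label in (0, 1):
        idx = rng.permutation(np.flatnonzero(y == label))
        a.extend(idx[: len(idx) // 2])
        b.extend(idx[len(idx) // 2:])
    return np.array(a), np.array(b)

def fitness(subset, X, y, folds, penalty=0.01):
    """Two-fold cross-validated accuracy minus an assumed linear size penalty."""
    cols = sorted(subset)
    accs = [lda_accuracy(X[np.ix_(tr, cols)], y[tr], X[np.ix_(te, cols)], y[te])
            for tr, te in (folds, folds[::-1])]
    return float(np.mean(accs)) - penalty * len(cols)

def mutate(subset, n_features, rng):
    """Add, remove, or substitute one feature, keeping 1 <= size <= MAX_SIZE."""
    s = set(subset)
    op = str(rng.choice(["add", "remove", "substitute"]))
    if (op == "add" and len(s) >= MAX_SIZE) or (op == "remove" and len(s) <= 1):
        op = "substitute"
    unused = sorted(set(range(n_features)) - s)  # features not in the parent subset
    if op in ("remove", "substitute"):
        s.remove(int(rng.choice(sorted(s))))
    if op in ("add", "substitute"):
        s.add(int(rng.choice(unused)))
    return frozenset(s)

def crossover(p1, p2, rng):
    """Child is a random draw from the union of the two parents' features."""
    union = sorted(p1 | p2)
    size = int(rng.integers(1, min(MAX_SIZE, len(union)) + 1))
    return frozenset(int(f) for f in rng.choice(union, size=size, replace=False))

def run_ga(X, y, pop_size=30, generations=20, seed=0):
    """Return the final population, sorted from fittest to least fit."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    folds = two_folds(y, rng)
    pop = [frozenset(int(f) for f in rng.choice(
               n, size=int(rng.integers(1, MAX_SIZE + 1)), replace=False))
           for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            i, j = rng.choice(pop_size, size=2, replace=False)
            children.append(mutate(crossover(pop[i], pop[j], rng), n, rng))
        ranked = sorted(pop + children,
                        key=lambda s: fitness(s, X, y, folds), reverse=True)
        pop = ranked[:pop_size]  # selection: keep the fittest (truncation, assumed)
    return pop

def selection_frequency(final_pops, n_features):
    """Count how often each feature appears across the final populations."""
    counts = np.zeros(n_features, dtype=int)
    for pop in final_pops:
        for s in pop:
            counts[list(s)] += 1
    return counts
```

With a data matrix `X` and 0/1 labels `y`, ten independent runs as in the text would be `pops = [run_ga(X, y, seed=r) for r in range(10)]`, followed by `selection_frequency(pops, X.shape[1])`. Features whose count exceeds a threshold would then be retained. The text does not detail how the “random” GAs were built; one possible reading, offered only as an assumption, is to replace `fitness` with random scores and use the resulting counts as a null reference for that threshold.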
