Statistical tools and multivariate analysis

Catalina Dumitrascu
Yiannis Fiamegos
Maria Beatriz de la Calle Guntiñas

Student’s t-tests (95% confidence level) were run with the software Statistica (TIBCO, version 13.5.0.17) to identify the elements (variables) whose mass fractions differed significantly among the Basmati, Thai, and Long Grain rice groups.
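As an illustration of the univariate comparison described above, the following minimal sketch computes a two-sample Student's t statistic with pooled variance for one element in two groups. It does not reproduce the Statistica workflow: the function name `student_t`, the mass fraction values, and the pairwise comparison of groups are assumptions made for the example, and the critical value is the tabulated two-sided 95% value for 8 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t statistic (pooled variance, equal variances assumed)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2  # statistic and degrees of freedom


# Hypothetical mass fractions (mg/kg) of one element in two rice groups
basmati = [1.0, 2.0, 3.0, 4.0, 5.0]
thai = [3.0, 4.0, 5.0, 6.0, 7.0]

t_stat, dof = student_t(basmati, thai)

# Tabulated two-sided critical value at the 95% level for 8 degrees of freedom
T_CRIT_95_DF8 = 2.306
significant = abs(t_stat) > T_CRIT_95_DF8
```

With these made-up values the statistic is -2.0 with 8 degrees of freedom, so the difference would not be declared significant at the 95% level.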

Multivariate analysis of the data was carried out with the software SIMCA version 15.0.2 (Sartorius Stedim Biotech AS, Malmö, Sweden) [17].

Principal component analysis (PCA), an unsupervised technique, and partial least squares discriminant analysis (PLS-DA), a supervised technique that maximises the differences between groups, were used to evaluate visually whether the analysed rice samples formed distinct clusters corresponding to Basmati, Thai, and Long Grain rice. PLS-DA and soft independent modelling by class analogy (SIMCA, called PCA-Class in the software), a supervised technique that maximises the similarities among the observations within one group, were used for classification purposes.

In all multivariate studies, the number of principal components was limited to three to avoid overfitting.
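To make the restriction to three components concrete, the sketch below computes a three-component PCA from a singular value decomposition. It is not the SIMCA implementation: the function name `pca_scores`, the random data matrix, and the choice of mean-centring with unit-variance scaling are assumptions made for the example.

```python
import numpy as np


def pca_scores(X, n_components=3):
    """PCA by SVD on mean-centred, unit-variance-scaled data (scaling is an assumption)."""
    Xc = X - X.mean(axis=0)
    Xs = Xc / Xc.std(axis=0, ddof=1)
    U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # sample coordinates
    loadings = Vt[:n_components].T                    # variable contributions
    explained = (S ** 2 / np.sum(S ** 2))[:n_components]
    return scores, loadings, explained


# Hypothetical data: 30 samples x 8 element mass fractions
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))

scores, loadings, explained = pca_scores(X, n_components=3)
```

The `explained` array gives the fraction of total variance captured by each retained component, which is one way to judge how much information is lost by stopping at three.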

The presence of outliers in each of the three studied groups was evaluated with the Mahalanobis distance (DModX PS+), which is the distance between a point and the centroid of the distribution. A sample was considered an outlier when its Mahalanobis distance was larger than Dcrit (95% confidence level). None of the samples included in this study was flagged as an outlier, and hence none was removed before further statistical analysis.
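The outlier rule above can be sketched as a distance-to-centroid calculation followed by a threshold comparison. This is an illustration only and does not reproduce how the software derives DModX PS+ or Dcrit: the function name `mahalanobis_distances`, the data, and the threshold (the square root of the tabulated 95% chi-square quantile for 2 degrees of freedom, 5.991) are assumptions made for the example.

```python
import numpy as np


def mahalanobis_distances(X, reference):
    """Mahalanobis distance of each row of X to the centroid of `reference`."""
    centroid = reference.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(reference, rowvar=False))
    diff = X - centroid
    # Quadratic form diff_i^T * inv_cov * diff_i for every row i
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))


# Hypothetical two-variable reference group and three samples to test
reference = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
samples = np.array([[1.0, 1.0], [5.0, 1.0], [2.0, 2.0]])

# Stand-in threshold (assumption): sqrt of the 95% chi-square quantile, 2 d.o.f.
D_CRIT = np.sqrt(5.991)

distances = mahalanobis_distances(samples, reference)
is_outlier = distances > D_CRIT
```

In this toy case only the second sample, which lies far from the centroid along one axis, exceeds the threshold.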

DModX PS+ was also used to study the false positive (FP) and false negative (FN) rates of each model. False positives are samples not flagged as outliers by a model although they do not belong to the targeted group, and false negatives are samples flagged as outliers although they belong to the targeted group. When the rate of false negatives is high, the sensitivity of the model is poor, while a high rate of false positives is linked to poor specificity. The proportion of correct classifications (true positives, TP, and true negatives, TN) is referred to as the accuracy of the model. The rates of TP, TN, FP, and FN of the different models were assessed by leave-one-out cross-validation: each sample was left out in turn and the resulting model was used to classify the left-out sample. The sensitivity, specificity, and accuracy of the models were calculated according to Barbosa et al. [18].
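The performance figures defined above can be computed from the cross-validation outcomes as in the following sketch. It uses the usual definitions (sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), accuracy = (TP+TN)/total), which are assumed here to match those of Barbosa et al. [18]; the function name `performance` and the example outcomes are hypothetical.

```python
def performance(belongs, accepted):
    """Sensitivity, specificity and accuracy of one class model.

    belongs[i]  -- True if sample i truly belongs to the targeted group
    accepted[i] -- True if the model did NOT flag sample i as an outlier
    """
    pairs = list(zip(belongs, accepted))
    tp = sum(b and a for b, a in pairs)          # members accepted
    fn = sum(b and not a for b, a in pairs)      # members rejected
    fp = sum(not b and a for b, a in pairs)      # non-members accepted
    tn = sum(not b and not a for b, a in pairs)  # non-members rejected
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical leave-one-out outcomes for 4 members and 6 non-members
belongs = [True, True, True, True, False, False, False, False, False, False]
accepted = [True, True, True, False, True, False, False, False, False, False]

metrics = performance(belongs, accepted)
```

Here one member is rejected and one non-member is accepted, giving a sensitivity of 0.75, a specificity of 5/6, and an accuracy of 0.8.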

Rice samples were classified following a two-step approach. First, samples were classified using PCA-Class, which is the term used by the software for SIMCA. When a sample was allocated to more than one class, it was classified as belonging to the one with the highest probability of class membership. When the highest probability of class membership for a sample did not correspond to the group to which the sample belonged according to the label information, the sample was considered a false negative. False negatives were classified in a second step using a PLS-DA model constructed for two groups: the group to which the sample belonged according to the information on the label and the group indicated by PCA-Class as having the highest probability of class membership. This approach has been successfully applied to the classification of Spanish PDO honeys of different geographical and botanical origin [14].
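The decision logic of the two-step approach can be outlined as follows. This sketch covers only the control flow, not the models themselves: the function `two_step_classify`, the class membership probabilities, and the `stub_pls_da` stand-in for the two-group PLS-DA model are all hypothetical.

```python
def two_step_classify(sample, label, class_probabilities, pls_da_two_class):
    """Return the final class of a sample.

    class_probabilities -- dict mapping class name to membership probability
                           from the first-step (PCA-Class/SIMCA) models
    pls_da_two_class    -- callable(sample, group_a, group_b) standing in for
                           a two-group PLS-DA model (hypothetical interface)
    """
    # Step 1: keep the class with the highest probability of membership
    best = max(class_probabilities, key=class_probabilities.get)
    if best == label:
        return best
    # Step 2: disagreement with the label, so decide between the two candidates
    return pls_da_two_class(sample, label, best)


def stub_pls_da(sample, group_a, group_b):
    """Placeholder for a fitted two-group PLS-DA model (assumption)."""
    return group_a if sample["score"] >= 0 else group_b


sample = {"score": 0.4}  # hypothetical discriminant score
agreed = two_step_classify(
    sample, "Basmati", {"Basmati": 0.8, "Thai": 0.3, "Long Grain": 0.1}, stub_pls_da
)
resolved = two_step_classify(
    sample, "Basmati", {"Basmati": 0.2, "Thai": 0.6, "Long Grain": 0.1}, stub_pls_da
)
```

In the second call the first step favours Thai while the label says Basmati, so the decision is passed to the two-group model, which in this stub returns the labelled group.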
