A canonical discriminant analysis (CDA) was performed on genetic diversity parameters derived from pedigree analyses, namely inbreeding (F, %), average relatedness (ΔR, %), average coancestry (C, %), non-random mating rate (α), genetic conservation index (GCI), number of maximum, complete, and equivalent generations, and number of offspring per individual, using the breed (PRE, PRá, and Há horse) to which each animal belonged as the labeling classification criterion [34,35,36,37]. The aims were to measure the variation in the estimated genetic diversity parameters and to establish, identify, and outline within-population clusters. Hence, we determined the percentage of individuals correctly allocated to their population of origin, as opposed to those animals that were statistically misclassified, that is, attributed to a breed different from the one to which they belonged; and we sought a linear combination of genetic diversity parameters providing maximum separation between the potentially existing groups when the classification criterion was the breed of the individual. CDA was also used to plot pairs of canonical variables to help visually interpret group differences. Variable selection was performed using regularized forward stepwise multinomial logistic regression algorithms.
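As a minimal illustrative sketch of this workflow (not the authors' SPSS/XLSTAT runs), a linear discriminant classifier can be fitted to the pedigree-derived parameters in Python; the file name, the DataFrame df, and the column names below are hypothetical assumptions:

```python
# Hypothetical sketch of the CDA workflow with scikit-learn; the CSV file,
# DataFrame "df", and column names are illustrative assumptions only.
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

df = pd.read_csv("pedigree_parameters.csv")          # one row per animal
predictors = ["F", "AR", "C", "alpha", "GCI",
              "max_gen", "complete_gen", "equiv_gen", "n_offspring"]
X, y = df[predictors], df["breed"]                   # breed: PRE / PRa / Ha

# With 3 breeds, at most 2 canonical discriminant functions exist
cda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

# Percentage of animals allocated to their breed of origin
print(f"Correctly classified: {cda.score(X, y):.1%}")

# Canonical variates for plotting pairs of discriminant dimensions
scores = cda.transform(X)                            # shape (n_animals, 2)
```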
The choice of a forward stepwise analysis was made after considering the following alternatives. The first option considered was a regularized canonical discriminant analysis. Regularization has been reported to improve the estimation of covariance matrices when the number of predictors exceeds the number of observations, in which case it may increase the efficiency of discriminant analysis. However, this was not our case; instead, the nature of the variables considered was likely to cause considerable multicollinearity problems. Such problems may derive from the fact that some of the variables initially considered were computed using others (which were also included) among the terms of their formulas. As a result, even if the models were simplified, the information carried by removed variables would still be indirectly represented.
Additionally, the analysis used in the present study must be robust to highly unequal sample sizes. To address this compromising situation, we used the approximation proposed by Roemisch et al. [38] of a regularized stepwise discriminant analysis. Unequal group sample sizes may affect the quality of the classification, but not the canonical axes themselves. For these reasons, as suggested by Tai and Pan [39] and Roemisch et al. [38], prior probabilities were regularized based on group sizes, using the "compute from group sizes" prior probability option in SPSS version 25.0 [40], instead of considering them to be equal.
Furthermore, even if unequal sample sizes are acceptable, as reported by Poulsen and French [41], some requirements must still be fulfilled. For instance, the sample size of the smallest group needs to exceed the number of predictor variables. As a rule of thumb, the smallest group should contain at least 20 observations for every 4 or 5 predictors, and the maximum number of independent variables is n − 2, where n is the sample size. Although such a low sample size may be valid, it is not encouraged; in general, it is best to have 4 or 5 times as many observations as independent variables for discriminant approaches to be efficient. The present study satisfies this requirement by far; hence, the potential distorting effects of comparing unequal group sample sizes are avoided.
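A minimal arithmetic check of these rules of thumb is shown below; the group counts are placeholders, not the actual pedigree sizes:

```python
# Placeholder group counts (hypothetical); replace with real pedigree sizes
group_sizes = {"PRE": 1500, "PRa": 300, "Ha": 120}
n_predictors = 9

smallest = min(group_sizes.values())
total = sum(group_sizes.values())
assert smallest > n_predictors, "smallest group must exceed predictor count"
assert total >= 5 * n_predictors, "prefer 4-5 observations per predictor"
```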
On the other hand, after the stepwise canonical discriminant analysis approach was chosen, a decision had to be made on whether to perform backward or forward stepwise variable selection. As a drawback, the residual sum of squares of stepwise selection will typically exceed that of best-subset selection when the predictors considered are correlated. For this reason, we performed a multicollinearity analysis and discarded correlated variables exceeding minimally acceptable levels. In this context, it must be considered that when the regressors (variables) are independent, that is, uncorrelated, forward and backward stepwise selection methods choose exactly the same variables.
In conclusion, the forward stepwise selection method was chosen because it is more computationally efficient (less time-consuming) than backward selection methods, and because the number of observations does not need to be strictly greater than the number of variables, which may make this approach valid for future research facing the same sample contexts.
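As an illustrative stand-in for the regularized forward stepwise routine run in SPSS (an assumption, not the authors' exact algorithm), scikit-learn's SequentialFeatureSelector can perform forward selection over the same predictors:

```python
# Forward stepwise variable selection feeding a discriminant classifier;
# X, y, and "predictors" are the hypothetical objects from the CDA sketch.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SequentialFeatureSelector

selector = SequentialFeatureSelector(
    LinearDiscriminantAnalysis(),
    direction="forward",          # add variables one at a time
    n_features_to_select="auto",  # stop when the score gain falls below tol
    tol=1e-3,
    cv=5,
)
selector.fit(X, y)
print([p for p, keep in zip(predictors, selector.get_support()) if keep])
```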
Canonical Discriminant Analysis was performed using the Discriminant routine of the Classify package of the software SPSS version 25.0 [40] and the Discriminant Analysis (DA) routine of the Analysing Data package of XLSTAT Pearson Edition [42].
Before running a canonical discriminant analysis (CDA), the multicollinearity assumption should be tested to ensure that redundancies among the variables considered do not affect the structure of the matrices or overinflate the explained variance. The variance inflation factor (VIF) was computed and used as an indicator of multicollinearity. Computationally, it is defined as the reciprocal of tolerance: 1/(1 − R2). Recommended maximum VIF values of 5 [43] and even 4 [44] can be found in the literature. VIF was computed using the Linear routine of the Regression package of the software SPSS, version 25.0 [40].
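The same screening can be sketched with statsmodels instead of SPSS, mirroring the 1/(1 − R2) definition; X is the hypothetical predictor DataFrame from the earlier sketch:

```python
# VIF screening with statsmodels; threshold of 5 follows the text above.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)      # add intercept, as in SPSS regression
vif = {col: variance_inflation_factor(X_const.values, i)
       for i, col in enumerate(X_const.columns) if col != "const"}

flagged = {name: v for name, v in vif.items() if v > 5}
print("Variables exceeding VIF > 5:", flagged)
```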
A canonical correlation analysis is a multivariate analysis of correlation. Canonical is the statistical term for the analysis of latent variables (not directly observed) that represent multiple directly observable variables. The maximum number of canonical correlations between two sets of variables is the number of variables in the smaller set. The first canonical correlation explains most of the relationship between the sets [45]. Canonical correlations are interpreted as Pearson’s ρ; hence, the squared canonical correlation (Rc2) [46] is the percentage of variance in one set of variables explained by the other set along the dimension represented by the given canonical correlation (usually the first), that is, the percentage of shared variance along this dimension (analogous to R2 in multiple regression) [47]. As a rule of thumb, meaningful dimensions are detected when their canonical correlations are ≥0.30, which corresponds to about 10% of explained variance. All meaningful and interpretable canonical correlations should be reported, although reporting only the first dimension is common in research [48].
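Purely to illustrate these mechanics (the split of the parameters into two sets below is a hypothetical example, not taken from the study), canonical correlations can be computed with scikit-learn's CCA and screened against the ≥0.30 rule of thumb:

```python
# Canonical correlations between two assumed variable sets; "df" is the
# hypothetical DataFrame from the CDA sketch.
import numpy as np
from sklearn.cross_decomposition import CCA

X_set = df[["F", "AR", "C"]]                   # first set (assumed split)
Y_set = df[["GCI", "max_gen", "n_offspring"]]  # second set (assumed split)

n_dims = min(X_set.shape[1], Y_set.shape[1])   # max number of correlations
cca = CCA(n_components=n_dims).fit(X_set, Y_set)
U, V = cca.transform(X_set, Y_set)

for d in range(n_dims):
    r = np.corrcoef(U[:, d], V[:, d])[0, 1]    # canonical correlation
    if abs(r) >= 0.30:                         # rule of thumb from the text
        print(f"dim {d + 1}: r = {r:.2f}, shared variance = {r**2:.1%}")
```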
Wilks’ lambda test assesses which variables contribute significantly to the discriminant function. As a rule of thumb, the closer Wilks’ lambda is to 0, the higher the contribution of that variable to the discriminant function. The significance of Wilks’ lambda can be tested using a χ2 statistic. When the significance is below 0.05, the corresponding function can be concluded to explain group adscription well [49].
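A didactic sketch of the overall Wilks’ lambda for the one-way breed design follows, using Bartlett’s χ2 approximation (an assumption about the implementation detail, not the SPSS output); X and y are the hypothetical objects from the CDA sketch:

```python
# One-way MANOVA Wilks' lambda with Bartlett's chi-square approximation.
import numpy as np
from scipy.stats import chi2

groups = [X[y == b].values for b in y.unique()]
grand_mean = X.values.mean(axis=0)

# Within-group (W) and total (T) sums-of-squares-and-cross-products matrices
W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
T = (X.values - grand_mean).T @ (X.values - grand_mean)
wilks = np.linalg.det(W) / np.linalg.det(T)

n, p, k = len(X), X.shape[1], len(groups)
chi_sq = -(n - 1 - (p + k) / 2) * np.log(wilks)   # Bartlett approximation
dof = p * (k - 1)
print(f"Lambda = {wilks:.3f}, p = {chi2.sf(chi_sq, dof):.4f}")
```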
Pillai’s trace criterion was used to test the assumption of equal covariance matrices in the discriminant function analysis (DFA). As small significance levels (p < 0.001) are considered [50] and sample sizes are unequal, Pillai’s trace criterion is the only presumably acceptable test to determine the equality of covariance matrices [51]. When used to test large samples, Pillai’s criterion (as opposed to Wilks’ lambda) randomly deletes cases from the sample to equalize the numbers in each group, which enables the assumption that power is maintained at a sensible level. Furthermore, Pillai’s trace test is very robust, does not depend heavily on assumptions about the normality of the distribution of the data, and is also preferable if the assumption of homogeneity of variance-covariance matrices is violated. Pillai’s criterion was computed using the Multivariate routine of the General Linear Model package of the software SPSS, version 25.0 [40]. In general, a significance below 0.05 means that there is a significant difference in the dependent variables (genetic parameters) across the levels of the independent variable being tested, in our case the breed factor and its levels or possibilities (PRá, PRE, and Há horse breeds) [52].
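An analogous Pillai’s trace test can be reproduced with statsmodels’ MANOVA routine; the formula below assumes the hypothetical column names from the earlier sketches:

```python
# MANOVA of the genetic parameters on the breed factor; mv_test() reports
# Pillai's trace alongside Wilks' lambda and the other multivariate tests.
from statsmodels.multivariate.manova import MANOVA

manova = MANOVA.from_formula(
    "F + AR + C + GCI + max_gen + n_offspring ~ breed", data=df)
print(manova.mv_test())
```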
A preliminary principal component analysis (PCA) was performed to reduce the overall set of variables to a few meaningful components that contributed most to the variation among the breeds.
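A minimal sketch of this preliminary step, assuming the same hypothetical predictor DataFrame X:

```python
# PCA on the standardized parameters; explained-variance ratios indicate
# how many components are worth retaining.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)
print(pca.explained_variance_ratio_.round(3))   # variance per component
```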
Discriminant function analysis was used to determine the percentage of individuals assigned to their own breeds. The traditional approach to interpreting discriminant functions examines the sign and magnitude of the standardized discriminant weight (also referred to as a discriminant coefficient) assigned to each variable in computing the discriminant functions. Small weights may indicate either that a certain variable is irrelevant in determining a relationship or that it has been discarded because of a high degree of multicollinearity with the rest of the variables.
Discriminant loadings represent the variance shared between the independent variables and the discriminant function. Discriminant loadings can be interpreted like factor loadings to evaluate the relative contribution of each independent variable to the discriminant function. Variables exhibiting a discriminant loading of ≥|0.40| are considered substantially discriminating variables. Stepwise procedures may prevent non-significant variables from entering the function. At the same time, multicollinearity and other factors may preclude a variable from entering the equation, which does not necessarily mean that the variable lacks a substantial effect. Loadings are relatively more valid than weights for interpreting the discriminating power of the independent variables because of their correlational nature.
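Given their correlational definition, loadings can be sketched directly as correlations between each predictor and the discriminant scores, reusing the hypothetical cda model, X, and predictors from the first sketch:

```python
# Discriminant loadings as Pearson correlations between each predictor and
# the first discriminant function, with the |0.40| cut-off from the text.
import numpy as np

scores_1 = cda.transform(X)[:, 0]       # first discriminant function
for j, name in enumerate(predictors):
    loading = np.corrcoef(X.iloc[:, j], scores_1)[0, 1]
    flag = " <- substantive" if abs(loading) >= 0.40 else ""
    print(f"{name}: {loading:+.2f}{flag}")
```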
The comparison between variables measured on different scales can be performed considering standardized coefficients. Large absolute coefficients will denote a better discriminating ability. Discriminant scores can be computed by using the standardized discriminant function coefficients applied to data that have been centered and divided by the pooled within-cell standard deviations for the predictor variables, as discussed in IBM Corp. [53].
The data were standardized following the standard procedures described by Manly [54] before squared Mahalanobis distances and principal component analysis were calculated. Squared Mahalanobis distances were computed between populations using the following formula:

$$D_{ij}^{2} = (\bar{x}_{i} - \bar{x}_{j})'\, \mathrm{COV}^{-1}\, (\bar{x}_{i} - \bar{x}_{j})$$

where $D_{ij}^{2}$ is the distance between populations i and j, $\mathrm{COV}^{-1}$ is the inverse of the covariance matrix of the measured variable x, and $\bar{x}_{i}$ and $\bar{x}_{j}$ are the means of variable x in the ith and jth populations, respectively. The Mahalanobis squared distance, defined as the square of the distance between centroids, was used to determine the existence of significant differences in the values of the genetic diversity parameters across the three breeds [55]. Additionally, to confirm such differences, Nei’s minimum genetic distances [56] among the individuals of the breeds were computed. Dendrograms for the PRá, PRE, and Há breeds were constructed using the Unweighted Pair-Group Method with Arithmetic mean (UPGMA) Tree task from the Phylogeny procedure of MEGA X 10.0.5.
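A sketch of the centroid distance computation following the formula above; the use of the pooled within-group covariance matrix is an assumption about the exact matrix employed, and the breed labels are hypothetical:

```python
# Squared Mahalanobis distance between breed centroids; X and y are the
# hypothetical objects from the CDA sketch.
import numpy as np

groups = [X[y == b].values for b in y.unique()]
W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
cov_inv = np.linalg.inv(W / (len(X) - len(groups)))  # pooled covariance

def mahalanobis_sq(breed_i, breed_j):
    """Squared Mahalanobis distance between two breed centroids."""
    d = X[y == breed_i].mean().values - X[y == breed_j].mean().values
    return float(d @ cov_inv @ d)

print(mahalanobis_sq("PRE", "PRa"))     # hypothetical breed labels
```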
The percentage of correctly classified cases is called the hit ratio [57]. To establish whether the percentage of correctly classified cases is sufficient to consider that the discriminant functions yield valid results, the leave-one-out cross-validation option was used as a form of significance testing. As reported by Schneider [58], leave-one-out cross-validation is K-fold cross-validation taken to its logical extreme, with K equal to N, the number of data points in the set. That means that, N separate times, the function approximator is trained on all the data except for one point, and a prediction is made for that point. In this context, the classification rate of a cross-validated discriminant analysis should be at least 25% greater than that obtained by chance for classification accuracy to be considered sufficiently achieved.
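The leave-one-out scheme can be sketched directly with scikit-learn, again using the hypothetical X and y from the first sketch:

```python
# Leave-one-out cross-validation of the discriminant classifier: each of
# the N folds trains on N-1 animals and predicts the one held out.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

loo_scores = cross_val_score(LinearDiscriminantAnalysis(), X, y,
                             cv=LeaveOneOut())
hit_ratio = loo_scores.mean()           # proportion correctly classified
print(f"Cross-validated hit ratio: {hit_ratio:.1%}")
```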
The validity of the cross-validation can be supported by Press’s Q significance test of classification accuracy for original against predicted group memberships. Unlike the t-test for groups of equal size, Press’s Q statistic allows groups (breeds in our case) to be of unequal size [59]. Press’s Q statistic can be used to compare the discriminating power of a cross-validated function to that of a model classifying individuals at random (50% of the cases correctly classified), as follows:

$$Q = \frac{\left[N - (nK)\right]^{2}}{N(K - 1)}$$

where N is the number of individuals in the sample, n is the number of observations correctly classified, and K is the number of groups. Afterwards, the value of Press’s Q statistic should be compared to the critical value of 6.63 for χ2 with one degree of freedom at a significance of 0.01. Under this assumption, when Press’s Q exceeds the critical value of χ2 = 6.63, the cross-validated classification can be regarded as significantly better than chance.
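The computation follows directly from the formula above; the counts below are placeholders, not the study's results:

```python
# Press's Q statistic compared against the chi-square critical value of
# 6.63 (df = 1, alpha = 0.01); N, n_correct are hypothetical placeholders.
N, n_correct, K = 2000, 1700, 3         # sample size, hits, number of breeds

press_q = (N - n_correct * K) ** 2 / (N * (K - 1))
print(f"Press's Q = {press_q:.1f}",
      "-> better than chance" if press_q > 6.63 else "-> not significant")
```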