2.5. Statistical analysis

George John Kastanis, Luis V. Santana‐Quintero, Maria Sanchez‐Leon, Sara Lomonaco, Eric W. Brown, Marc W. Allard

First, we compared the data from the 15 metrics on an individual instrument basis, with data expressed as means ± standard error (SE). Statistical differences between groups were analysed using Student's t test in Microsoft Excel 2010 (Microsoft, Redmond, WA, USA); p values ≤ 0.05 were considered statistically significant.
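The per-instrument comparison described above can be sketched as follows. This is an illustrative two-sample Student's t test on made-up data (the metric values and instrument labels are hypothetical, not the actual MiSeq measurements), using SciPy rather than the Excel function the authors used:

```python
# Illustrative Student's t test comparing one run metric between two
# instruments. The numbers below are synthetic stand-ins for a metric
# such as cluster density (K/mm^2); they are not data from the study.
from scipy.stats import ttest_ind

instrument_a = [1021, 998, 1054, 1012, 987, 1040]
instrument_b = [933, 951, 962, 918, 944, 957]

# equal_var=True gives the classic pooled-variance Student's t test,
# matching the test named in the text (as opposed to Welch's variant).
t_stat, p_value = ttest_ind(instrument_a, instrument_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")
if p_value <= 0.05:
    print("Difference is statistically significant at p <= 0.05")
```

In practice this comparison would be repeated for each of the 15 run metrics across the instruments of interest.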

Next, we compared the 15 run metrics across all 486 MiSeq runs and used these data to create a scree plot and a Pearson correlation matrix. We assessed the magnitude of these correlations on a scale similar to the one described by Evans (Evans, 1996), categorizing the absolute value of each Pearson correlation coefficient (r) as weak (0.00–0.49), moderate (0.50–0.79) or strong (0.80–1.00). From this matrix, a principal components analysis (PCA) was performed using OriginPro software (OriginLab Corporation, Northampton, MA, USA). PCA is an orthogonal linear transformation that preserves the variance of the raw data while reducing the number of variables, re-expressing the observations in a new coordinate system of principal components (PCs) through factor scores (Zhang & Castelló, 2017). PCA allowed us to observe which factors are at play and the extent to which they correlate with each other. To assess whether our data set was appropriate for PCA, we examined two measures of sampling adequacy. First, we applied Bartlett's sphericity test; the resulting p value was < 0.0001, allowing us to reject the null hypothesis that the variables are uncorrelated and indicating that meaningful correlations can be expected. Second, we evaluated the data set using the Kaiser–Meyer–Olkin (KMO) index, obtaining an overall score of 0.698, which provided further confidence in the suitability of our data. The PCA thus reduced our correlated variables to a smaller set of independent variables: each run metric was treated as a variable (Table 1), and each MiSeq run was treated as an observation.
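A minimal NumPy sketch of the correlation-then-PCA pipeline described above: build the Pearson correlation matrix, categorize correlation strengths on the Evans-style scale, then eigendecompose the matrix to obtain the principal components and the explained variance plotted in a scree plot. The data here are random stand-ins with the same shape as the study (486 runs × 15 metrics), not the actual MiSeq metrics, and this is not the OriginPro implementation the authors used:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(486, 15))          # 486 runs x 15 metrics (synthetic)

# 15 x 15 Pearson correlation matrix across the run metrics
R = np.corrcoef(X, rowvar=False)

# Evans-style categorization of |r|
def evans_strength(r):
    r = abs(r)
    return "weak" if r < 0.50 else "moderate" if r < 0.80 else "strong"

# PCA on the correlation matrix: eigenvalues are the variance carried by
# each principal component (the values a scree plot displays).
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]       # sort PCs by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained = eigvals / eigvals.sum()

# Factor scores: project the standardized data onto the eigenvectors
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scores = Z @ eigvecs
print("Explained variance of first 3 PCs:", explained[:3].round(3))
```

With real, correlated run metrics the first few eigenvalues would dominate, which is exactly what the scree plot is used to judge.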

To explore other possible points of comparison across runs and to find groups (clusters) of similar runs in our data, we applied a K‐means clustering algorithm (Tan, Steinbach, & Kumar, 2005). The algorithm begins by randomly initializing K cluster centroids and assigning each MiSeq run to its nearest centroid. Each centroid is then updated to the mean of the runs assigned to it, and the runs are reassigned accordingly. This process is repeated until the centroids stop moving, at which point the algorithm has converged to a local optimum (Nidheesh, Abdul Nazeer, & Ameer, 2017). We ran this analysis several times using two to five clusters, with both Euclidean (Kaya, Pehlivanli, Sekizkardes, & Ibrikci, 2017) and Mahalanobis distances (Wang, Hu, Huang, & Xu, 2008).
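The clustering step above can be sketched as a compact Lloyd's-algorithm implementation of K‐means with Euclidean distance. The data are two synthetic, well-separated point clouds standing in for run metrics; a Mahalanobis variant would replace the distance computation with a covariance-weighted one:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each observation to its nearest centroid (Euclidean)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # centroids stopped moving:
            break                         # converged to a local optimum
        centroids = new
    return labels, centroids

rng = np.random.default_rng(1)
# two synthetic "run metric" clouds, 50 observations each
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
print("cluster sizes:", np.bincount(labels, minlength=2))
```

Because K‐means only reaches a local optimum, rerunning with different seeds and different K (as the authors did with two to five clusters) is the standard way to check the stability of the grouping.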
