2.5. Statistical analysis

George John Kastanis, Luis V. Santana‐Quintero, Maria Sanchez‐Leon, Sara Lomonaco, Eric W. Brown, Marc W. Allard

First, we compared the data from the 15 metrics on an individual instrument basis, with data expressed as means ± standard error (SE). Statistical differences between groups were analysed using Student's t test in Microsoft Excel 2010 (Microsoft, Redmond, WA, USA); p values ≤ 0.05 were considered statistically significant.
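The per-instrument comparison described above can be sketched as follows. This is an illustrative two-sample Student's t test on made-up data (the metric values and instrument labels are hypothetical, not the actual MiSeq measurements), using SciPy rather than the Excel function the authors used:

```python
# Illustrative Student's t test comparing one run metric between two
# instruments. The numbers below are synthetic stand-ins for a metric
# such as cluster density (K/mm^2); they are not data from the study.
from scipy.stats import ttest_ind

instrument_a = [1021, 998, 1054, 1012, 987, 1040]
instrument_b = [933, 951, 962, 918, 944, 957]

# equal_var=True gives the classic pooled-variance Student's t test,
# matching the test named in the text (as opposed to Welch's variant).
t_stat, p_value = ttest_ind(instrument_a, instrument_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")
if p_value <= 0.05:
    print("Difference is statistically significant at p <= 0.05")
```

In practice this comparison would be repeated for each of the 15 run metrics across the instruments of interest.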

Next, we compared the 15 run metrics across all 486 MiSeq runs and used these data to create a scree plot and a Pearson correlation matrix. We assessed the magnitude of these correlations on a scale similar to the one described by Evans (Evans, 1996), categorizing the absolute value of each Pearson correlation coefficient (r) as weak (0.00–0.49), moderate (0.50–0.79) or strong (0.80–1.00). From this matrix, a principal components analysis (PCA) was performed using OriginPro software (OriginLab Corporation, Northampton, MA, USA). PCA is an orthogonal linear transformation that preserves the variance of the raw data while reducing the number of variables, re-expressing the observations in a new coordinate system of principal components (PCs) through factor scores (Zhang & Castelló, 2017). PCA allowed us to observe which factors are at play and the extent to which they correlate with each other. To assess whether our data set was appropriate for PCA, we examined two measures of sampling adequacy. First, we applied Bartlett's sphericity test; the resulting p value was < 0.0001, allowing us to reject the null hypothesis that the variables are uncorrelated and indicating that meaningful correlations can be expected. Second, we evaluated the data set using the Kaiser–Meyer–Olkin (KMO) index, obtaining an overall score of 0.698, which provided further confidence in the suitability of our data. The PCA thus reduced our correlated variables to a smaller set of independent variables: each run metric was treated as a variable (Table 1), and each MiSeq run was treated as an observation.
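A minimal NumPy sketch of the correlation-then-PCA pipeline described above: build the Pearson correlation matrix, categorize correlation strengths on the Evans-style scale, then eigendecompose the matrix to obtain the principal components and the explained variance plotted in a scree plot. The data here are random stand-ins with the same shape as the study (486 runs × 15 metrics), not the actual MiSeq metrics, and this is not the OriginPro implementation the authors used:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(486, 15))          # 486 runs x 15 metrics (synthetic)

# 15 x 15 Pearson correlation matrix across the run metrics
R = np.corrcoef(X, rowvar=False)

# Evans-style categorization of |r|
def evans_strength(r):
    r = abs(r)
    return "weak" if r < 0.50 else "moderate" if r < 0.80 else "strong"

# PCA on the correlation matrix: eigenvalues are the variance carried by
# each principal component (the values a scree plot displays).
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]       # sort PCs by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained = eigvals / eigvals.sum()

# Factor scores: project the standardized data onto the eigenvectors
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scores = Z @ eigvecs
print("Explained variance of first 3 PCs:", explained[:3].round(3))
```

With real, correlated run metrics the first few eigenvalues would dominate, which is exactly what the scree plot is used to judge.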

To explore other possible points of comparison across runs and to find groups (clusters) of similar runs in our data, we applied a K‐means clustering algorithm (Tan, Steinbach, & Kumar, 2005). The algorithm begins by randomly initializing K cluster centroids and assigning each MiSeq run to its nearest centroid. Each centroid is then updated to the mean of the runs assigned to it, and the runs are reassigned accordingly. This process is repeated until the centroids stop moving, at which point the algorithm has converged to a local optimum (Nidheesh, Abdul Nazeer, & Ameer, 2017). We ran this analysis several times using two to five clusters, with both Euclidean (Kaya, Pehlivanli, Sekizkardes, & Ibrikci, 2017) and Mahalanobis distances (Wang, Hu, Huang, & Xu, 2008).
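The clustering step above can be sketched as a compact Lloyd's-algorithm implementation of K‐means with Euclidean distance. The data are two synthetic, well-separated point clouds standing in for run metrics; a Mahalanobis variant would replace the distance computation with a covariance-weighted one:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each observation to its nearest centroid (Euclidean)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # centroids stopped moving:
            break                         # converged to a local optimum
        centroids = new
    return labels, centroids

rng = np.random.default_rng(1)
# two synthetic "run metric" clouds, 50 observations each
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
print("cluster sizes:", np.bincount(labels, minlength=2))
```

Because K‐means only reaches a local optimum, rerunning with different seeds and different K (as the authors did with two to five clusters) is the standard way to check the stability of the grouping.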
