Preprocessing23 and data analysis were carried out using MATLAB 2014b (The MathWorks, MA). Data analysis was performed using three MATLAB toolboxes: PLS Toolbox version 7.9.3 for preprocessing (Eigenvector Research, Inc.), GA-LDA for feature selection (available online), and the Classification Toolbox for MATLAB for graphical outputs of an LDA algorithm (available online). The spectra were preprocessed by truncating to the fingerprint region (1800–900 cm⁻¹), followed by Savitzky–Golay smoothing (9-point window, 2nd-order polynomial fitting), automatic weighted least-squares baseline correction, and vector normalization. The triplicate spectra per sample were averaged before model construction.

For exploratory data analysis, the preprocessed spectral data were mean-centered and evaluated by principal component analysis (PCA).25 PCA is an unsupervised technique that reduces the spectral data space to principal components (PCs) responsible for the majority of the variance in the original data set. The PCs are mutually orthogonal: the first PC accounts for the maximum explained variance, followed by the second PC, and so on. The PCs are composed of scores and loadings. The former represent the variance in the sample direction and are thus used to assess similarities and dissimilarities among the samples; the latter represent the contribution of each variable to the model decomposition and are thus used to find important spectral markers. The technique looks for inherent similarities and differences and provides a scores matrix representing the overall "identity" of each sample, a loadings matrix representing the spectral profile in each PC, and a residual matrix containing the unexplained data. Score information can therefore be used for exploratory analysis, revealing possible separation between data classes.
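The preprocessing chain described at the start of this section (truncation to 1800–900 cm⁻¹, 9-point 2nd-order Savitzky–Golay smoothing, vector normalization, averaging of triplicates) can be illustrated with a minimal Python/NumPy sketch. This is not the PLS Toolbox code used in the study. The function names, the edge-padding at the window boundaries, and the assumption that replicate spectra occupy consecutive rows are all illustrative choices, and the automatic weighted least-squares baseline correction is deliberately left out because its settings are not given in the text.

```python
import numpy as np

def savgol_coeffs(window=9, order=2):
    """Smoothing weights: least-squares polynomial fit evaluated at the window centre."""
    half = window // 2
    x = np.arange(-half, half + 1, dtype=float)
    A = np.vander(x, order + 1, increasing=True)  # columns: 1, x, x^2, ...
    return np.linalg.pinv(A)[0]                   # row that yields the fitted value at x = 0

def preprocess(wavenumbers, spectra, lo=900.0, hi=1800.0):
    """Truncate to [lo, hi] cm^-1, smooth each row, then scale each row to unit length."""
    keep = (wavenumbers >= lo) & (wavenumbers <= hi)
    wn, X = wavenumbers[keep], spectra[:, keep]
    w = savgol_coeffs()
    half = len(w) // 2
    # Assumption: edges are padded by repeating the end values so the length is preserved.
    padded = np.pad(X, ((0, 0), (half, half)), mode="edge")
    smooth = np.stack([np.convolve(row, w[::-1], mode="valid") for row in padded])
    # Baseline correction would be applied here; it is omitted in this sketch.
    norms = np.linalg.norm(smooth, axis=1, keepdims=True)
    return wn, smooth / norms

def average_replicates(X, n_rep=3):
    """Average consecutive groups of n_rep rows (assumes replicates are stored together)."""
    return X.reshape(-1, n_rep, X.shape[1]).mean(axis=1)
```

For a spectra matrix with one row per replicate, `preprocess` followed by `average_replicates` returns one preprocessed spectrum per sample.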

PCA was the method of choice for analyzing saliva samples spiked with an inactivated virus particle. It is simple, fast, and combines exploratory analysis, data reduction, and feature extraction into one single method. PCA scores were used to explore overall data set variance and any clustering related to the limit of detection, while the loadings on the first two PCs were used to derive specific biomarkers indicative of the infection category.
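The decomposition into scores, loadings, and residuals described above can be sketched as follows. This is an illustrative Python/NumPy version based on the singular value decomposition, not the MATLAB routine used in the study; the function name `pca` and its return layout are assumptions made for the example.

```python
import numpy as np

def pca(X, n_components=2):
    """Mean-centre X (samples x variables) and decompose it with the SVD."""
    Xc = X - X.mean(axis=0)                           # mean-centering
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # sample coordinates on each PC
    loadings = Vt[:n_components].T                    # variable contributions to each PC
    residual = Xc - scores @ loadings.T               # variance left unexplained
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return scores, loadings, residual, explained
```

Plotting the first column of `scores` against the second gives the usual scores plot for inspecting clustering, while each column of `loadings` can be plotted against wavenumber to look for candidate spectral markers.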

A genetic algorithm (GA) is a variable selection technique that reduces the spectral data space to a few variables by subjecting the data to a simulated evolutionary process.26,27 The original variable space is maintained, and no transformation is made as in PCA. Therefore, the selected variables have the same meaning as the original ones (i.e., wavenumbers), and they indicate the spectral regions where the classes being analyzed differ most or, in other words, where the chemical changes occur.

For all classification models, samples were divided into training (n = 50 designated negative and 50 designated positive for COVID-19 infection based on symptoms and RT-PCR; see Tables S2 and S3) and validation (n = 61 designated negative and 20 designated positive for COVID-19 infection based on symptoms and RT-PCR) sets by applying the Kennard–Stone (KS) uniform sampling selection algorithm.23 The training samples were used in the modeling procedure, whereas the validation (prediction) set was used only in the final classification evaluation with the LDA discriminant approach. The optimal number of variables for GA was determined using the average risk G of LDA misclassification as the cost function. This cost function is calculated in a subset of the training set as

$$G = \frac{1}{N_V} \sum_{n=1}^{N_V} g_n$$

where N_V is the number of samples in that subset and g_n is defined as

$$g_n = \frac{r^2\left(x_n, m_{I(n)}\right)}{\min_{I(m) \neq I(n)} r^2\left(x_n, m_{I(m)}\right)}$$

where the numerator is the squared Mahalanobis distance between the object x_n and the sample mean m_I(n) of its true class, and the denominator is the squared Mahalanobis distance between the object x_n and the mean of the closest wrong class.25,28
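As a rough illustration of the average risk G described above, the following Python/NumPy sketch computes, for each object in an evaluation subset, the ratio of its squared Mahalanobis distance to its true class mean over that to the closest wrong class mean, and then averages the ratios. The use of a pooled within-class covariance matrix estimated from the training samples is an assumption (the text does not say how the covariance is estimated), and the function name `lda_risk` is hypothetical.

```python
import numpy as np

def lda_risk(X_train, y_train, X_val, y_val):
    """Average risk G: mean ratio of squared Mahalanobis distances (true class / closest wrong class)."""
    classes = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    # Assumption: one pooled within-class covariance matrix shared by all classes.
    centered = X_train - means[np.searchsorted(classes, y_train)]
    S = centered.T @ centered / (len(X_train) - len(classes))
    S_inv = np.linalg.pinv(S)
    diff = X_val[:, None, :] - means[None, :, :]          # shape (n_val, n_classes, n_vars)
    d2 = np.einsum("nkj,jl,nkl->nk", diff, S_inv, diff)   # squared distance to every class mean
    rows = np.arange(len(X_val))
    true_idx = np.searchsorted(classes, y_val)
    numerator = d2[rows, true_idx]                        # distance to the true class mean
    wrong = d2.copy()
    wrong[rows, true_idx] = np.inf                        # exclude the true class
    denominator = wrong.min(axis=1)                       # distance to the closest wrong class mean
    return float(np.mean(numerator / denominator))
```

Values of G well below 1 indicate that objects lie much closer to their own class mean than to any other, so a variable subset that lowers G is preferred.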

The GA was run for 100 generations with 200 chromosomes each. One-point crossover and mutation probabilities were set to 60% and 10%, respectively. Because GA is a nondeterministic algorithm, repeated runs on the same data and model can give different results. Therefore, the algorithm was repeated three times, starting from random initial populations, and the best solution from the three realizations was employed.
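The GA settings above (100 generations, 200 chromosomes, 60% one-point crossover, 10% mutation, best of three runs) can be sketched in Python as follows. This is not the GA-LDA toolbox implementation. The binary-mask encoding, the tournament selection, the per-chromosome single-bit mutation, and the toy cost function are all assumptions made for illustration; in the study the cost would be the LDA risk G evaluated on the selected wavenumbers.

```python
import numpy as np

def run_ga(cost, n_vars, n_chrom=200, n_gen=100, p_cross=0.6, p_mut=0.1, rng=None):
    """Minimise cost(mask) over boolean variable masks with a simple GA."""
    rng = np.random.default_rng(rng)
    pop = rng.random((n_chrom, n_vars)) < 0.5           # random initial population
    pop[~pop.any(axis=1), 0] = True                     # avoid empty variable subsets
    best_mask, best_cost = None, np.inf
    for _ in range(n_gen):
        costs = np.array([cost(c) for c in pop])
        i = costs.argmin()
        if costs[i] < best_cost:                        # remember the best solution seen so far
            best_mask, best_cost = pop[i].copy(), costs[i]
        # Assumption: binary tournament selection of parents.
        a, b = rng.integers(n_chrom, size=(2, n_chrom))
        parents = pop[np.where(costs[a] <= costs[b], a, b)]
        children = parents.copy()
        for j in range(0, n_chrom - 1, 2):              # one-point crossover on consecutive pairs
            if rng.random() < p_cross:
                cut = rng.integers(1, n_vars)
                children[j, cut:] = parents[j + 1, cut:]
                children[j + 1, cut:] = parents[j, cut:]
        for j in range(n_chrom):                        # mutation: flip one random gene
            if rng.random() < p_mut:
                k = rng.integers(n_vars)
                children[j, k] = ~children[j, k]
        children[~children.any(axis=1), 0] = True
        pop = children
    return best_mask, best_cost

def best_of_runs(cost, n_vars, n_runs=3, seed=0, **kwargs):
    """Repeat the GA from different random starts and keep the lowest-cost result."""
    results = [run_ga(cost, n_vars, rng=seed + r, **kwargs) for r in range(n_runs)]
    return min(results, key=lambda r: r[1])

# Toy cost (hypothetical): number of positions that differ from a hidden target mask.
target = np.zeros(20, dtype=bool)
target[[3, 7, 11]] = True
toy_cost = lambda mask: int(np.sum(mask != target))
mask, value = best_of_runs(toy_cost, n_vars=20)
print(np.flatnonzero(mask), value)
```

Fixing the seed makes a given run reproducible, but different seeds can still return different subsets, which is why the best of several realizations is kept.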

Sensitivity (the probability that a test result will be positive when the disease is present) and specificity (the probability that a test result will be negative when the disease is not present) were given by the following equations

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

where TP is defined as true positive, FN as false negative, TN as true negative, and FP as false positive.
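A short Python sketch of how these two figures of merit can be computed from reference and predicted labels is given below. The function name and the label coding (1 for positive, 0 for negative) are assumptions for the example, and the labels shown are made up rather than taken from the study.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) as fractions from reference and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Made-up labels: 4 positive and 4 negative samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)  # prints 0.75 0.75
```

The function divides by zero if one of the classes is absent from `y_true`, so both classes must be represented in the evaluated set.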
