Data Preprocessing and Analysis

Maneesh N. Singh, Leonardo Leal Barbosa, Wena Dantas Marcarini, Paula Frizera Vassallo, Jose Geraldo Mill, Rodrigo Ribeiro-Rodrigues, Luciene C. G. Campos, Patrick H. Warnke

This protocol is extracted from research article:

Ultrarapid On-Site Detection of SARS-CoV-2 Infection Using Simple ATR-FTIR Spectroscopy and an Analysis Algorithm: High Sensitivity and Specificity

**Anal Chem**, Jan 22, 2021; DOI: 10.1021/acs.analchem.0c04608

Procedure

Preprocessing^{23} and data analysis were carried out in MATLAB 2014b (The MathWorks, MA) using three MATLAB toolboxes: PLS Toolbox version 7.9.3 for preprocessing (Eigenvector Research, Inc.), GA-LDA for feature selection (available at https://doi.org/10.6084/m9.figshare.3479003.v1), and the Classification Toolbox for MATLAB for graphical outputs of an LDA algorithm (available at https://michem.unimib.it/download/matlab-toolboxes/classification-toolbox-for-matlab/).^{24} The spectra were preprocessed by truncating to the fingerprint region (1800–900 cm^{–1}), followed by Savitzky–Golay smoothing (9-point window, second-order polynomial fitting), automatic weighted least-squares baseline correction, and vector normalization. The triplicate spectra per sample were averaged before model construction.

For exploratory data analysis, the preprocessed spectra were mean-centered and evaluated by principal component analysis (PCA).^{25} PCA is an unsupervised technique that reduces the spectral data space to principal components (PCs) that capture the majority of the variance in the original data set. The PCs are mutually orthogonal: the first PC accounts for the maximum explained variance, followed by the second PC, and so on. Each PC is composed of scores and loadings. The scores represent the variance in the sample direction and are used to assess similarities and dissimilarities among the samples; the loadings represent the contribution of each variable to the model decomposition and are used to find important spectral markers. The technique thus provides a scores matrix representing the overall “identity” of each sample, a loadings matrix representing the spectral profile in each PC, and a residual matrix containing the unexplained data. The scores can be used for exploratory analysis and may reveal a possible separation between data classes.
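The preprocessing chain described above can be sketched in a few lines. The study itself used PLS Toolbox in MATLAB; the Python/NumPy version below is only an illustrative approximation, and the function names (`savgol_weights`, `preprocess`, `average_replicates`) and the synthetic spectra are assumptions made for this example. The automatic weighted least-squares baseline correction is specific to PLS Toolbox and is left out of the sketch.

```python
import numpy as np

def savgol_weights(window=9, order=2):
    """Least-squares Savitzky-Golay smoothing weights for the window centre."""
    half = window // 2
    x = np.arange(-half, half + 1)
    A = np.vander(x, order + 1, increasing=True)  # columns: 1, x, x^2
    # Row 0 of the pseudo-inverse evaluates the fitted polynomial at x = 0.
    return np.linalg.pinv(A)[0]

def preprocess(wavenumbers, spectra, lo=900.0, hi=1800.0):
    """Truncate to [lo, hi], smooth and vector-normalise (rows = spectra)."""
    keep = (wavenumbers >= lo) & (wavenumbers <= hi)
    wn, X = wavenumbers[keep], spectra[:, keep]
    w = savgol_weights(9, 2)
    # 'valid' convolution drops 4 points at each edge instead of padding them.
    X = np.array([np.convolve(row, w, mode="valid") for row in X])
    wn = wn[4:-4]
    # Baseline correction would go here; it is omitted in this sketch.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # vector normalisation
    return wn, X

def average_replicates(X, n_rep=3):
    """Average consecutive groups of n_rep rows (replicates of one sample)."""
    return X.reshape(-1, n_rep, X.shape[1]).mean(axis=1)

# Synthetic demo data (hypothetical): 2 samples x 3 replicates, 600-4000 cm^-1.
rng = np.random.default_rng(0)
wavenumbers = np.arange(600.0, 4002.0, 2.0)
spectra = 1.0 + 0.1 * rng.random((6, wavenumbers.size))

wn, X = preprocess(wavenumbers, spectra)
X_mean = average_replicates(X, n_rep=3)
```

The smoothing weights are symmetric, so the convolution needs no kernel reversal. Trimming the window edges rather than padding them is a design choice of this sketch, not something stated in the protocol.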

PCA was the method of choice for analyzing saliva samples spiked with an inactivated virus particle. It is simple and fast, and it combines exploratory analysis, data reduction, and feature extraction in a single method. PCA scores were used to explore overall data set variance and any clustering related to the limit of detection, while the loadings on the first two PCs were used to derive specific biomarkers indicative of the infection category.
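As a rough illustration of the decomposition into scores, loadings, and residuals, PCA of mean-centered data can be computed with a singular value decomposition. This is a generic NumPy sketch, not the PLS Toolbox routine used in the study; the function name `pca` and the random demo matrix are assumptions.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via SVD of the mean-centred data matrix (rows = samples)."""
    Xc = X - X.mean(axis=0)                         # mean-centring
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # sample coordinates
    loadings = Vt[:n_components].T                   # variable contributions
    explained = (s ** 2) / np.sum(s ** 2)            # variance fraction per PC
    residual = Xc - scores @ loadings.T              # unexplained data
    return scores, loadings, explained[:n_components], residual

# Hypothetical demo matrix: 20 "spectra" with 50 variables each.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 50))
scores, loadings, explained, residual = pca(X_demo, n_components=2)
```

Plotting the first column of `scores` against the second gives the usual scores plot for spotting clusters, and each column of `loadings` can be plotted against wavenumber to look for candidate spectral markers.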

A genetic algorithm (GA) is a variable selection technique that reduces the spectral data space to a few variables by simulating an evolutionary process on the data.^{26,27} Unlike PCA, GA keeps the original variable space and applies no transformation. The selected variables therefore have the same meaning as the original ones (i.e., wavenumbers), and they mark the regions where the classes being analyzed differ most or, in other words, where the chemical changes occur.

For all classification models, samples were divided into training (50 designated negative and 50 designated positive for COVID-19 infection based on symptoms and RT-PCR; see Tables S2 and S3) and validation (*n* = 61 designated negative and 20 designated positive for COVID-19 infection based on symptoms and RT-PCR) sets by applying the Kennard–Stone (KS) uniform sampling selection algorithm.^{23} The training samples were used in the modeling procedure, whereas the validation (prediction) set was used only in the final classification evaluation with the LDA discriminant approach. The optimal number of variables for GA was determined with an average risk *G* of LDA misclassification.
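Such a cost function is calculated in a subset of the training set as

$$G = \frac{1}{N_V} \sum_{n=1}^{N_V} g_n$$

where *N_{V}* is the number of samples in that subset and *g_{n}* is defined as

$$g_n = \frac{r^2\left(x_n, m_{I(n)}\right)}{\min_{I(m) \neq I(n)} r^2\left(x_n, m_{I(m)}\right)}$$

where the numerator is the squared Mahalanobis distance between the object *x_{n}* and the sample mean *m_{I(n)}* of its true class, and the denominator is the squared Mahalanobis distance between *x_{n}* and the mean *m_{I(m)}* of the closest wrong class.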
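The Kennard–Stone split and the average risk *G* can both be sketched briefly. The code below is a minimal NumPy illustration, not the toolbox code used in the study. It assumes the form of *G* commonly used with GA-LDA: for each object, the squared Mahalanobis distance to its true class mean divided by that to the nearest wrong class mean, averaged over a validation subset. Using a pooled within-class covariance, applying KS within each class, and the synthetic two-class data are all assumptions of this example.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select rows chosen by the Kennard-Stone algorithm."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(d), d.shape)  # two most distant points
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select:
        # Distance from each remaining point to its nearest selected point.
        nearest = d[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

def average_risk(X_train, y_train, X_val, y_val):
    """Average risk G of LDA misclassification on a validation subset."""
    classes = np.unique(y_train)
    means = {c: X_train[y_train == c].mean(axis=0) for c in classes}
    # Pooled within-class covariance (assumed, as in standard LDA).
    centred = np.vstack([X_train[y_train == c] - means[c] for c in classes])
    cov_inv = np.linalg.pinv(centred.T @ centred / (len(X_train) - len(classes)))

    def r2(x, m):  # squared Mahalanobis distance
        diff = x - m
        return float(diff @ cov_inv @ diff)

    g = [r2(x, means[c]) / min(r2(x, means[k]) for k in classes if k != c)
         for x, c in zip(X_val, y_val)]
    return float(np.mean(g))

# Synthetic two-class data (hypothetical), split per class with Kennard-Stone.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(6.0, 1.0, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

train_idx = []
for c in (0, 1):
    idx = np.flatnonzero(y == c)
    train_idx += [int(idx[k]) for k in kennard_stone(X[idx], 20)]
val_idx = [k for k in range(len(X)) if k not in train_idx]

G = average_risk(X[train_idx], y[train_idx], X[val_idx], y[val_idx])
```

Values of `G` below 1 mean that, on average, validation objects lie closer to their own class mean than to the nearest wrong one, so a lower `G` indicates a better subset of variables.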

The GA calculations were run for 100 generations with 200 chromosomes each. One-point crossover and mutation probabilities were set to 60 and 10%, respectively. GA is a nondeterministic algorithm, so repeated runs on the same model can give different results. Therefore, the algorithm was repeated three times, starting from random initial populations, and the best solution from the three realizations of GA was employed.
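To make these settings concrete, the sketch below runs a small genetic algorithm over binary variable masks with 200 chromosomes, 100 generations, one-point crossover at 60%, mutation at 10%, and three independent runs. It is not the GA-LDA toolbox code. The binary tournament selection, the single-bit mutation applied per chromosome, the fixed seeds, and the toy cost function (distance from a hypothetical set of informative variables) are all assumptions. In the actual workflow the cost would be the average risk *G*.

```python
import random

def run_ga(cost, n_vars, pop_size=200, generations=100,
           p_cross=0.6, p_mut=0.1, rng=None):
    """Minimise cost(mask) over binary masks of length n_vars."""
    rng = rng or random.Random()
    pop = [[rng.randint(0, 1) for _ in range(n_vars)] for _ in range(pop_size)]
    best = min(pop, key=cost)

    for _ in range(generations):
        def pick():
            # Binary tournament selection (an assumption of this sketch).
            a, b = rng.sample(pop, 2)
            return a if cost(a) <= cost(b) else b

        children = []
        while len(children) < pop_size:
            p1, p2 = pick()[:], pick()[:]
            if rng.random() < p_cross:               # one-point crossover
                cut = rng.randint(1, n_vars - 1)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                if rng.random() < p_mut:             # flip one random gene
                    i = rng.randrange(n_vars)
                    child[i] = 1 - child[i]
                children.append(child)
        pop = children[:pop_size]

        gen_best = min(pop, key=cost)
        if cost(gen_best) < cost(best):              # keep the best-ever mask
            best = gen_best
    return best

# Toy cost (hypothetical): mismatches against a set of "informative" variables.
TARGET = {3, 7, 12}

def toy_cost(mask):
    return sum(bit != (i in TARGET) for i, bit in enumerate(mask))

# Three independent runs; seeds are fixed only to make the sketch repeatable.
runs = [run_ga(toy_cost, n_vars=20, rng=random.Random(seed)) for seed in (1, 2, 3)]
best = min(runs, key=toy_cost)
selected_variables = [i for i, bit in enumerate(best) if bit]
```

Because each run starts from a different random population, the three results can differ, which is why the best of the three is kept.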

Sensitivity (the probability that a test result will be positive when the disease is present) and specificity (the probability that a test result will be negative when the disease is not present) were given by the following equations

$$\text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

$$\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}$$

where TP is defined as true positive, FN as false negative, TN as true negative, and FP as false positive.
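These two figures of merit can be computed directly from the reference and predicted labels. The short Python sketch below returns them as fractions; the function name and the example labels are hypothetical and are not results from the study.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) as fractions between 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

In this example there are 3 true positives, 1 false negative, 5 true negatives, and 1 false positive, so the sensitivity is 3/4 and the specificity is 5/6.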

This article is made available via the PMC Open Access Subset for unrestricted RESEARCH re-use and analyses in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
