Feature selection

This protocol is extracted from research article:

A machine learning-based framework for diagnosis of COVID-19 from chest X-ray images

**Interdiscip Sci**, Jan 2, 2021; DOI: 10.1007/s12539-020-00403-6

Procedure

Feature selection plays a crucial role in representing raw data and is among the most active research topics in computer vision and machine learning. Its main aim is to obtain highly discriminative features from the raw data that can improve the accuracy of the classifier.

The rapid growth in dataset size has driven the development of various dimensionality reduction techniques to boost the performance of data mining and classification systems. Feature selection and extraction methods developed for this purpose include random-forest feature selection, principal component analysis (PCA), linear discriminant analysis (LDA), forward feature selection and backward feature elimination. We employed PCA as the feature extraction technique because of its simplicity and efficiency, and because it is one of the oldest and most widely used multivariate techniques.

PCA is a multivariate statistical procedure that analyzes dependent, inter-correlated variables in the original dataset and extracts the important information by transforming them into a new set of orthogonal variables called principal components [31]. Each principal component carries a certain variance, and the first carries the highest. The amount of information retained depends strongly on how many principal components are selected; choosing an appropriate number therefore reduces the dimensionality of the data while retaining most of the information.
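As a rough illustration of how the number of components might be chosen, the sketch below picks the smallest number of leading components whose eigenvalues reach a target share of the total variance. The function name `components_for_variance`, the 0.90 target and the example eigenvalues are hypothetical and do not come from the article.

```python
import numpy as np

def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading components whose eigenvalues reach `target` of the total."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # largest variance first
    cumulative = np.cumsum(vals) / vals.sum()                   # cumulative variance fraction
    # Index of the first entry that reaches the target (small tolerance for rounding),
    # converted from a zero-based index to a count.
    return int(np.searchsorted(cumulative, target - 1e-12) + 1)

# Hypothetical eigenvalues, chosen only to keep the arithmetic easy to follow:
# cumulative fractions are 0.5, 0.8, 0.9, 0.95, 1.0.
example_eigenvalues = [5.0, 3.0, 1.0, 0.5, 0.5]
n_keep = components_for_variance(example_eigenvalues, target=0.90)
print(n_keep)  # 3
```

A threshold on cumulative variance is only one possible selection rule; the article does not state which criterion the authors applied.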

PCA reduces the 2-D image matrix X(N, M), where N is the number of instances and M is the total number of pixels after masking, with N < M, to a smaller matrix Z(N, L), where L < M is the number of retained dimensions. It does so with a linear transformation U(M, L) while retaining much of the information in the data [31, 32].
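The shapes involved can be checked with a small NumPy sketch. The sizes, the random stand-in data and the explicit centering step are assumptions made for illustration. The economy-size SVD is used here because, with N < M, it avoids forming an M x M matrix; the article does not say which numerical routine was used.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L = 20, 500, 5                 # hypothetical sizes with L < N < M
X = rng.normal(size=(N, M))          # stand-in for N flattened, masked images

Xc = X - X.mean(axis=0)              # center every pixel column (assumed preprocessing)
# Economy SVD: Vt has min(N, M) rows, so the M x M covariance is never formed.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
U = Vt[:L].T                         # U(M, L): top-L principal directions as columns
Z = Xc @ U                           # Z(N, L): reduced representation

print(X.shape, U.shape, Z.shape)     # (20, 500) (500, 5) (20, 5)
```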

It calculates the covariance matrix S(L, L) to represent this information.

Maximizing the variance of the transformed data yields eigenvector equations in which the Lagrange multiplier, λ, plays the role of the eigenvalue. Matrix diagonalization then factors the covariance matrix into a product of three matrices, $P D P^{-1}$, where *D* is the diagonal matrix of eigenvalues and P is the matrix of eigenvectors. The sum of the eigenvalues therefore equals the entire variance of the transformation.

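When the data are projected onto the top L of these M eigenvectors, the variance retained is the sum of the corresponding L eigenvalues.

Hence, the amount of information retained is expressed as a percentage of the original: the retained variance divided by the total variance, multiplied by 100.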
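The display equations that accompany this passage are not reproduced here. The block below is a reconstruction from the surrounding definitions, not the article's own typesetting. It assumes that X is mean-centered, that covariances use the 1/(N-1) normalization, and that the eigenvalues are ordered from largest to smallest. The text labels the covariance matrix S(L, L), but the eigenvalue sums run to M, so the diagonalization is written for C, a symbol introduced here for the M x M covariance of the centered data.

$$
\begin{aligned}
Z &= X U, \\
S &= \frac{1}{N-1}\, Z^{\mathsf T} Z = U^{\mathsf T} C\, U,
  \qquad C = \frac{1}{N-1}\, X^{\mathsf T} X, \\
C &= P D P^{\mathsf T},
  \qquad D = \operatorname{diag}(\lambda_1, \dots, \lambda_M),
  \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_M, \\
\text{total variance} &= \sum_{i=1}^{M} \lambda_i,
  \qquad \text{retained variance} = \sum_{i=1}^{L} \lambda_i, \\
\text{retained information (\%)} &= 100 \times
  \frac{\sum_{i=1}^{L} \lambda_i}{\sum_{i=1}^{M} \lambda_i}.
\end{aligned}
$$

When U holds the top L eigenvectors (the first L columns of P), S reduces to $\operatorname{diag}(\lambda_1, \dots, \lambda_L)$, which is why the retained variance is a partial sum of eigenvalues.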

Following this derivation, PCA first calculates the mean of every dimension of the whole dataset, computes the covariance matrix and determines the eigenvector and eigenvalue pairs using matrix diagonalization. It then sorts these pairs in decreasing order of eigenvalue. Since each eigenvalue is proportional to the variance retained, selecting the top L pairs retains most of the information while using only a fraction of the original dimensions.
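The steps listed above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `pca_eig`, the synthetic data and the choice of L = 2 are hypothetical, and the example uses more samples than dimensions for convenience even though the article works with N < M.

```python
import numpy as np

def pca_eig(X, L):
    """PCA by eigen-decomposition of the covariance matrix; returns (Z, U, retained_pct)."""
    mean = X.mean(axis=0)                     # mean of every dimension
    Xc = X - mean                             # center the data
    C = np.cov(Xc, rowvar=False)              # M x M covariance, 1/(N-1) normalization
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh handles symmetric matrices
    order = np.argsort(eigvals)[::-1]         # sort pairs by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    U = eigvecs[:, :L]                        # top-L eigenvectors as columns
    Z = Xc @ U                                # project onto the top-L directions
    retained_pct = 100.0 * eigvals[:L].sum() / eigvals.sum()
    return Z, U, retained_pct

rng = np.random.default_rng(42)
# Hypothetical data: 200 samples, 6 dimensions with very different scales.
X = rng.normal(size=(200, 6)) * np.array([10.0, 5.0, 1.0, 0.5, 0.2, 0.1])
Z, U, pct = pca_eig(X, L=2)
print(Z.shape, round(pct, 1))
```

Because the two largest scales dominate the synthetic data, two components should retain nearly all of the variance here; real image data would typically need more.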

This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
