Dimensional reduction (DR) aims to project high-dimensional data, e.g. spectra measured over a large number of wavelength bins, onto a lower-dimensional space. In this work, we used both a conventional method, Principal Component Analysis (PCA), and a novel method, Signal Dissection by Correlation Maximization (SDCM), to perform DR on our data.
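For the PCA part, a minimal sketch of such a projection is shown below; it assumes scikit-learn, and the array shapes and the choice of 10 components are illustrative placeholders rather than settings used in this work.

```python
# Minimal PCA sketch (assumes scikit-learn); `spectra`, its shape, and the choice
# of 10 components are illustrative placeholders, not values from this work.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
spectra = rng.normal(size=(500, 2048))   # 500 spectra, 2048 wavelength bins (toy data)

pca = PCA(n_components=10)
scores = pca.fit_transform(spectra)      # reduced representation, shape (500, 10)
print(scores.shape, pca.explained_variance_ratio_[:3])
```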
SDCM is an unsupervised algorithm for the detection of superposed correlations in high-dimensional data sets [38]. Conceptually, it can be thought of as an extension of PCA to non-orthogonal axes of correlation, where instead of projecting out detected dimensions, the discovered axes of correlation are iteratively subtracted (dissected) from the data. Originally developed in bioinformatics for the clustering of gene expression data, it can be applied generically to any high-dimensional data containing (overlapping) subspaces of correlated measurements.
We denote by $\mathbb{R}^{n \times m}$ the set of real-valued $n \times m$ matrices, where $n$ is the number of features in the data and $m$ the number of measurements. The row vectors and the column vectors of such a matrix belong to different vector spaces, referred to as feature space and measurement space, respectively.
The main assumption of SDCM is that the input data, $X \in \mathbb{R}^{n \times m}$, is a superposition of submatrices $X_k$ (also called signatures) and residual noise $E$, i.e. $X = \sum_k X_k + E$. We interpret each $X_k$ as a physically meaningful hypothesis in the data, e.g. a common physical or chemical property due to which some samples and features are correlated. As superposition is a non-bijective operation, we need further conditions to dissect $X$ into the separate $X_k$. We assume that each $X_k$ is bimonotonic, i.e. that there exists an ordering of the row indices and an ordering of the column indices such that the reordered matrix is monotonic along all rows and columns. Thus, after reordering, the correlations follow monotonic curves in both feature and measurement space. While this bimonotonicity requirement restricts the applicability of the algorithm, it allows an unambiguous dissection of $X$ into its components $X_k$. In contrast to PCA, it also allows the detection of non-linear (bi)monotonic correlations, whose axes are non-orthogonal.
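To make the assumed data model concrete, the following toy sketch constructs a matrix as a superposition of bimonotonic signatures plus noise; the outer-product construction, sizes, and noise level are illustrative assumptions and do not reproduce the signatures of this study.

```python
# Toy illustration of the assumed model X = sum_k X_k + E: each signature X_k is
# built as an outer product of two monotone profiles, which is bimonotonic by
# construction. Sizes, profiles, and noise level are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_features, n_measurements = 200, 50

def toy_signature(n, m, rng):
    f = np.sort(rng.uniform(0, 1, size=n))   # monotone profile over features
    g = np.sort(rng.uniform(0, 1, size=m))   # monotone profile over measurements
    return np.outer(f, g)                    # monotonic along every row and column

X = sum(toy_signature(n_features, n_measurements, rng) for _ in range(3))
X = X + 0.05 * rng.normal(size=X.shape)      # residual noise E
```

In real data, the rows and columns of each signature would first have to be reordered to reveal this monotonic structure.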
SDCM dissects the data in four steps:
These four steps are performed iteratively until no further axes of correlation can be found. SDCM treats rows and columns completely symmetrically: each feature and each sample is assigned a strength value s and a weight value w for every signature. The strength value (in units of the input data) quantifies the position along the eigensignal. The weight quantifies how strongly the feature or sample participates in the signature, i.e. how close to the eigensignal it is. Typically, the number of detected signatures is orders of magnitude smaller than the number of input features, which yields an effective DR of the data.
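The overall dissect-and-subtract structure, together with the per-signature strength and weight outputs, can be sketched as follows. Note that the rank-one SVD fit and the stopping rule below are simple stand-ins for SDCM's correlation maximization and bimonotonic regression, so this is not the authors' algorithm, only a schematic of the loop.

```python
# Schematic of the iterative "detect -> fit -> record strengths/weights -> subtract"
# loop described above. The rank-one SVD fit and the stopping rule are simple
# stand-ins for SDCM's correlation maximization and bimonotonic regression; only
# the loop structure and the per-signature outputs are meant to mirror the text.
import numpy as np

def dissect(X, max_signatures=10, min_explained=1e-2):
    residual = X.astype(float).copy()
    signatures = []
    for _ in range(max_signatures):
        u, s, vt = np.linalg.svd(residual, full_matrices=False)
        if s[0] ** 2 < min_explained * np.sum(s ** 2):
            break                                  # no further axis of correlation found
        eigensignal = s[0] * np.outer(u[:, 0], vt[0])
        signatures.append({
            # crude proxies for SDCM's strength values (carry the units of the input data)
            "feature_strengths": s[0] * u[:, 0],
            "sample_strengths": s[0] * vt[0],
            # crude proxies for SDCM's weights (participation of each feature/sample)
            "feature_weights": np.abs(u[:, 0]),
            "sample_weights": np.abs(vt[0]),
        })
        residual -= eigensignal                    # dissect the signature from the data
    return signatures, residual
```

Applied to a toy matrix like the one constructed above, this skeleton would recover a small number of rank-one components and leave a noise-like residual; SDCM itself replaces each placeholder step with its own, more refined procedures.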