Dimensional reduction and SDCM

Benjamin Lotter, Srumika Konde, Johnny Nguyen, Michael Grau, Martin Koch, Peter Lenz

Dimensional reduction (DR) aims to project high-dimensional data, e.g. spectra measured over a large number of wavelength bins, onto a lower-dimensional space. In this work, we used both a conventional method, Principal Component Analysis (PCA), and a novel method, Signal Dissection by Correlation Maximization (SDCM), to achieve DR of our data.
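As a minimal illustration of the conventional route, the following sketch projects hypothetical spectra onto their leading principal components via the SVD of the centered data matrix. The array sizes and component count are assumptions for illustration, not parameters from this work.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 spectra, each measured over 500 wavelength bins.
spectra = rng.normal(size=(100, 500))

# Center the data; the right singular vectors of the centered matrix
# are the principal axes, ordered by explained variance.
centered = spectra - spectra.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Project onto the top few components to obtain the reduced representation.
n_components = 3
reduced = centered @ Vt[:n_components].T  # shape (100, 3)
```

The singular values in `S` are non-increasing, so truncating to the first `n_components` axes retains the directions of largest variance.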

SDCM is an unsupervised algorithm for the detection of superposed correlations in high-dimensional data sets [38]. Conceptually, it can be thought of as an extension of PCA to non-orthogonal axes of correlation, where instead of projecting out detected dimensions, the discovered axes of correlation are iteratively subtracted (dissected) from the data. Initially developed in bioinformatics for the clustering of gene expression data, it can be applied generically to any high-dimensional data containing (overlapping) subspaces of correlated measurements.

We denote by $\mathcal{M}_{N_f,N_m}$ the set of real-valued $N_f \times N_m$ matrices, where $N_f$ is the number of features in the data and $N_m$ the number of measurements. The $N_f$ row vectors and the $N_m$ column vectors belong to different vector spaces, referred to as feature space and measurement space, respectively.
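The row/column convention can be made concrete with a small NumPy sketch; the matrix sizes here are arbitrary placeholders.

```python
import numpy as np

# Hypothetical sizes: 4 features measured in 3 measurements.
Nf, Nm = 4, 3
D = np.arange(Nf * Nm, dtype=float).reshape(Nf, Nm)

# A row vector traces one feature across all measurements (length Nm);
# a column vector traces one measurement across all features (length Nf).
row = D[0]     # lives in a space of dimension Nm
col = D[:, 0]  # lives in a space of dimension Nf
```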

The main assumption of SDCM is that the input data $D \in \mathcal{M}_{N_f,N_m}$ is a superposition $D = \sum_{k=1}^{n} E_k + \eta$ of submatrices $E_k \in \mathcal{M}_{N_f,N_m}$ (also called signatures) and residual noise $\eta$. We interpret each $E_k$ as a physically meaningful hypothesis about the data, e.g. a common physical or chemical property due to which some samples and features are correlated. As superposition is a non-bijective operation, we need further conditions to dissect $D$ into the separate $E_k$. We assume that each $E_k$ is bimonotonic, i.e. that there exists an ordering $I_f$ of the $N_f$ indices and an ordering $I_m$ of the $N_m$ indices such that the reordered matrix $\tilde{E}_k = E_k(I_f, I_m)$ is monotonic along all rows and columns. Thus, after reordering, the correlations follow monotonic curves in both feature and measurement space. While this bimonotonicity requirement restricts the applicability of the algorithm, it allows an unambiguous dissection of $D$ into the $E_k$ components. In contrast to PCA, it also allows the detection of non-linear (bi)monotonic correlations whose axes are non-orthogonal.
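The bimonotonicity condition is easy to state operationally: under some row ordering $I_f$ and column ordering $I_m$, every row and every column of the reordered matrix must be monotonic. A small checker along these lines (the function names are ours, not from the SDCM implementation):

```python
import numpy as np

def _monotonic(v):
    """True if v is non-decreasing or non-increasing."""
    d = np.diff(v)
    return bool(np.all(d >= 0) or np.all(d <= 0))

def is_bimonotonic(E, I_f, I_m):
    """Check that E, reordered by row ordering I_f and column ordering I_m,
    is monotonic along all rows and all columns."""
    R = E[np.ix_(I_f, I_m)]
    return all(_monotonic(r) for r in R) and all(_monotonic(c) for c in R.T)

# A rank-1 signature with sorted factors is bimonotonic under the identity ordering.
E = np.outer([1.0, 2.0, 4.0], [0.5, 1.0])
ok = is_bimonotonic(E, [0, 1, 2], [0, 1])        # True
bad = is_bimonotonic(E, [1, 0, 2], [0, 1])       # False: columns 1, 0.5, 2 are not monotonic
```

Note that the orderings are shared by all rows and all columns of a signature; it is this joint constraint that makes the dissection unambiguous.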

SDCM dissects the data in four steps:

These four steps are performed iteratively until no further representatives of correlation axes can be found. SDCM treats rows and columns completely symmetrically. For every signature, each feature and each sample is assigned a strength value $s$ and a weight value $w$. The strength value (in units of the input data) quantifies the position along the eigensignal. The weight $w \in [-1,1]$ quantifies how strongly the feature or sample participates in the signature, i.e. how close to the eigensignal it is. Typically, the number of detected signatures is orders of magnitude smaller than the number of input features, giving rise to an effective DR of the data.
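The dissect-and-iterate structure can be sketched with a toy loop that repeatedly estimates the dominant axis of correlation and subtracts its contribution from the residual. Here SVD serves as a stand-in for SDCM's correlation-maximization step, so this only mimics the iterative dissection, not the actual algorithm.

```python
import numpy as np

def dissect(D, n_signatures):
    """Toy sketch of iterative dissection: estimate the dominant signature,
    subtract it from the data, and repeat on the residual. SVD is used as a
    stand-in for SDCM's correlation-maximization step (an assumption for
    illustration only)."""
    residual = np.asarray(D, dtype=float).copy()
    signatures = []
    for _ in range(n_signatures):
        U, S, Vt = np.linalg.svd(residual, full_matrices=False)
        E_k = S[0] * np.outer(U[:, 0], Vt[0])  # dominant signature estimate
        signatures.append(E_k)
        residual = residual - E_k              # dissect it from the data
    return signatures, residual

# A superposition of two rank-1 signatures is fully dissected in two passes.
D = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 2.0]) + np.outer([0.0, 1.0, 0.0], [3.0, 1.0, 1.0])
sigs, res = dissect(D, n_signatures=2)
```

After the loop, the residual plays the role of the noise term $\eta$; in this noise-free toy case it vanishes.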
