Dimensional reduction (DR) aims to project high-dimensional data, e.g. spectra measured over a large number of wavelength bins, onto a lower-dimensional space. In this work, we used both a conventional method, Principal Component Analysis (PCA), and a novel method, Signal Dissection by Correlation Maximization (SDCM), to perform DR on our data.
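For the PCA part, a minimal sketch of such a projection is shown below; it assumes scikit-learn, and the array shapes and the choice of 10 components are illustrative placeholders rather than settings used in this work.

```python
# Minimal PCA sketch (assumes scikit-learn); `spectra`, its shape, and the choice
# of 10 components are illustrative placeholders, not values from this work.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
spectra = rng.normal(size=(500, 2048))   # 500 spectra, 2048 wavelength bins (toy data)

pca = PCA(n_components=10)
scores = pca.fit_transform(spectra)      # reduced representation, shape (500, 10)
print(scores.shape, pca.explained_variance_ratio_[:3])
```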
SDCM is an unsupervised algorithm for the detection of superposed correlations in high-dimensional data sets [38]. Conceptually, it can be thought of as an extension of PCA to non-orthogonal axes of correlation, where instead of projecting out detected dimensions, the discovered axes of correlation are iteratively subtracted (dissected) from the data. Originally developed in bioinformatics for the clustering of gene expression data, it can be applied generically to any high-dimensional data containing (overlapping) subspaces of correlated measurements.
We denote by $\mathbb{R}^{n \times m}$ the set of real-valued $n \times m$ matrices, where $n$ is the number of features in the data and $m$ the number of measurements. The row vectors and the column vectors of such a matrix belong to different vector spaces, referred to as feature space and measurement space, respectively.
The main assumption of SDCM is that the input data, $X \in \mathbb{R}^{n \times m}$, is a superposition of submatrices $X_k$ (also called signatures) and residual noise $E$, i.e. $X = \sum_k X_k + E$. We interpret each $X_k$ as a physically meaningful hypothesis in the data, e.g. a common physical or chemical property due to which some samples and features are correlated. As superposition is a non-bijective operation, we need further conditions to dissect $X$ into the separate $X_k$. We assume that each $X_k$ is bimonotonic, i.e. that there exists an ordering of the row indices and an ordering of the column indices such that the reordered matrix is monotonic along all rows and columns. Thus, after reordering, the correlations follow monotonic curves in both feature and measurement space. While this bimonotonicity requirement restricts the applicability of the algorithm, it allows an unambiguous dissection of $X$ into its components $X_k$. In contrast to PCA, it also allows the detection of non-linear (bi)monotonic correlations, whose axes are non-orthogonal.
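To make the assumed data model concrete, the following toy sketch constructs a matrix as a superposition of bimonotonic signatures plus noise; the outer-product construction, sizes, and noise level are illustrative assumptions and do not reproduce the signatures of this study.

```python
# Toy illustration of the assumed model X = sum_k X_k + E: each signature X_k is
# built as an outer product of two monotone profiles, which is bimonotonic by
# construction. Sizes, profiles, and noise level are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_features, n_measurements = 200, 50

def toy_signature(n, m, rng):
    f = np.sort(rng.uniform(0, 1, size=n))   # monotone profile over features
    g = np.sort(rng.uniform(0, 1, size=m))   # monotone profile over measurements
    return np.outer(f, g)                    # monotonic along every row and column

X = sum(toy_signature(n_features, n_measurements, rng) for _ in range(3))
X = X + 0.05 * rng.normal(size=X.shape)      # residual noise E
```

In real data, the rows and columns of each signature would first have to be reordered to reveal this monotonic structure.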
SDCM dissects the data in four steps:
These four steps are performed iteratively until no further axes of correlation can be found. SDCM treats rows and columns completely symmetrically: each feature and each sample is assigned a strength value s and a weight value w for every signature. The strength value (in units of the input data) quantifies the position along the eigensignal. The weight quantifies how strongly the feature or sample participates in the signature, i.e. how close to the eigensignal it is. Typically, the number of detected signatures is orders of magnitude smaller than the number of input features, which yields an effective DR of the data.
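The overall dissect-and-subtract structure, together with the per-signature strength and weight outputs, can be sketched as follows. Note that the rank-one SVD fit and the stopping rule below are simple stand-ins for SDCM's correlation maximization and bimonotonic regression, so this is not the authors' algorithm, only a schematic of the loop.

```python
# Schematic of the iterative "detect -> fit -> record strengths/weights -> subtract"
# loop described above. The rank-one SVD fit and the stopping rule are simple
# stand-ins for SDCM's correlation maximization and bimonotonic regression; only
# the loop structure and the per-signature outputs are meant to mirror the text.
import numpy as np

def dissect(X, max_signatures=10, min_explained=1e-2):
    residual = X.astype(float).copy()
    signatures = []
    for _ in range(max_signatures):
        u, s, vt = np.linalg.svd(residual, full_matrices=False)
        if s[0] ** 2 < min_explained * np.sum(s ** 2):
            break                                  # no further axis of correlation found
        eigensignal = s[0] * np.outer(u[:, 0], vt[0])
        signatures.append({
            # crude proxies for SDCM's strength values (carry the units of the input data)
            "feature_strengths": s[0] * u[:, 0],
            "sample_strengths": s[0] * vt[0],
            # crude proxies for SDCM's weights (participation of each feature/sample)
            "feature_weights": np.abs(u[:, 0]),
            "sample_weights": np.abs(vt[0]),
        })
        residual -= eigensignal                    # dissect the signature from the data
    return signatures, residual
```

Applied to a toy matrix like the one constructed above, this skeleton would recover a small number of rank-one components and leave a noise-like residual; SDCM itself replaces each placeholder step with its own, more refined procedures.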