2.2. PACA: Phenotype-Aware Component Analysis


Using a singular value decomposition, we can consider an alternative formulation for the model in (1)-(2):

$$X = U_0 \Lambda_{X_0} V_{X_0}^\top + U_1 \Lambda_{X_1} V_{X_1}^\top, \qquad Y = U_0 \Lambda_{Y_0} V_{Y_0}^\top,$$
where each of $U_0 \in \mathbb{R}^{m \times k_0}$, $V_{X_0} \in \mathbb{R}^{n_1 \times k_0}$, $V_{Y_0} \in \mathbb{R}^{n_0 \times k_0}$, $U_1 \in \mathbb{R}^{m \times k_1}$, and $V_{X_1} \in \mathbb{R}^{n_1 \times k_1}$ forms an orthonormal basis, and $\Lambda_{X_0}$, $\Lambda_{X_1}$, $\Lambda_{Y_0}$ are diagonal matrices of singular values. Under this representation, we are interested in learning $V_{X_1}$, a linearly-transformed surrogate for $Z_{X_1}$ in Eq. (1). Given the directions of the shared sources of variation $U_0$, note that

$$\left(I_m - U_0 U_0^\top\right) X = U_1 \Lambda_{X_1} V_{X_1}^\top, \tag{6}$$
which establishes a way to remove the expected effect that $U_0$ induces on X (i.e., the signals in X that come from sources of variation that also exist in controls). Our PACA algorithm therefore estimates and removes the effects of $U_0$ on X, and then applies PCA to the residual in order to capture $V_{X_1}$. Our main focus is thus estimating $U_0$.
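For illustration, the following is a minimal sketch of this residualize-then-PCA step in Python with NumPy; it is not the authors' released implementation, and the function name and interface are ours.

```python
import numpy as np

def residualize_and_pca(X, U0_hat, n_components=1):
    """Remove the span of U0_hat (m x r, orthonormal columns) from the columns
    of X (m features x n1 cases), then take the top PCs of the residual."""
    X_resid = X - U0_hat @ (U0_hat.T @ X)                     # (I - U0 U0^T) X
    X_resid = X_resid - X_resid.mean(axis=1, keepdims=True)   # center each feature
    # PCA via SVD; the top right singular vectors (length n1) act as a
    # surrogate for V_X1, the case-specific sample-level variation.
    U, s, Vt = np.linalg.svd(X_resid, full_matrices=False)
    return Vt[:n_components].T, s[:n_components]
```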

Since $U_0$ appears in both X and Y, we can estimate it by employing a canonical correlation analysis (CCA) [59]. Importantly, unlike in a typical application of CCA, where we wish to find linear transformations of the features that yield the highest correlation between the samples in two datasets (i.e., vectors in the sample space), here we seek linear transformations of the samples that provide vectors in the feature space $\mathbb{R}^m$. Specifically, we find the first pair of canonical variables by solving:

$$\left(\hat{a}, \hat{b}\right) = \underset{a \in \mathbb{R}^{n_1},\; b \in \mathbb{R}^{n_0}}{\arg\max}\; \operatorname{corr}\!\left(Xa, Yb\right). \tag{7}$$
Setting $\hat{u}^0_1 = X\hat{a}$ yields the representation in X of the strongest direction of shared variation across X and Y. Let $S_{XY}$ be the empirical sample cross-covariance of the matrices X and Y (and $S_{XX}$, $S_{YY}$ the corresponding sample covariances); the solutions for $\hat{a}$ and $\hat{b}$ are known to be given by the eigenvector corresponding to the top eigenvalue of $S_{XX}^{-1} S_{XY} S_{YY}^{-1} S_{YX}$ and the eigenvector corresponding to the top eigenvalue of $S_{YY}^{-1} S_{YX} S_{XX}^{-1} S_{XY}$, respectively [59]. This procedure can be repeated iteratively by restricting each new vector $\hat{u}^0_r$ to be orthogonal to the previous canonical variables $\hat{u}^0_1, \ldots, \hat{u}^0_{r-1}$ (and similarly for the variables of Y). Eventually, the collection of these vectors, $\hat{U}_0$, can be used for removing the shared sources of variation from X following Eq (6); see a summary of PACA in Algorithm 1. As we discuss next, the choice of the number of components r to be used can be informed by the structure induced in X and Y by the shared variation.
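A compact illustration of this CCA step is sketched below (Python/NumPy). It uses a whitening-based SVD solution, which is algebraically equivalent to the eigen-decompositions above; the small ridge term and the final orthonormalization are practical choices of this sketch rather than part of the original algorithm.

```python
import numpy as np

def shared_directions(X, Y, r, ridge=1e-8):
    """Estimate the top-r shared directions U0_hat (m x r) via CCA between the
    columns (samples) of X (m x n1 cases) and Y (m x n0 controls), treating
    the m features as observations."""
    m = X.shape[0]
    Xc = X - X.mean(axis=0)          # center each sample (column) over the features
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (m - 1) + ridge * np.eye(X.shape[1])   # covariance of X's columns
    Syy = Yc.T @ Yc / (m - 1) + ridge * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (m - 1)                                 # cross-covariance
    # Whitening-based CCA: the SVD of Kx^{-1} Sxy Ky^{-T} yields the canonical
    # pairs, equivalent to the eigen-decompositions described in the text.
    Kx = np.linalg.cholesky(Sxx)
    Ky = np.linalg.cholesky(Syy)
    M = np.linalg.solve(Kx, Sxy) @ np.linalg.inv(Ky).T
    U, rho, Vt = np.linalg.svd(M)
    A = np.linalg.solve(Kx.T, U[:, :r])      # canonical coefficients a_1, ..., a_r
    U0_hat = Xc @ A                          # representation in X of the shared variation
    U0_hat, _ = np.linalg.qr(U0_hat)         # orthonormalize for use as a projection basis
    return U0_hat, rho[:r]
```

The resulting U0_hat (or its leading k columns) would then be passed to the residualization step sketched earlier.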

We note that the CCA step in PACA imposes the limitation m ≥ max(n0, n1). In practice, there may be datasets in which this condition does not hold. In such cases, we can, in principle, apply PACA to a subset of the samples; however, excluding data is clearly sub-optimal. Instead, we can use a randomized version of PACA that can operate under the setup m < max(n0, n1) by using multiple random subsets of the data to estimate the full sample covariance matrix of X (while accounting for the shared sources of variation between X and Y). See Supplementary Methods and Supplementary Algorithm 2 for details.

In typical settings, the parameter $k_0$, the dimension of $U_0$, is unknown. We may estimate it by setting the estimate $\hat{k}_0$ to be the largest value r that reveals a significant correlation between the subspaces of X and Y induced by $\hat{a}_r, \hat{b}_r$; this can be achieved, for example, using Bartlett's chi-squared test or Rao's approximate F statistic [60]. However, in practice, we found that in the data we analyzed $k_0$ appears to be very large (typically a few hundred). Hence, this approach can lead to over-regularization and a reduction in statistical power due to the unnecessary removal of a large number of axes of shared variation. To see why, recall that our goal is to systematically target variation of interest that is typically weaker than the dominant sources of variation in the data. While standard dimensionality reduction tools cannot achieve this when applied to the original data, we expected them to be effective in revealing the target variation when applied to a residualized (adjusted) version of the data in which all sources of variation that are stronger than the target variation have been removed. In other words, there is no need to estimate the true $k_0$ and remove a structure of this dimension from the data prior to applying dimensionality reduction. Instead, we wish to find the minimal number of axes of variation k that we need to remove in order to reveal variation that is unique to cases. Below, we provide a brief description of the algorithm; see Algorithm 2 for complete details.
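For reference, a generic sequential form of Bartlett's chi-squared test is sketched below (Python/SciPy); this is the standard textbook statistic rather than necessarily the exact procedure of [60], and, as explained above, PACA does not ultimately rely on estimating $k_0$ this way.

```python
import numpy as np
from scipy import stats

def bartlett_num_significant(rho, n_obs, p, q, alpha=0.05):
    """Sequentially test how many leading canonical correlations are non-zero.
    rho: canonical correlations in descending order; n_obs: number of
    observations they were computed over (here, the m features); p, q: number
    of variables in the two views (here, n1 cases and n0 controls)."""
    rho = np.asarray(rho, dtype=float)
    k = 0
    for r in range(len(rho)):
        # H0: the canonical correlations rho_{r+1}, ..., are all zero
        stat = -(n_obs - 1 - (p + q + 1) / 2.0) * np.sum(np.log(1.0 - rho[r:] ** 2))
        df = (p - r) * (q - r)
        if stats.chi2.sf(stat, df) < alpha:
            k = r + 1        # at least r+1 significant correlations
        else:
            break
    return k
```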

Given all shared axes of variation learned from the data using the procedure in Eq. (7), we learn k ∈ {1, …, min(n0, n1)} using a binary search, as follows. At each candidate value of k, we evaluate the variance of the top PC computed from the residualized X, after accounting for the top k shared axes of variation; this same PC, which may reflect case-specific variation, is also used to evaluate the variance it can explain in the residualized Y matrix. Comparing these variances to "null" variances obtained by permuting the loadings of the PC allows us to determine whether accounting for k shared axes of variation is sufficient to detect structure that is unique to X. This evaluation can lead to one of four scenarios, based on which we decide on the next partition of the binary search: there can be (i) significant variation in cases but not in controls, (ii) significant variation in both cases and controls, (iii) no significant variation in either cases or controls, or (iv) significant variation in controls but not in cases.
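A simplified sketch of this per-candidate-k evaluation is given below (Python/NumPy); the function and variable names are illustrative, and the permutation-based significance call is a stand-in for the exact test used in Algorithm 2.

```python
import numpy as np

def evaluate_k(X, Y, U0_hat, k, n_perm=100, alpha=0.05, seed=None):
    """Residualize X and Y on the top-k shared axes, take the top PC of the
    residualized X, and assess (via permutation of the PC loadings) whether it
    captures significant variance in cases and in controls."""
    rng = np.random.default_rng(seed)
    B = U0_hat[:, :k]
    Xr = X - B @ (B.T @ X)           # residualized cases (m x n1)
    Yr = Y - B @ (B.T @ Y)           # residualized controls (m x n0)
    # top PC (loadings over the m features) of the residualized cases
    U, s, Vt = np.linalg.svd(Xr - Xr.mean(axis=1, keepdims=True), full_matrices=False)
    pc = U[:, 0]
    var_x = np.var(pc @ Xr)          # variance the PC captures in cases
    var_y = np.var(pc @ Yr)          # variance it explains in controls
    # "null" variances: permute the PC loadings and recompute
    null_x = np.array([np.var(rng.permutation(pc) @ Xr) for _ in range(n_perm)])
    null_y = np.array([np.var(rng.permutation(pc) @ Yr) for _ in range(n_perm)])
    sig_x = np.mean(null_x >= var_x) < alpha
    sig_y = np.mean(null_y >= var_y) < alpha
    return var_x, var_y, sig_x, sig_y
```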

Scenarios (i) and (iii) lead us to consider lower values of k due to possible over-correction. The former means we have revealed case-specific variation, yet we may be able to refine the signal if we can identify it using a lower k, and the latter indicates that all the structure in the data has been removed. Scenario (iv) indicates a violation of our model assumptions, which leads to termination, and lastly, scenario (ii) indicates residual shared variation, which suggests increasing k. Empirically, we found that imposing a threshold γ on the maximum ratio between the variance of X's PC and the variance it explains in Y under scenario (ii) is more stable and improves performance. Specifically, if the ratio between the variances of X and Y is greater than the predefined threshold, then we handle this case similarly to scenario (i); otherwise, we assume under-correction and consider a larger k. Throughout our analysis, unless stated otherwise, we used γ = 10.
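Putting the four scenarios and the γ rule together, the decision step of the search might be sketched as follows (illustrative only; the termination and refinement bookkeeping of Algorithm 2 is omitted).

```python
def decide_next_step(var_x, var_y, sig_x, sig_y, gamma=10.0):
    """Illustrative decision rule for the binary search over k, following the
    four scenarios described in the text."""
    if sig_x and not sig_y:
        return "decrease_k"    # (i) case-specific variation found; try to refine with smaller k
    if not sig_x and not sig_y:
        return "decrease_k"    # (iii) all structure removed; likely over-correction
    if sig_y and not sig_x:
        return "terminate"     # (iv) violates the model assumptions
    # (ii) significant variation in both cases and controls: apply the gamma ratio rule
    if var_x / var_y > gamma:
        return "decrease_k"    # treated like scenario (i)
    return "increase_k"        # under-correction; remove more shared axes
```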
