3.2 Continuous relaxation

Elliott Gordon-Rodriguez, Thomas P. Quinn, John P. Cunningham
The key insight of CoDaCoRe is to approximate our combinatorial optimization problem (Equation 4) with a continuous relaxation that can be trained efficiently by gradient descent. Our relaxation is inspired by recent advances in deep learning models with discrete latent variables (Jang et al., 2017; Linderman et al., 2018; Maddison et al., 2017; Mena et al., 2018; Potapczynski et al., 2020). However, we are not aware of any similar proposals for optimizing over disjoint subsets, nor for learning balances or amalgamations in the context of CoDa.

Our relaxation is parameterized by an unconstrained vector of ‘assignment weights’, w ∈ ℝ^p, with one scalar parameter per input dimension (e.g. one weight per bacterial species). The weights are mapped to a vector of ‘soft assignments’ via:

w̃ = 2 · sigmoid(w) − 1,  (5)

where the sigmoid is applied component-wise. Intuitively, large positive weights max out the sigmoid, leading to soft assignments close to +1, whereas large negative weights zero out the sigmoid, resulting in soft assignments close to −1. This mapping is akin to softly assigning input variables to the groups J+ and J−, respectively.
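As a sanity check, the mapping can be sketched in a few lines of NumPy. The explicit form w̃ = 2 · sigmoid(w) − 1 is reconstructed from the surrounding description (saturation at ±1, component-wise sigmoid), not quoted from the paper:

```python
import numpy as np

def soft_assignments(w):
    """Map unconstrained weights w in R^p to soft assignments in (-1, 1).

    Assumed form 2*sigmoid(w) - 1: large positive w_j saturate near +1
    (group J+), large negative w_j near -1 (group J-), and w_j = 0 maps to 0.
    """
    return 2.0 / (1.0 + np.exp(-w)) - 1.0

w = np.array([-5.0, 0.0, 5.0])   # one weight per input dimension
wt = soft_assignments(w)         # ≈ [-0.987, 0.0, 0.987]
```

The mapping is odd-symmetric, so negating a weight simply moves a variable from one group to the other.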

Let us write w̃⁺ = ReLU(w̃) and w̃⁻ = ReLU(−w̃) for the (component-wise) positive and negative parts of w̃, respectively. We approximate balances (Equation 1) with the following relaxation:

B̃(x; w) = (Σ_j w̃⁺_j log x_j) / (Σ_j w̃⁺_j) − (Σ_j w̃⁻_j log x_j) / (Σ_j w̃⁻_j).  (6)
In other words, we approximate the geometric averages over subsets of the inputs by weighted geometric averages over all components (compare Equations 1 and 6).
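A minimal NumPy sketch of the relaxed balance in log scale follows; the notation and exact normalization are reconstructed from the description above, so treat it as illustrative rather than the authors' implementation:

```python
import numpy as np

def relaxed_balance(x, w):
    """Relaxed balance: difference of weighted geometric averages (log scale).

    x: strictly positive composition vector of shape (p,)
    w: unconstrained assignment weights of shape (p,)
    """
    wt = 2.0 / (1.0 + np.exp(-w)) - 1.0   # soft assignments in (-1, 1)
    wp = np.maximum(wt, 0.0)              # positive part, ReLU(wt)
    wn = np.maximum(-wt, 0.0)             # negative part, ReLU(-wt)
    logx = np.log(x)
    # weighted averages of log x = logs of weighted geometric averages
    return logx @ wp / wp.sum() - logx @ wn / wn.sum()

# With saturated weights the relaxation recovers a hard balance:
# gmean{x_1, x_2} versus x_3 gives log(e^0.5) - log(e^2) = -1.5 here.
x = np.array([1.0, np.e, np.e ** 2])
val = relaxed_balance(x, np.array([10.0, 10.0, -10.0]))
```

Driving the weights toward ±∞ collapses the weighted geometric averages onto ordinary geometric averages over the selected subsets, which is what makes the relaxation a faithful surrogate for Equation 1.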

Crucially, this relaxation is differentiable in w, allowing us to construct a surrogate objective function that can be optimized jointly in (w, α, β) by gradient descent:

min_{w, α, β} Σ_i ℓ(y_i, α + β · B̃(x_i; w)).  (7)
Moreover, the computational cost of differentiating this objective function scales linearly in the dimension of w, so our algorithm scales linearly overall. We also note that the functional form of our relaxation (Equation 6) can be exploited to select the learning rate adaptively (i.e. without tuning), resulting in robust convergence across all real and simulated datasets that we considered. We defer the details of our implementation of gradient descent to the Supplementary Material (Supplementary Section SC.1).
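The joint optimization can be illustrated end to end with a toy NumPy sketch. The squared-error loss, the initialization, the fixed learning rate, and the finite-difference gradients are all illustrative stand-ins, not the authors' implementation (which uses adaptive learning rates and is deferred to Supplementary Section SC.1):

```python
import numpy as np

def relaxed_balance(X, w):
    """Row-wise relaxed balance for a data matrix X of shape (n, p)."""
    wt = 2.0 / (1.0 + np.exp(-w)) - 1.0   # soft assignments in (-1, 1)
    wp, wn = np.maximum(wt, 0.0), np.maximum(-wt, 0.0)
    L = np.log(X)
    return L @ wp / wp.sum() - L @ wn / wn.sum()

def surrogate_loss(params, X, y):
    # params packs (w, alpha, beta); squared error stands in for the
    # task loss of the combinatorial problem (Equation 4)
    w, alpha, beta = params[:-2], params[-2], params[-1]
    pred = alpha + beta * relaxed_balance(X, w)
    return np.mean((y - pred) ** 2)

def numerical_grad(f, params, eps=1e-6):
    # central finite differences, for self-containedness only;
    # a real implementation would differentiate Equation 6 analytically
    g = np.zeros_like(params)
    for j in range(len(params)):
        hi, lo = params.copy(), params.copy()
        hi[j] += eps
        lo[j] -= eps
        g[j] = (f(hi) - f(lo)) / (2.0 * eps)
    return g

# Toy data whose response depends on a single log-ratio, log(x_1 / x_2)
rng = np.random.default_rng(0)
X = rng.lognormal(size=(50, 3))
y = 1.0 + 2.0 * (np.log(X[:, 0]) - np.log(X[:, 1]))

# Initialize w = (1, -1, 0), alpha = beta = 0; w must not be all zero,
# or both soft-assignment groups would be empty and the normalizers vanish
params = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
loss_start = surrogate_loss(params, X, y)
for _ in range(500):
    params -= 0.1 * numerical_grad(lambda p: surrogate_loss(p, X, y), params)
loss_end = surrogate_loss(params, X, y)
```

Each finite-difference step costs one loss evaluation per parameter, whereas analytic (or automatic) differentiation of Equation 6 yields the linear-in-p per-step cost cited above.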
