Model inference

Avik Biswas; Allan Haldane; Eddy Arnold; Ronald M Levy


This method is extracted from research article: eLife, Oct 2019

Epistasis and entrenchment of drug resistance in HIV-1 subtype B

DOI: 10.7554/eLife.50524


The goal of the model inference is to find a suitable set of Potts parameters $\{h, J\}$ that fully determines the Potts Hamiltonian $E(S)$ and the total probability distribution $P^{m}(S)$ given in Equation 1 and Equation 2, respectively. This is done by obtaining the set of fields and couplings $\{h, J\}$ that yields bivariate marginal estimates which best reproduce the empirical bivariate marginals.
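As a minimal sketch of the two quantities named above, the following computes a Potts energy and the empirical bivariate marginals of a small alignment. Equations 1 and 2 are not reproduced in this extract, so the form $E(S) = \sum_i h^i_{S_i} + \sum_{i<j} J^{ij}_{S_i S_j}$ and its sign convention are assumptions, as are the function names and the nested-list data layout.

```python
from itertools import combinations

def potts_energy(seq, h, J):
    """Assumed form: E(S) = sum_i h[i][S_i] + sum_{i<j} J[(i, j)][S_i][S_j]."""
    L = len(seq)
    e = sum(h[i][seq[i]] for i in range(L))
    e += sum(J[(i, j)][seq[i]][seq[j]] for i, j in combinations(range(L), 2))
    return e

def bivariate_marginals(seqs, q):
    """Return f[(i, j)][a][b]: fraction of sequences with letter a at i and b at j (i < j)."""
    L = len(seqs[0])
    n = len(seqs)
    f = {(i, j): [[0.0] * q for _ in range(q)]
         for i, j in combinations(range(L), 2)}
    for s in seqs:
        for i, j in combinations(range(L), 2):
            f[(i, j)][s[i]][s[j]] += 1.0 / n
    return f

# Hypothetical toy alignment: 4 sequences, 2 positions, 2-letter alphabet.
msa = [(0, 1), (0, 1), (1, 0), (1, 1)]
f_emp = bivariate_marginals(msa, q=2)
```

Sequences are encoded as tuples of integer letter indices; a real alignment would first be mapped to such indices.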

A number of techniques to this effect have been developed previously (Mézard and Mora, 2009; Weigt et al., 2009; Balakrishnan et al., 2011; Cocco and Monasson, 2011; Morcos et al., 2011; Haq et al., 2012; Jones et al., 2012; Ekeberg et al., 2013; Ferguson et al., 2013; Barton et al., 2016a). We follow the methodology given in Ferguson et al. (2013). Given a set of fields and couplings, the bivariate marginals are estimated by generating sequences through Markov Chain Monte Carlo (MCMC) sampling, using the Metropolis criterion so that each sequence is generated with probability proportional to the exponentiated Potts Hamiltonian. A multidimensional Newton search algorithm is then used to find the optimal set of Potts parameters $\{h, J\}$. The descent step in the Newton search is determined by comparing the bivariate marginal estimates generated from the MCMC sample with the empirical bivariate marginal distribution. Although approximations are made in the computation of the Newton steps, the advantage of this method is that it avoids making explicit approximations to the model probability distribution. The method is limited by the sampling error of the input empirical marginal distributions and can also be computationally intensive. Our GPU implementation of the MCMC method makes it computationally tractable without resorting to more approximate inverse inference methods. The MCMC algorithm implemented on GPUs has been used to infer Potts models of sequence covariation which are sufficiently accurate to predict higher order marginals, as shown by Flynn et al. (2017) and Haldane et al. (2018). For a full description of the inference technique, we refer the reader to the supplemental information of Haldane et al. (2016).
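The sampling step described above can be sketched as a single-site Metropolis sampler. This is an illustrative CPU toy, not the authors' GPU implementation: the proposal scheme (one random position, one random letter), the sign convention $P(S) \propto e^{-E(S)}$, and all names and parameters here are assumptions.

```python
import math
import random
from itertools import combinations

def energy(seq, h, J):
    """Assumed Potts energy: fields plus pairwise couplings over i < j."""
    L = len(seq)
    return (sum(h[i][seq[i]] for i in range(L))
            + sum(J[(i, j)][seq[i]][seq[j]]
                  for i, j in combinations(range(L), 2)))

def metropolis_sample(h, J, q, n_samples, steps_per_sample=10, seed=0):
    """Draw sequences with probability proportional to exp(-E(S))."""
    rng = random.Random(seed)
    L = len(h)
    seq = [rng.randrange(q) for _ in range(L)]
    e = energy(seq, h, J)
    samples = []
    for _ in range(n_samples):
        for _ in range(steps_per_sample):
            i = rng.randrange(L)
            old = seq[i]
            seq[i] = rng.randrange(q)          # propose a single-site change
            e_new = energy(seq, h, J)
            # Metropolis criterion: accept with probability min(1, exp(-(E_new - E_old))).
            if e_new <= e or rng.random() < math.exp(e - e_new):
                e = e_new
            else:
                seq[i] = old                   # reject: restore the old letter
        samples.append(tuple(seq))
    return samples

# Hypothetical toy model: fields strongly disfavor letter 1, couplings are zero.
h_toy = [[0.0, 5.0], [0.0, 5.0]]
J_toy = {(0, 1): [[0.0, 0.0], [0.0, 0.0]]}
samples = metropolis_sample(h_toy, J_toy, q=2, n_samples=500)
```

Recomputing the full energy at every step keeps the sketch short; a practical sampler would compute only the energy difference for the changed position.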

The scheme for choosing the Newton update step is important. Ferguson et al. (2013) developed a quasi-Newton parameter update approach which determines the updates to $J^{ij}$ and $h^{i}$ by inverting the system's Jacobian. To simplify and speed up the computation, we take advantage of the gauge invariance of the Potts Hamiltonian. We use a fieldless gauge in which $h^{i} = 0$ for all $i$, and we compute the expected change in the model bivariate marginals $\Delta f_{m}^{ij}$ (hereafter dropping the $m$ subscript) due to a change in $J^{ij}$ to first order by:

$$\Delta f^{ij}_{S_{i} S_{j}} = \sum_{kl,\, S_{k} S_{l}} \frac{\partial f^{ij}_{S_{i} S_{j}}}{\partial J^{kl}_{S_{k} S_{l}}} \Delta J^{kl}_{S_{k} S_{l}} + \sum_{k,\, S_{k}} \frac{\partial f^{ij}_{S_{i} S_{j}}}{\partial h^{k}_{S_{k}}} \Delta h^{k}_{S_{k}} \qquad (5)$$

Computing the Jacobian $\frac{\partial f^{ij}_{S_{i} S_{j}}}{\partial J^{kl}_{S_{k} S_{l}}}$ and inverting the linear system in Equation 5 to solve for $\Delta J^{ij}$ and $\Delta h^{i}$ given the $\Delta f^{ij}$ is the challenging part of the computation. We choose the $\Delta f^{ij}$ as:

$$\Delta f^{ij} = \gamma \left( f^{ij}_{\mathrm{emp}} - f^{ij} \right) \qquad (6)$$

with a small enough damping parameter $\gamma$ such that the linear (and other) approximations are valid.
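To illustrate how a damped Newton step behaves, here is a minimal sketch on a one-parameter, two-state toy model, where the Jacobian is a single number and "inverting the linear system" reduces to a division. It assumes the damped target takes the form $\Delta f = \gamma (f_{\mathrm{emp}} - f)$; the toy model $f(x) = 1/(1 + e^{x})$ and every name below are hypothetical and stand in for the full Potts Jacobian, which is not reproduced here.

```python
import math

def damped_target(f_emp, f_model, gamma):
    """Requested change in the marginals: gamma * (f_emp - f_model), elementwise."""
    return [gamma * (fe - fm) for fe, fm in zip(f_emp, f_model)]

def toy_newton_fit(f_emp, gamma=0.5, n_iter=50):
    """Fit x so that the toy marginal f(x) = 1 / (1 + exp(x)) matches f_emp."""
    x = 0.0
    for _ in range(n_iter):
        f = 1.0 / (1.0 + math.exp(x))
        jac = -f * (1.0 - f)                   # df/dx for the toy model
        (df,) = damped_target([f_emp], [f], gamma)
        x += df / jac                          # solve the 1x1 system jac * dx = df
    return x

x_fit = toy_newton_fit(0.2)
f_fit = 1.0 / (1.0 + math.exp(x_fit))          # approaches the target 0.2
```

A smaller `gamma` takes shorter steps, which keeps each update inside the range where the first-order expansion is a reasonable approximation, at the cost of more iterations.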

The computational cost of fitting $\binom{L}{2} \times (4 - 1)^{2} + 93 \times (4 - 1) = 372,816$ model parameters for the smallest protein in our analysis, PR, on 2 NVIDIA K80 or 4 NVIDIA TitanX GPUs is $\approx 20$ h. The methodology followed in this analysis is almost the same as that of Flynn et al. (2017) (see Materials and methods) for HIV-1 PR. For a more detailed description of data preprocessing, model inference, and comparison with other methods, we refer the reader to the SI and text of Flynn et al. (2017), Haldane et al. (2016), Haldane et al. (2018), and Haldane and Levy (2019).

A repository containing the inference methodology code is available at https://github.com/ComputationalBiophysicsCollaborative/IvoGPU and the final MSAs are available at https://github.com/ComputationalBiophysicsCollaborative/elife_data (copy archived at https://github.com/elifesciences-publications/elife_data). Appendix 1 contains a figure showing the sequence coverage for RT and tables of the most entrenched and least entrenched sequences with the same Hamming distances, for some important DRMs in PR, RT and IN.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
