The goal of the model inference is to find a suitable set of Potts parameters that fully determines the Potts Hamiltonian and the total probability distribution given in Equation 1 and Equation 2, respectively. This is done by obtaining the set of fields and couplings that yield bivariate marginal estimates that best reproduce the empirical bivariate marginals.
A number of techniques for this purpose have been developed previously (Mézard and Mora, 2009; Weigt et al., 2009; Balakrishnan et al., 2011; Cocco and Monasson, 2011; Morcos et al., 2011; Haq et al., 2012; Jones et al., 2012; Ekeberg et al., 2013; Ferguson et al., 2013; Barton et al., 2016a). We follow the methodology of Ferguson et al. (2013). Given a set of fields and couplings, the bivariate marginals are estimated by generating sequences through Markov chain Monte Carlo (MCMC) sampling, using the Metropolis criterion so that each sequence is generated with probability proportional to the exponentiated Potts Hamiltonian. A multidimensional Newton search algorithm is then used to find the optimal set of Potts parameters. The descent step in the Newton search is determined by comparing the bivariate marginal estimates from the MCMC sample with the empirical bivariate marginal distribution. Although approximations are made in computing the Newton steps, the advantage of this method is that it avoids explicit approximations to the model probability distribution. The method is limited by the sampling error of the input empirical marginal distributions and can be computationally intensive. Our GPU implementation of the MCMC method makes it computationally tractable without resorting to more approximate inverse inference methods. The GPU-based MCMC algorithm has been used to infer Potts models of sequence covariation that are sufficiently accurate to infer higher-order marginals, as shown by Flynn et al. (2017) and Haldane et al. (2018). For a full description of the inference technique, we refer the reader to the supplemental information of Haldane et al. (2016).
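The sampling step described above can be illustrated with a minimal serial sketch. It assumes a Potts energy with couplings only, stored in a hypothetical array `J[i, j, a, b]` for `i < j`, the sign convention P(S) ∝ exp(−E(S)), and single-site Metropolis moves; the function names and parameters are illustrative and do not come from the GPU code referenced in the text.

```python
import numpy as np

def potts_energy(seq, J):
    """Sum of pairwise couplings J[i, j, a, b] over all position pairs i < j."""
    L = len(seq)
    return sum(J[i, j, seq[i], seq[j]] for i in range(L) for j in range(i + 1, L))

def metropolis_sample(J, n_samples, n_thin, rng, n_burn=1000):
    """Draw sequences with probability proportional to exp(-E) (assumed convention)."""
    L, _, q, _ = J.shape
    seq = rng.integers(q, size=L)          # random starting sequence
    energy = potts_energy(seq, J)
    samples = []
    for step in range(n_burn + n_samples * n_thin):
        trial = seq.copy()
        trial[rng.integers(L)] = rng.integers(q)   # propose a single-site change
        d_e = potts_energy(trial, J) - energy
        # Metropolis criterion: always accept downhill, uphill with prob exp(-dE)
        if d_e <= 0 or rng.random() < np.exp(-d_e):
            seq, energy = trial, energy + d_e
        if step >= n_burn and (step - n_burn + 1) % n_thin == 0:
            samples.append(seq.copy())
    return np.array(samples)

def bivariate_marginals(samples, q):
    """Estimate f[i, j, a, b] = P(s_i = a, s_j = b) for i < j from sampled sequences."""
    n, L = samples.shape
    f = np.zeros((L, L, q, q))
    for i in range(L):
        for j in range(i + 1, L):
            np.add.at(f[i, j], (samples[:, i], samples[:, j]), 1.0)
    return f / n
```

In a fitting loop, the marginals returned by `bivariate_marginals` would be compared with the empirical ones to decide the next parameter update; the thinning and burn-in lengths here are arbitrary choices for a toy system.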
The scheme for choosing the Newton update step is important. Ferguson et al. (2013) developed a quasi-Newton parameter update approach, which determines the updates to the couplings and fields by inverting the system’s Jacobian. To simplify and speed up the computation, we take advantage of the gauge invariance of the Potts Hamiltonian. We use a fieldless gauge in which the fields vanish at every position, and we compute the expected change in the model bivariate marginals (hereafter dropping the m subscript) due to a change in the couplings to first order (Equation 5).
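The gauge invariance mentioned above can be checked numerically on a small system. The sketch below is one possible construction, not necessarily the one used by the authors: it assumes fields `h[i, a]` and couplings `J[i, j, a, b]` for `i < j` (hypothetical array layouts), and moves each field into the couplings by splitting it evenly over the L − 1 pairs that contain its position, so that every sequence keeps the same energy.

```python
import itertools
import numpy as np

def energy_with_fields(seq, h, J):
    """Potts energy: fields over positions plus couplings over pairs i < j."""
    L = len(seq)
    e = sum(h[i, seq[i]] for i in range(L))
    e += sum(J[i, j, seq[i], seq[j]] for i in range(L) for j in range(i + 1, L))
    return e

def to_fieldless_gauge(h, J):
    """Return (zero fields, new couplings) giving identical energies.

    Each position belongs to L - 1 pairs, so adding h[i] / (L - 1) to every
    pair containing i restores exactly h[i] when the pairs are summed.
    """
    L, q = h.shape
    J_new = J.copy()
    for i in range(L):
        for j in range(i + 1, L):
            J_new[i, j] += (h[i][:, None] + h[j][None, :]) / (L - 1)
    return np.zeros_like(h), J_new
```

Because the energies are unchanged for every sequence, the probability distribution and all its marginals are unchanged as well, which is why the fit can be carried out over couplings alone.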
Computing the Jacobian and inverting the linear system in Equation 5 to solve for the changes in the couplings and fields, given the desired changes in the bivariate marginals, is the challenging part of the computation. We choose these desired changes using a damping parameter small enough that the linear (and other) approximations remain valid.
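The role of the damping parameter can be seen in a toy version of the update. The sketch below makes several assumptions that are not taken from the source: the system is small enough to enumerate all sequences exactly (standing in for MCMC), the model is P(S) ∝ exp(−E(S)) in a fieldless gauge, the target change in the marginals is taken as `gamma * (f_emp - f_model)`, and the exact Jacobian is solved by least squares in place of the quasi-Newton approximation described in the text. All names are hypothetical.

```python
import itertools
import numpy as np

def pair_indicators(L, q):
    """Rows: all q**L sequences; columns: indicators 1[s_i = a, s_j = b], i < j."""
    seqs = list(itertools.product(range(q), repeat=L))
    pairs = list(itertools.combinations(range(L), 2))
    X = np.zeros((len(seqs), len(pairs) * q * q))
    for n, s in enumerate(seqs):
        for p, (i, j) in enumerate(pairs):
            X[n, p * q * q + s[i] * q + s[j]] = 1.0
    return X

def model_marginals(X, J):
    """Exact bivariate marginals f and indicator covariance under exp(-X @ J)."""
    E = X @ J
    w = np.exp(-(E - E.min()))             # shift energies for numerical stability
    p = w / w.sum()
    f = X.T @ p
    cov = X.T @ (X * p[:, None]) - np.outer(f, f)
    return f, cov

def fit_couplings(f_emp, X, gamma=0.3, n_iter=200):
    """Damped Newton iteration on the couplings (flattened vector J)."""
    J = np.zeros(X.shape[1])
    for _ in range(n_iter):
        f, cov = model_marginals(X, J)
        target_df = gamma * (f_emp - f)    # damped step toward the empirical marginals
        # With P ~ exp(-E), the Jacobian df/dJ equals -cov. It is singular
        # (residual gauge freedom), so solve in the least-squares sense.
        dJ = np.linalg.lstsq(-cov, target_df, rcond=1e-10)[0]
        J += dJ
    return J
```

With a small `gamma`, each iteration asks only for a fraction of the remaining discrepancy, which keeps the first-order expansion a reasonable approximation; a larger value takes fewer iterations but risks overshooting when the linearization is poor.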
The computational cost of fitting model parameters for the smallest protein in our analysis, PR, on 2 NVIDIA K80 or 4 NVIDIA TitanX GPUs is . The methodology followed in this analysis is almost the same as that used in Flynn et al. (2017) (see Materials and methods) for HIV-1 PR. For a more detailed description of data preprocessing, model inference, and comparison with other methods, we refer the reader to the SI and text of Flynn et al. (2017), Haldane et al. (2016), Haldane et al. (2018), and Haldane and Levy (2019).
A repository containing the inference methodology code is available at https://github.com/ComputationalBiophysicsCollaborative/IvoGPU and the final MSAs are available at https://github.com/ComputationalBiophysicsCollaborative/elife_data (copy archived at https://github.com/elifesciences-publications/elife_data). Appendix 1 contains a figure showing the sequence coverage for RT and tables of the most and least entrenched sequences with the same Hamming distances for some important DRMs in PR, RT, and IN.