Pathway analysis

Angela M Phillips; Katherine R Lawrence; Alief Moulana; Thomas Dupic; Jeffrey Chang; Milo S Johnson; Ivana Cvijovic; Thierry Mora; Aleksandra M Walczak; Michael M Desai

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

Pathway analysis

AP Angela M Phillips

KL Katherine R Lawrence

AM Alief Moulana

TD Thomas Dupic

JC Jeffrey Chang

MJ Milo S Johnson

IC Ivana Cvijovic

TM Thierry Mora

AW Aleksandra M Walczak

MD Michael M Desai

This method is extracted from research article: eLife, Sep 2021

Binding affinity landscapes constrain the evolution of broadly neutralizing anti-influenza antibodies

DOI: 10.7554/eLife.71393

Request a Protocol

Ask a question

Favorite

To study the likelihood of various mutational pathways leading from the germline to the somatic sequence, we must assume a selection model. Selection in germinal centers is considerably more complex than in classical population genetics models, involving spatial structure, changing population sizes, and T-cell-mediated selection, among other factors (Mesin et al., 2016). Capturing these aspects in quantitative models is an active field of research (Amitai et al., 2017). However, here we wish to adopt an extremely simple model of selection as a first step in understanding the impacts of the binding affinity landscape on antibody selection, with the goal of understanding the implications of the expectation that mutational steps become more probable as their effect on binding affinity becomes more positive. Combining the more realistic models of immune selection with our detailed characterization of mutational effects on antigen binding affinity remains an interesting avenue for future work.

Here, we restrict to the weak-mutation regime where mutation fixation events occur independently of one another. Selection proceeds as a Markov process, where the population is characterized by a single sequence that acquires a single mutation at each discrete step (McCandlish, 2011). We choose a simple form for the fixation probability of a mutation from sequence $s$ to sequence $t$ , as discussed below. This then determines the transition probability for the population to move from $s$ to $t$ . We assume that sequences cannot back-mutate (i.e. a residue changing from the somatic allele to the germline allele), and do not acquire multiple mutations in the same step. The absence of back-mutation is justified by the relatively large number of possible mutation sites compared to the total number of mutation events.

We define the transition probability of a single mutational step from the classical fixation probability for a mutation with selection coefficient $σ$ in a population of size $N$ (Kimura, 1962):

Here, we define the selection coefficient $σ$ to be proportional to the difference in log binding affinities to a particular antigen between the two sequences $s$ and $t$ :

This model has two tunable parameters: $N$ represents the effective population size and $γ$ represents how strongly differences in binding affinity impact fitness. We chose three parameter values to span a range of selection strengths (see Figure 5—figure supplement 1): moderate, with $N = 1000$ and $γ = 1$ ; weak, with $N = 20$ and $γ = 0.5$ ; and strong, with $N \to \infty$ and $γ \to \infty$ such that $p_{step}$ reduces to a step function (one if $Δ > 0$ and 0 otherwise). These three models all show similar results, with differences between selection scenarios becoming more exaggerated with stronger selection and less exaggerated with weaker selection, as expected (see Figure 5—figure supplement 1).

From the fixation probabilities for a given parameter regime, we have the transition probability (up to a constant factor) for all sequences $s, t$ over all antigens ag,

which we use for all the calculations described below for results presented in Figure 5 and supplements.

It is particularly useful to store the probabilities $P_{s, t}^{ag}$ as (sparse) transition matrices $P^{ag}$ of dimension $2^{N} \times 2^{N}$ for each antigen, where entries are nonzero only where sequence $t$ has one more somatic mutation than $s$ .

First, we wish to obtain a measure of total probability for a particular antigen scenario, as shown in Figure 5E,F. We calculate this by computing the matrix product over all mutational steps $i$ for a particular sequence of antigen contexts ${{ag}_{1}, \dots, {ag}_{L}}$ :

where ${[\cdot]}_{s, s^{'}}$ corresponds to taking the matrix element in the row corresponding to variant $s$ and column corresponding to variant $s^{'}$ . In the right-most term, the products are matrix operations and $s_{g}$ , $s_{s}$ are respectively the indices of the germline and somatic variants.

We note that the transition probabilities $P^{{ag}_{i}}$ are not normalized at each step. In practice, this means that mutations are optional: many outcomes will not reach the somatic sequence and the likelihood encodes the probability of reaching the somatic state. This makes it possible to compare different scenarios, as some scenarios are more likely than others to reach the somatic state. However, because these values do not represent true probabilities — the units are arbitrary — they cannot be compared between antibodies or between selection models. The exception is for the strong scenario, where the total probability for each path is one if all steps are uphill ( $Δ_{s, t}^{ag} > 0$ ) and 0 otherwise. Thus, here $𝒫_{t o t}$ has a natural interpretation as the total number of uphill paths. When we present results from the strong model (Figure 5C,D, Figure 5—figure supplement 1, and numbers of uphill paths for H1-only scenarios as discussed in the text), we represent uphill path numbers on a linear scale without log-transforming.

Although there are many possible antigen exposure scenarios, we restrict our analysis to several classes. First, in single-antigen scenarios, all steps $i$ use the same antigen. Second, for sequential scenarios, antigen exposures must occur in non-repeating segments (for example, H1 - H3 - H1 is not allowed), although we consider all possible lengths and orders of segments.

Mixed scenarios are more complicated, as we do not fully understand the nature of B cell interactions with multiple antigens in the same germinal center (Wang et al., 2015; Wang, 2017; Kuraoka et al., 2016). One option is to assume that the B cell engages the antigen for which it has the highest affinity and define $Δ$ by the maximum binding affinity across all possible antigens at each step, but this definition would trivially imply that the mixed scenario has the highest probability. Instead, we choose two alternatives: first, ‘average’ mixed, where we assume the B cell engages all antigens and use the average binding affinity change over all three (for CR9114) or two (for CR6261) antigens, $Δ_{mixed} = \frac{1}{N_{ag}} \sum_{ag} Δ_{ag}$ ; and second, ‘random’ mixed, where we assume the B cell randomly engages a single antigen and hence the antigen at each mutational step is chosen randomly. For the latter definition, we calculate $𝒫_{t o t}$ as described above for 1000 randomly drawn scenarios and average the resulting log probability. When we illustrate mutational paths and mutation orders, we choose a representative scenario (with close to median probability) from the 1000 random draws.

We estimate the error of these probabilities by bootstrapping. Specifically, for 10 bootstrap iterations, we resample each binding affinity $- \log_{10} K_{D, s}^{ag}$ from a normal distribution according to its value and standard deviation. We then recalculate the total probability $𝒫_{t o t}$ , average over the 10 values to obtain mean and s.e.m. values, and transform by the natural log for plotting, as shown in Figure 5 and Figure 5—figure supplement 1. We note that for the strong selection scenario (where probabilities represent total numbers of uphill paths), values are not log-transformed, and many scenarios have total path numbers of exactly zero. We refrain from studying the ‘average’ mixed scenario for strong selection because it is essentially equivalent to choosing the antigen with maximum improvement: the quantitative effect of averaging is undone when the transition probability is binarized. For CR6261, all mutations at the first mutational step are neutral (with the exception of one mutation that improves affinity for H1 only), and so we allow all mutations with equal probability for the first step in the strong selection model.

Next, to identify the most likely paths under a given exposure scenario, we reframe this Markov process as a directed weighted graph. Each sequence $s$ is a node, and a directed edge exists toward all sequences $t$ that can be reached by one additional somatic mutation. The edge weight is calculated from the transition probability, $w_{s \to t} = - \log (P_{s, t}^{ag} + ϵ)$ , where $ϵ$ is an extremely small value to ensure weights are finite. In this graph framework, we can use fast algorithms to obtain the ‘shortest’ paths from the germline to the somatic node (those for which the sum of weights is lowest, that is the total probability is highest). Specifically, we use the shortest_simple_paths function from the Python package networkx (Hagberg et al., 2008) to compute the $k$ shortest paths, as shown in Figure 5G,H. This method is exact and uses the algorithm described in Yen, 1971.

Next, we wish to obtain the probability that a mutation at site $m$ happened at a specific step $j$ (Figure 5I,J). As we are focusing on one antigen context, we can normalize the transition matrices and define:

if $P_{s, t}^{ag} \neq 0$ and 0 otherwise. We can further restrict the transition matrix at step $j$ , ${\tilde{P}}^{{ag}_{j}}$ , to have nonzero the mutation that occurs is at a particular residue $α$ , ${\tilde{P}}_{α}^{{ag}_{j}}$ . The total relative probability for that site at that mutational step under an antigen exposure scenario is then

where, again, products are matrix operations. Because a sequence of $L$ steps starting from the germline can only lead to the somatic state, $\tilde{P}$ verifies ${[\prod_{i = 1}^{L} {\tilde{P}}^{{ag}_{i}}]}_{s_{g}, s_{s}} = 1$ . With the relation $\sum_{α} {\tilde{P}}_{α}^{{ag}_{j}} = {\tilde{P}}^{{ag}_{j}}$ this implies that these probabilities are already normalized: $\sum_{α} 𝒫_{j, α} = 1$ .

Finally, we wish to determine the total probability of each variant (Figure 5—figure supplement 2), that is the sum of probabilities of all paths passing through that variant, for a given selection scenario. For a variant $s$ that contains $j$ somatic mutations, we calculate

where the first term is the probability of reaching sequence $s$ at mutational step $j$ , and the second term is the probability of reaching the somatic sequence after passing through sequence $s$ . When representing this number we add an additional normalisation factor, $𝒫_{s}^{'} = 𝒫_{s} \times n_{j}$ , where $n_{j} = (\binom{L}{j})$ is the number of sequences with $j$ mutations, so that variants with different numbers of mutations have comparable values. $𝒫_{s}^{'}$ thus represents the ratio of the probability in a selective model to the probability in a neutral model (which is $1 / n_{j})$ . Thus, sequences with $\log_{10} (𝒫_{s}^{'}) > 0$ are favored by the given selection scenario, and those with $\log_{10} (𝒫_{s}^{'}) < 0$ are disfavored, as shown in Figure 5—figure supplement 2 for moderate selection under the optimal sequential scenario.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol