To study the likelihood of various mutational pathways leading from the germline to the somatic sequence, we must assume a selection model. Selection in germinal centers is considerably more complex than in classical population genetics models, involving spatial structure, changing population sizes, and T-cell-mediated selection, among other factors (Mesin et al., 2016). Capturing these aspects in quantitative models is an active field of research (Amitai et al., 2017). However, here we wish to adopt an extremely simple model of selection as a first step in understanding the impacts of the binding affinity landscape on antibody selection, with the goal of understanding the implications of the expectation that mutational steps become more probable as their effect on binding affinity becomes more positive. Combining the more realistic models of immune selection with our detailed characterization of mutational effects on antigen binding affinity remains an interesting avenue for future work.
Here, we restrict to the weak-mutation regime where mutation fixation events occur independently of one another. Selection proceeds as a Markov process, where the population is characterized by a single sequence that acquires a single mutation at each discrete step (McCandlish, 2011). We choose a simple form for the fixation probability of a mutation from sequence to sequence , as discussed below. This then determines the transition probability for the population to move from to . We assume that sequences cannot back-mutate (i.e. a residue changing from the somatic allele to the germline allele), and do not acquire multiple mutations in the same step. The absence of back-mutation is justified by the relatively large number of possible mutation sites compared to the total number of mutation events.
We define the transition probability of a single mutational step from the classical fixation probability for a mutation with selection coefficient in a population of size (Kimura, 1962):
Here, we define the selection coefficient to be proportional to the difference in log binding affinities to a particular antigen between the two sequences and :
This model has two tunable parameters: represents the effective population size and represents how strongly differences in binding affinity impact fitness. We chose three parameter values to span a range of selection strengths (see Figure 5āfigure supplement 1): moderate, with and ; weak, with and ; and strong, with and such that reduces to a step function (one if and 0 otherwise). These three models all show similar results, with differences between selection scenarios becoming more exaggerated with stronger selection and less exaggerated with weaker selection, as expected (see Figure 5āfigure supplement 1).
From the fixation probabilities for a given parameter regime, we have the transition probability (up to a constant factor) for all sequences over all antigens ag,
which we use for all the calculations described below for results presented in Figure 5 and supplements.
It is particularly useful to store the probabilities as (sparse) transition matrices of dimension for each antigen, where entries are nonzero only where sequence has one more somatic mutation than .
First, we wish to obtain a measure of total probability for a particular antigen scenario, as shown in Figure 5E,F. We calculate this by computing the matrix product over all mutational steps for a particular sequence of antigen contexts :
where corresponds to taking the matrix element in the row corresponding to variant and column corresponding to variant . In the right-most term, the products are matrix operations and , are respectively the indices of the germline and somatic variants.
We note that the transition probabilities are not normalized at each step. In practice, this means that mutations are optional: many outcomes will not reach the somatic sequence and the likelihood encodes the probability of reaching the somatic state. This makes it possible to compare different scenarios, as some scenarios are more likely than others to reach the somatic state. However, because these values do not represent true probabilities ā the units are arbitrary ā they cannot be compared between antibodies or between selection models. The exception is for the strong scenario, where the total probability for each path is one if all steps are uphill () and 0 otherwise. Thus, here has a natural interpretation as the total number of uphill paths. When we present results from the strong model (Figure 5C,D, Figure 5āfigure supplement 1, and numbers of uphill paths for H1-only scenarios as discussed in the text), we represent uphill path numbers on a linear scale without log-transforming.
Although there are many possible antigen exposure scenarios, we restrict our analysis to several classes. First, in single-antigen scenarios, all steps use the same antigen. Second, for sequential scenarios, antigen exposures must occur in non-repeating segments (for example, H1 - H3 - H1 is not allowed), although we consider all possible lengths and orders of segments.
Mixed scenarios are more complicated, as we do not fully understand the nature of B cell interactions with multiple antigens in the same germinal center (Wang et al., 2015; Wang, 2017; Kuraoka et al., 2016). One option is to assume that the B cell engages the antigen for which it has the highest affinity and define by the maximum binding affinity across all possible antigens at each step, but this definition would trivially imply that the mixed scenario has the highest probability. Instead, we choose two alternatives: first, āaverageā mixed, where we assume the B cell engages all antigens and use the average binding affinity change over all three (for CR9114) or two (for CR6261) antigens, ; and second, ārandomā mixed, where we assume the B cell randomly engages a single antigen and hence the antigen at each mutational step is chosen randomly. For the latter definition, we calculate as described above for 1000 randomly drawn scenarios and average the resulting log probability. When we illustrate mutational paths and mutation orders, we choose a representative scenario (with close to median probability) from the 1000 random draws.
We estimate the error of these probabilities by bootstrapping. Specifically, for 10 bootstrap iterations, we resample each binding affinity from a normal distribution according to its value and standard deviation. We then recalculate the total probability , average over the 10 values to obtain mean and s.e.m. values, and transform by the natural log for plotting, as shown in Figure 5 and Figure 5āfigure supplement 1. We note that for the strong selection scenario (where probabilities represent total numbers of uphill paths), values are not log-transformed, and many scenarios have total path numbers of exactly zero. We refrain from studying the āaverageā mixed scenario for strong selection because it is essentially equivalent to choosing the antigen with maximum improvement: the quantitative effect of averaging is undone when the transition probability is binarized. For CR6261, all mutations at the first mutational step are neutral (with the exception of one mutation that improves affinity for H1 only), and so we allow all mutations with equal probability for the first step in the strong selection model.
Next, to identify the most likely paths under a given exposure scenario, we reframe this Markov process as a directed weighted graph. Each sequence is a node, and a directed edge exists toward all sequences that can be reached by one additional somatic mutation. The edge weight is calculated from the transition probability, , where is an extremely small value to ensure weights are finite. In this graph framework, we can use fast algorithms to obtain the āshortestā paths from the germline to the somatic node (those for which the sum of weights is lowest, thatĀ is the total probability is highest). Specifically, we use the shortest_simple_paths function from the Python package networkx (Hagberg et al., 2008) to compute the shortest paths, as shown in Figure 5G,H. This method is exact and uses the algorithm described in Yen, 1971.
Next, we wish to obtain the probability that a mutation at site happened at a specific step (Figure 5I,J). As we are focusing on one antigen context, we can normalize the transition matrices and define:
if and 0Ā otherwise. We can further restrict the transition matrix at step , , to have nonzero the mutation that occurs is at a particular residue , . The total relative probability for that site at that mutational step under an antigen exposure scenario is then
where, again, products are matrix operations. Because a sequence of steps starting from the germline can only lead to the somatic state, verifies . With the relation this implies that these probabilities are already normalized: .
Finally, we wish to determine the total probability of each variant (Figure 5āfigure supplement 2), that isĀ the sum of probabilities of all paths passing through that variant, for a given selection scenario. For a variant that contains somatic mutations, we calculate
where the first term is the probability of reaching sequence at mutational step , and the second term is the probability of reaching the somatic sequence after passing through sequence . When representing this number we add an additional normalisation factor, , where is the number of sequences with mutations, so that variants with different numbers of mutations have comparable values. thus represents the ratio of the probability in a selective model to the probability in a neutral model (which is . Thus, sequences with are favored by the given selection scenario, and those with are disfavored, as shown in Figure 5āfigure supplement 2 for moderate selection under the optimal sequential scenario.
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.
Tips for asking effective questions
+ Description
Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.