Integrated and Modular Pipeline for Antibody Repertoire Simulation

Xiujia Yang, Yan Zhu, Sen Chen, Huikun Zeng, Junjie Guan, Qilong Wang, Chunhong Lan, Deqiang Sun, Xueqing Yu, Zhenhai Zhang

IMPlAntS (Integrated and Modular Pipeline for Antibody Repertoire Simulation) was developed to mimic real-world antibody repertoires as closely as possible and to meet the requirements of this study (i.e., minor allele frequency control and NGS data simulation).

As mentioned in Results, IMPlAntS consists of three consecutive steps: i) generation of independent V(D)J rearrangements; ii) generation of BCRs carrying SHMs with a proper phylogenetic structure within clones; and iii) generation of NGS reads incorporating base errors (Figure 2). These steps can be run individually or together using the corresponding scripts hosted on GitHub (https://github.com/Xiujia-Yang/IMPlAntS).

In the first step, a customizable number of independent rearranged sequences are simulated in silico by considering two major features of the real-world rearrangement repertoire: preferential gene usage and junctional nucleotide modification (P and N nucleotide insertions, and deletions). To investigate the influence of allelic diversity on novel allele identification, we equipped IMPlAntS with the ability to simulate the alleles of a given gene at varied ratios (only two alleles per gene are supported), which can be customized by modifying the gene usage configuration file. IMPlAntS also allows the simulation of nonproductive rearrangements, and users can specify their percentage in the antibody repertoire for specific aims. Notably, the four simulated datasets in this study include only productive rearrangements.
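The first step can be illustrated with a minimal sketch, which is not the IMPlAntS implementation. It assumes a toy gene usage table, a two-allele ratio for one gene, and uniformly drawn deletion and N-insertion lengths, whereas the pipeline draws these lengths from empirical distributions. All names (`GENE_USAGE`, `ALLELE_RATIO`, `pick_v_allele`, `modify_v_end`) and the placeholder sequences are hypothetical.

```python
import random

# Hypothetical toy inputs: the weights and sequences are placeholders,
# not real usage values or germline sequences.
GENE_USAGE = {"IGHV1-2": 0.7, "IGHV3-23": 0.3}
ALLELE_RATIO = {"IGHV1-2": {"*01": 0.8, "*02": 0.2}}  # at most two alleles per gene
GERMLINE = {
    "IGHV1-2*01": "CAGGTGCAGCTGGTGCAG",
    "IGHV1-2*02": "CAGGTGCAGCTGGTGCAA",
    "IGHV3-23*01": "GAGGTGCAGCTGTTGGAG",
}

def pick_weighted(weights, rng):
    """Draw one key of `weights` with probability proportional to its value."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

def pick_v_allele(rng):
    """Choose a gene by usage, then one of its alleles by the configured ratio."""
    gene = pick_weighted(GENE_USAGE, rng)
    alleles = ALLELE_RATIO.get(gene, {"*01": 1.0})  # single allele if no ratio is given
    return gene + pick_weighted(alleles, rng)

def modify_v_end(seq, rng, max_del=3, max_n=4):
    """Trim the 3' end of V and append nontemplated (N) nucleotides.

    Lengths are uniform here (an assumption made for brevity).
    """
    n_del = rng.randint(0, max_del)
    trimmed = seq[:len(seq) - n_del]
    n_ins = "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_n)))
    return trimmed + n_ins

rng = random.Random(0)
allele = pick_v_allele(rng)
v_part = modify_v_end(GERMLINE[allele], rng)
```

A full rearrangement would repeat the same pattern for the D and J segments and for P nucleotides. Passing a seeded `random.Random` keeps simulated repertoires reproducible.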

The second step can be further divided into two stages: generation of clonally related sequences with a proper phylogenetic structure, and expansion of each sequence to a varying number of copies. Clonally related sequences are created over a set number of iterations (to mimic the affinity maturation of real-world antibody sequences), in which SHMs are introduced across the variable region of randomly selected sequences based on positional mutability and substitution models similar to those of Yermanos et al. (13). In each iteration, a fraction of the sequences in the current pool is randomly selected for SHM simulation, and the newly mutated sequences are added to the pool, which is subjected to random selection again in the next iteration. The independent rearranged sequences serve as the input to the first iteration. Because the positional mutability model stores mutation probabilities for different positions as observed in end repertoires (repertoires containing sequences that have already undergone multiple rounds of maturation), a parameter named '--mut_ability_fold' (less than 1) is introduced to prevent the generation of hypermutated sequences after a number of iterations. These iterations produce nonredundant clonally related sequences. Selected sequences are then replicated according to a power law (20), giving each a varying number of copies to mimic the clonal expansion of B cells. The key parameters in this step, including the number of iterations, the maximum number of sequences, the alpha value of the power law, and the largest size of sequences, are customizable.

In the last step, ART is employed to produce NGS data with Illumina MiSeq system settings.
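The iterative SHM stage and the power-law expansion can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: substitutions are drawn uniformly instead of from a substitution model, every sequence in the pool receives a copy number, and "largest size" is read as the maximum copy number. The function names and arguments (`mutate`, `grow_clone`, `power_law_sizes`, `fold`, `fraction`) are hypothetical; `fold` plays the role described for the mutability fold parameter.

```python
import random

def mutate(seq, mutability, fold, rng):
    """Substitute each position with probability mutability[i] * fold."""
    out = []
    for i, base in enumerate(seq):
        if rng.random() < mutability[i] * fold:
            # Uniform choice among the other three bases (assumption).
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def grow_clone(root, mutability, n_iter, fraction, fold, max_seqs, rng):
    """Grow a nonredundant pool of clonally related sequences from one root."""
    pool = [root]
    seen = {root}
    for _ in range(n_iter):
        k = max(1, int(len(pool) * fraction))
        # Parents are drawn from the pool as it stood at the start of the iteration.
        for parent in rng.sample(pool, k):
            child = mutate(parent, mutability, fold, rng)
            if child not in seen and len(pool) < max_seqs:
                seen.add(child)
                pool.append(child)
    return pool

def power_law_sizes(n, alpha, max_size, rng):
    """Draw n copy numbers with P(s) proportional to s**(-alpha), 1 <= s <= max_size."""
    sizes = list(range(1, max_size + 1))
    weights = [s ** (-alpha) for s in sizes]
    return rng.choices(sizes, weights=weights, k=n)

rng = random.Random(1)
root = "ACGT" * 10
mutability = [0.2] * len(root)  # placeholder per-position mutation probabilities
pool = grow_clone(root, mutability, n_iter=5, fraction=0.5, fold=0.5,
                  max_seqs=20, rng=rng)
copies = power_law_sizes(len(pool), alpha=2.0, max_size=50, rng=rng)
```

Because children are mutated from sequences already in the pool, mutations accumulate along lineages, which is what gives the clone its tree-like structure. Scaling the per-position probabilities by a `fold` below 1 slows that accumulation across iterations.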

For the above steps, the parameters for gene usage, junctional modification, and the positional mutability and substitution models were obtained from a population-level antibody repertoire study (14) and are set as the defaults of IMPlAntS. Gene usage is calculated as the percentage of clones (sets of sequences sharing the same V and J gene and CDR3 nucleotide sequence) in a repertoire that were recombined from a given gene. In this study, V, D and J gene usages are taken from the normalized medians of gene usages across 2152 antibody repertoires from 582 donors. Junctional modification parameters consist of 10 entities: V3D, V3P, N1, D5D, D5P, D3D, D3P, N2, J5D and J5P (D, deleted nucleotide; P, palindromic nucleotide; N, nontemplated nucleotide), as also shown in Figure 2. The probabilities of modification lengths for each of these entities are derived from the pooled observations of the same 2152 antibody repertoires from 582 donors. The positional mutability and substitution models were obtained from IgG repertoires of PBMCs from 353 healthy donors. All of these parameters can be found in the GitHub repository. Supplementary Figure 5 and Supplementary Figure 3B show how closely the repertoires in the four simulated datasets approximate real-world repertoires.
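The gene usage definition above can be made concrete with a short sketch. It assumes each sequence is annotated as a (V gene, J gene, CDR3 nucleotide sequence) tuple. It also assumes that "normalized median" means the per-gene median across repertoires, rescaled to sum to 100%, which is one plausible reading and not a confirmed detail of the study. The function names are hypothetical.

```python
from collections import Counter
from statistics import median

def v_gene_usage(records):
    """Percentage of clones per V gene.

    records: iterable of (v_gene, j_gene, cdr3_nt) tuples, one per sequence.
    Sequences sharing V, J and CDR3 collapse into a single clone.
    """
    clones = set(records)
    if not clones:
        return {}
    counts = Counter(v for v, _, _ in clones)
    return {gene: 100.0 * n / len(clones) for gene, n in counts.items()}

def normalized_median_usage(usages):
    """Per-gene median across repertoires, rescaled to sum to 100 (assumed meaning)."""
    genes = set().union(*usages)
    med = {g: median(u.get(g, 0.0) for u in usages) for g in genes}
    total = sum(med.values())
    if total == 0:
        return {g: 0.0 for g in genes}
    return {g: 100.0 * m / total for g, m in med.items()}

records = [
    ("V1", "J4", "AAA"), ("V1", "J4", "AAA"),  # same clone, counted once
    ("V1", "J6", "AAA"),
    ("V2", "J4", "CCC"),
    ("V3", "J4", "GGG"),
]
usage = v_gene_usage(records)
```

Counting clones instead of raw sequences keeps a single heavily expanded clone from inflating the usage of its gene.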
