Prediction of PRDM9 binding sites and SFS construction

Procedure

To predict PRDM9-A binding sites, we obtained empirical position weight matrices (PWMs) from (2). That paper presents a number of different motifs but, following its authors, we used only the motifs Human1, …, Human7 as true PRDM9-A binding motifs. For PRDM9-C binding sites, we obtained the PWMs from (62). Each PWM describes the probability $p_X(\ell)$ of observing nucleotide $X \in \{A, C, G, T\}$ at each position $\ell$ in the motif. To determine a cutoff for calling a sequence as a match to a given binding motif, we generated 10,000,000 random nucleotide sequences by sampling each position independently, drawing A and T with probability 0.3 each and C and G with probability 0.2 each, which approximately matches the marginal distribution of nucleotides in the human genome. We then computed the log-likelihood, $\log L$, of each sequence as $\log L(i) := \sum_{\ell=1}^{M} \log\left[p_{X_\ell^{(i)}}(\ell)\right]$, where $X_\ell^{(i)}$ is the nucleotide at position $\ell$ in simulation $i$ and $M$ is the length of the motif. We chose the 9,999,990th log-likelihood in ascending order as the cutoff for calling a motif, so that only 10 of the 10,000,000 simulated sequences score higher. This corresponds to an approximate P value of $1 \times 10^{-6}$.
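The scoring and cutoff steps can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the list-of-dicts PWM representation, and the toy three-position PWM are all assumptions made for illustration, and the sketch assumes every PWM entry is strictly positive so that the logarithm is defined. The simulation size is a parameter, so the example runs at a much smaller scale than the 10,000,000 sequences used in the text.

```python
import math
import random

NUCLEOTIDES = "ACGT"
# Background frequencies from the text: A and T at 0.3 each, C and G at 0.2 each.
BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}


def log_likelihood(seq, pwm):
    """Sum log p_X(l) over motif positions.

    `pwm` is a list with one dict per position mapping nucleotide -> probability
    (an assumed representation; entries must be > 0).
    """
    return sum(math.log(col[base]) for col, base in zip(pwm, seq))


def empirical_cutoff(pwm, n_sims, n_exceed, rng):
    """Simulate `n_sims` background sequences and return a score cutoff.

    The returned value is the score at rank (n_sims - n_exceed) in ascending
    order, so at most `n_exceed` simulated scores are strictly larger.
    """
    weights = [BACKGROUND[b] for b in NUCLEOTIDES]
    scores = []
    for _ in range(n_sims):
        seq = rng.choices(NUCLEOTIDES, weights=weights, k=len(pwm))
        scores.append(log_likelihood(seq, pwm))
    scores.sort()
    return scores[n_sims - n_exceed - 1]


# Hypothetical toy PWM favouring the sequence "ACG" (not one of the real motifs).
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
# Scaled-down run; the text corresponds to n_sims=10_000_000 and n_exceed=10.
toy_cutoff = empirical_cutoff(toy_pwm, n_sims=10_000, n_exceed=10, rng=random.Random(0))
```

With `n_sims=10_000_000` and `n_exceed=10`, the returned value is the 9,999,990th score in ascending order. A short toy motif such as this one produces many tied scores, so the cutoff is only meaningful for motifs long enough to give a fine-grained score distribution.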

We then called PRDM9-A alleles in each haploid sequence in the 1KG dataset on the hg38 genome build as follows. We considered only diallelic SNPs at which all individuals had reported genotypes. Sites with more than two alleles, and structural variants, were treated as missing. Individuals were treated as carrying the reference allele at all other positions. Then, starting at the first base in the genome, we computed the log-likelihood, as above, of each motif (or its reverse complement) starting at that position, reported any log-likelihood greater than the empirical cutoff for that motif, and then moved to the next base and repeated. We skipped any starting position at which any motif overlapped a missing position. Rather than performing this computation for each haploid sequence individually, we constructed all of the unique haplotypes in the dataset spanning the region from the starting position to the end of the longest motif and computed the log-likelihood of each motif only on these unique haplotypes.
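The scan over unique haplotypes might look like the following Python sketch. It is an illustration under stated assumptions rather than the authors' pipeline: haplotypes are represented as equal-length strings, missing sites are marked with `N` in every haplotype, only starting positions where the longest motif fits are scanned, and all function names are hypothetical. Scoring the reverse complement of a window against a PWM is equivalent to scoring the window against the reverse-complemented PWM.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def log_likelihood(seq, pwm):
    """Sum log p_X(l) over motif positions; `pwm` is a list of per-position dicts."""
    return sum(math.log(col[base]) for col, base in zip(pwm, seq))


def motif_matches(window, pwm, cutoff):
    """True if the motif, on either strand, scores above `cutoff` at the window start."""
    site = window[:len(pwm)]
    return (log_likelihood(site, pwm) > cutoff
            or log_likelihood(reverse_complement(site), pwm) > cutoff)


def call_binding(haplotypes, pwms, cutoffs):
    """Return one entry per start position: a list of per-haplotype booleans,
    or None if the position is skipped because a motif would overlap a missing site."""
    max_len = max(len(p) for p in pwms)
    calls = []
    for start in range(len(haplotypes[0]) - max_len + 1):
        windows = [h[start:start + max_len] for h in haplotypes]
        if any("N" in w for w in windows):
            calls.append(None)
            continue
        # Score each distinct local haplotype once, then map back to individuals.
        called = {
            w: any(motif_matches(w, p, c) for p, c in zip(pwms, cutoffs))
            for w in set(windows)
        }
        calls.append([called[w] for w in windows])
    return calls


# Hypothetical two-position PWM favouring "AC", with an arbitrary illustrative cutoff.
toy_pwm = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01},
]
toy_calls = call_binding(["ACGT", "AGGT", "ACGT"], [toy_pwm], [math.log(0.5)])
```

In this toy run the last starting position is called as binding in all three haplotypes because `GT` is the reverse complement of `AC`. Caching by distinct window is what keeps the cost proportional to the number of unique local haplotypes rather than the number of individuals.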

To construct the PRDM9-A binding site SFS, we took these calls and identified starting positions at which some individuals were called as matching one of the PRDM9-A binding motifs while other individuals were not predicted to bind any PRDM9-A motif. We then treated binding and nonbinding as the two alleles and constructed a standard diallelic SFS. We also constructed an SFS for each population by restricting to sites whose recombination rate, as inferred in that population, fell within a given range. To ensure that our results were due to PRDM9 binding and not to other factors, such as the GC content of the motifs, we also repeated the above procedure with shuffled PWMs obtained by randomly permuting the positions of each PWM.
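As a rough illustration of these two steps, the Python sketch below tallies an SFS from per-position binding calls and builds a shuffled control PWM. The input format (one list of per-haplotype booleans per starting position, with `None` for skipped positions), the choice to count copies of the binding allele, and the function names are all assumptions; the text does not specify which allele is counted.

```python
import random
from collections import Counter


def binding_sfs(calls):
    """Tally a diallelic SFS from binding calls.

    `calls` holds, for each starting position, a list of booleans (True =
    haplotype predicted to bind) or None for skipped positions. Only positions
    where binding and nonbinding haplotypes both occur are counted. The result
    maps the number of binding haplotypes to the number of such positions.
    """
    sfs = Counter()
    for site in calls:
        if site is None:
            continue
        n_binding = sum(site)
        if 0 < n_binding < len(site):
            sfs[n_binding] += 1
    return sfs


def shuffle_pwm(pwm, rng):
    """Return a copy of `pwm` with its positions (columns) randomly permuted.

    Permuting columns keeps the overall base composition of the motif
    while disrupting the specific sequence it recognises.
    """
    shuffled = list(pwm)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical calls for five starting positions across three haplotypes.
toy_calls = [
    [True, False, True],    # segregating: 2 binding copies
    [False, False, False],  # no haplotype binds: not counted
    [True, True, True],     # all haplotypes bind: not counted
    None,                   # skipped position
    [False, True, False],   # segregating: 1 binding copy
]
toy_sfs = binding_sfs(toy_calls)
```

The control analysis would then rerun the cutoff estimation and genome scan with `shuffle_pwm(pwm, rng)` in place of each real PWM before tallying the SFS in the same way.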
