Unphased reads assignment through unique k-mer similarity analysis

Can Luo
Yichen Henry Liu
Xin Maizie Zhou

To accurately assign unphased reads to their corresponding haplotypes, VolcanoSV uses a cost-efficient approach based on unique k-mer similarity (Fig. 1). The approach relies on the fact that each haplotype within a phase block carries a distinct set of mutation events (SNPs, small indels, and SVs), and these events make it possible to extract a haplotype-specific unique k-mer set that represents the haplotype. VolcanoSV can then assign an unphased read to a specific haplotype by comparing the read with nearby haplotypes on the basis of unique k-mer set similarity. VolcanoSV uses the phased reads of every two adjacent phase blocks to define unique k-mers and to extract haplotype-specific unique k-mer sets for all four haplotypes. A unique k-mer is one that appears in only one of the four haplotypes. For each unphased read, VolcanoSV also extracts k-mers and then quantifies the percentage of its unique k-mers that are assigned to each haplotype of the two adjacent phase blocks. If an unphased read is drawn from one specific haplotype, a high correspondence between the read and that haplotype is expected. VolcanoSV applies an empirical distribution quantile-based significance test to the four calculated percentages and then assigns each unphased read. If a read cannot be assigned because the significance test does not pass the criterion, VolcanoSV assumes the read to be drawn from both haplotypes and assigns it to both haplotypes of its nearest phase block. At the end of this module, the unphased reads are partitioned to the corresponding haplotype of each phase block. The detailed methods are described as follows:

Firstly, VolcanoSV assigns each unphased read to its candidate phase blocks. The criteria for determining the candidate phase block are as follows:

As a result, every unphased read is assigned to at least one phase block and two PS_HPs. “PS_HP” refers to one haplotype of a phase block in the following context.

Secondly, to determine which PS_HP (haplotype of a phase block) the unphased read is drawn from, a unique k-mer similarity-based analysis is performed. VolcanoSV first collects a raw k-mer set for every candidate PS_HP. The raw k-mer set of a PS_HP is the union of the k-mers from all phased reads belonging to this PS_HP. When collecting k-mers, the k-mer length is set to 12 and the step size is set to 1 by default. Next, VolcanoSV creates "fingerprint" (unique) k-mer sets that are exclusive to each PS_HP. The fingerprint k-mer set of a PS_HP is defined as the intersection between the raw k-mer set of this PS_HP and the symmetric difference among all candidate PS_HPs. For example, if an unphased read has 4 candidate PS_HPs whose raw k-mer sets are R1, R2, R3, and R4, then the symmetric difference among them is

The fingerprint (unique) k-mer set of each PS_HP is defined as follows

Denoting the k-mer set of the unphased read as S, the unique k-mer similarity metric between the unphased read and each of the four candidate PS_HPs is then defined as the size of the intersection between S and the fingerprint k-mer set of that PS_HP.
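The steps described so far (raw k-mer sets, fingerprint sets, and similarity counts) can be sketched in Python as below. This is a minimal illustration, not VolcanoSV's actual code: the function names and the dictionary layout are hypothetical, and "unique" is taken in the sense stated in the overview, i.e. a k-mer that appears in exactly one candidate PS_HP. Occurrences are counted explicitly because chaining Python's set operator `^` over four sets would also keep k-mers present in three of them.

```python
from collections import Counter

def kmers(seq, k=12, step=1):
    """Return the set of k-mers of `seq`, sliding the window by `step`."""
    return {seq[i:i + k] for i in range(0, len(seq) - k + 1, step)}

def raw_kmer_set(reads, k=12, step=1):
    """Union of the k-mers from all phased reads of one PS_HP."""
    out = set()
    for read in reads:
        out |= kmers(read, k, step)
    return out

def fingerprint_sets(raw_sets):
    """raw_sets: dict mapping a PS_HP label to its raw k-mer set.

    Assumption: a fingerprint k-mer is one found in exactly one raw set.
    """
    counts = Counter(km for s in raw_sets.values() for km in s)
    return {name: {km for km in s if counts[km] == 1}
            for name, s in raw_sets.items()}

def similarity(read_seq, fingerprints, k=12, step=1):
    """Size of the intersection between the read's k-mer set S and each fingerprint set."""
    s = kmers(read_seq, k, step)
    return {name: len(s & f) for name, f in fingerprints.items()}

# Toy usage with k=3 so the sequences stay short (the default above is k=12).
raw = {
    "PS1_HP1": raw_kmer_set(["ACGTAC"], k=3),
    "PS1_HP2": raw_kmer_set(["ACGTTC"], k=3),
}
fp = fingerprint_sets(raw)
simi = similarity("CGTAC", fp, k=3)
```

In the toy example, the k-mers shared by both haplotypes (ACG, CGT) drop out of the fingerprints, so only the k-mers that cover the differing base contribute to the similarity counts.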

The normalized similarity metrics for the four candidate PS_HPs are calculated as follows
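As a rough illustration of one possible normalization, the sketch below divides each similarity count by the sum over all candidate PS_HPs, so that each value is the fraction of the read's matched unique k-mers attributed to that PS_HP. This choice is an assumption based on the overview's wording ("the percentage of its unique k-mers which are assigned to each haplotype"); the function name `normalize` and the handling of a read with no matches are likewise hypothetical.

```python
def normalize(simi):
    """simi: dict mapping a PS_HP label to its unique k-mer similarity count.

    Assumed normalization: each count divided by the total over all candidates.
    A read with no matching unique k-mers gets 0.0 everywhere (assumption).
    """
    total = sum(simi.values())
    if total == 0:
        return {name: 0.0 for name in simi}
    return {name: count / total for name, count in simi.items()}

norm = normalize({"PS1_HP1": 3, "PS1_HP2": 1, "PS2_HP1": 0, "PS2_HP2": 0})
```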

VolcanoSV repeats this procedure for all unphased reads and their candidate PS_HPs, and collects all normalized similarity metrics into a normalized similarity vector χ. To determine which PS_HP an unphased read is drawn from, VolcanoSV applies an empirical distribution quantile-based significance test to the normalized similarity metrics between unphased reads and candidate PS_HPs. A level r (10% by default) is used, and the cut-off threshold for significance is the (1 − r) quantile of χ, denoted Q1−r(χ). The null hypothesis (H0) posits that the normalized similarity metric between the unphased read and the candidate PS_HP is not significantly different from what would be expected by random chance, i.e., NormSimi ≤ Q1−r(χ). For each unphased read, each normalized similarity metric is compared with the cut-off Q1−r(χ). If the metric exceeds the cut-off, it is considered significant, suggesting a potential association with the corresponding PS_HP, and the read is assigned accordingly. Conversely, if the metric does not exceed the cut-off, the null hypothesis is not rejected for that read and candidate PS_HP combination, implying that there is no significant association and that the observed similarity might be due to random chance. If an unphased read cannot be assigned to any candidate PS_HP based on the significance test, it is partitioned to both haplotypes of its nearest phase block.
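The quantile-based test can be sketched as follows. This is a hedged illustration under several assumptions that the text does not specify: the quantile is computed with linear interpolation between order statistics, a read with more than one significant metric goes to the PS_HP with the largest metric, and haplotypes are labeled 1 and 2. The names `quantile`, `assign_reads`, `norm_simi`, and `nearest_block` are hypothetical.

```python
def quantile(values, q):
    """Empirical quantile with linear interpolation (an assumed convention)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("quantile of an empty sequence")
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def assign_reads(norm_simi, nearest_block, r=0.10):
    """norm_simi: dict read_id -> {(block, hap): normalized similarity}.
    nearest_block: dict read_id -> id of the read's nearest phase block.
    Returns (assignment, cutoff); assignment maps read_id to a list of PS_HPs.
    """
    # Pool every normalized metric of every read into the vector chi.
    chi = [v for per_read in norm_simi.values() for v in per_read.values()]
    cutoff = quantile(chi, 1 - r)
    assignment = {}
    for read_id, per_read in norm_simi.items():
        best, best_val = max(per_read.items(), key=lambda kv: kv[1])
        if best_val > cutoff:
            # Significant: assign to the best-matching PS_HP (assumed tie rule).
            assignment[read_id] = [best]
        else:
            # Not significant: assign to both haplotypes of the nearest block.
            blk = nearest_block[read_id]
            assignment[read_id] = [(blk, 1), (blk, 2)]
    return assignment, cutoff

norm_simi = {
    "read_a": {("b1", 1): 0.9, ("b1", 2): 0.1, ("b2", 1): 0.0, ("b2", 2): 0.0},
    "read_b": {("b1", 1): 0.3, ("b1", 2): 0.3, ("b2", 1): 0.2, ("b2", 2): 0.2},
    "read_c": {("b1", 1): 0.0, ("b1", 2): 0.0, ("b2", 1): 0.05, ("b2", 2): 0.95},
}
nearest_block = {"read_a": "b1", "read_b": "b1", "read_c": "b2"}
assignment, cutoff = assign_reads(norm_simi, nearest_block)
```

In this toy input, read_a and read_c each have one metric above the cut-off and are assigned to a single PS_HP, while read_b has no significant metric and falls back to both haplotypes of its nearest phase block.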
