Unphased reads assignment through unique k-mer similarity analysis

Can Luo; Yichen Henry Liu; Xin Maizie Zhou

This method is extracted from research article: Nat Commun, Aug 2024

VolcanoSV enables accurate and robust structural variant calling in diploid genomes from single-molecule long read sequencing

DOI: 10.1038/s41467-024-51282-0

To accurately assign unphased reads to their corresponding haplotypes, VolcanoSV uses a cost-efficient approach based on unique k-mer similarity (Fig. 1). The approach relies on the fact that each haplotype (within each phase block) carries a distinct set of mutation events (SNPs, small indels, and SVs), and these events make it possible to extract a haplotype-specific unique k-mer set that represents it. VolcanoSV can then assign each unphased read to a specific haplotype by comparing the read against nearby haplotypes in terms of unique k-mer set similarity.

VolcanoSV uses the phased reads of every two adjacent phase blocks to define unique k-mers and to extract haplotype-specific unique k-mer sets for all four haplotypes. Unique k-mers are those that appear in only one of the four haplotypes. For each unphased read, VolcanoSV also extracts k-mers and then quantifies the percentage of its unique k-mers that are assigned to each haplotype of the two adjacent phase blocks. If an unphased read was drawn from one specific haplotype, a high correspondence between the read and that haplotype is expected. VolcanoSV applies an empirical distribution quantile-based significance test to the four calculated percentages and assigns each unphased read accordingly. If none of the percentages passes the significance criterion, VolcanoSV treats the read as drawn from both haplotypes and assigns it to both haplotypes of its nearest phase block. At the end of this module, unphased reads are partitioned to the corresponding haplotype for each phase block. The detailed methods are described as follows:

Firstly, VolcanoSV assigns each unphased read to its candidate phase blocks. The criteria for determining the candidate phase block are as follows:

As a result, every unphased read is assigned to at least one phase block and two PS_HPs. “PS_HP” refers to one haplotype of a phase block in the following context.

Secondly, to determine which PS_HP (haplotype of a phase block) the unphased read is drawn from, a unique k-mer similarity-based analysis is performed. VolcanoSV first collects a raw k-mer set for every candidate PS_HP. The raw k-mer set for a PS_HP is the union of k-mers from all phased reads belonging to this PS_HP. When collecting k-mers, the k-mer length is set to 12 and the step size is set to 1 by default. Next, VolcanoSV creates “fingerprint” (unique) k-mer sets that are exclusive to each PS_HP. The fingerprint k-mer set of a PS_HP is defined as the intersection between the raw k-mer set of this PS_HP and the symmetric difference among all candidate PS_HPs. For example, if an unphased read has 4 candidate PS_HPs with raw k-mer sets R1, R2, R3, and R4, the symmetric difference is taken over these four sets.

The fingerprint (unique) k-mer set of each PS_HP is then the intersection of its raw k-mer set with this symmetric difference.
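Written out, the two set definitions above can be sketched as below. This is one reading, not necessarily the source article's exact formula: the symmetric difference among the four raw sets is taken to mean the k-mers present in exactly one raw set, which is consistent with the definition of unique k-mers as those appearing in only one of the four haplotypes. The symbols $D$ and $U_i$ are notation introduced here for illustration.

```latex
\begin{align}
D   &= \bigl\{\, x \in R_1 \cup R_2 \cup R_3 \cup R_4 \;:\; \bigl|\{\, j : x \in R_j \,\}\bigr| = 1 \,\bigr\} \\
U_i &= R_i \cap D, \qquad i \in \{1, 2, 3, 4\}
\end{align}
```

Under this reading, $U_i$ contains exactly the k-mers of $R_i$ that occur in no other candidate PS_HP, so the four fingerprint sets are pairwise disjoint.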

Denoting the k-mer set of the unphased read as S, the unique k-mer similarity metric between the unphased read and each of the four candidate PS_HPs is then defined as the size of the intersection between S and the fingerprint k-mer set of that PS_HP.
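The steps from raw k-mer collection to the unnormalized similarity metric can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not VolcanoSV's actual code: the function names (`kmers`, `raw_kmer_sets`, `fingerprint_sets`, `similarity`) and the PS_HP labels are hypothetical, reads are treated as plain strings on a single strand, and a fingerprint k-mer is taken to be one that occurs in exactly one candidate raw set. The toy example uses k = 3 for readability; the default described in the text is k = 12 with step size 1.

```python
from collections import Counter
from typing import Dict, List, Set


def kmers(seq: str, k: int = 12, step: int = 1) -> Set[str]:
    """Return the set of k-mers of seq, sliding the window by step."""
    return {seq[i:i + k] for i in range(0, len(seq) - k + 1, step)}


def raw_kmer_sets(phased_reads: Dict[str, List[str]],
                  k: int = 12, step: int = 1) -> Dict[str, Set[str]]:
    """Raw k-mer set per PS_HP: union of k-mers over all its phased reads."""
    raw: Dict[str, Set[str]] = {}
    for ps_hp, reads in phased_reads.items():
        collected: Set[str] = set()
        for read in reads:
            collected |= kmers(read, k, step)
        raw[ps_hp] = collected
    return raw


def fingerprint_sets(raw: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Keep, for each PS_HP, only k-mers seen in no other candidate PS_HP.

    Assumption: "unique" means present in exactly one candidate raw set.
    """
    occurrences = Counter(kmer for s in raw.values() for kmer in s)
    return {ps_hp: {kmer for kmer in s if occurrences[kmer] == 1}
            for ps_hp, s in raw.items()}


def similarity(read: str, fingerprints: Dict[str, Set[str]],
               k: int = 12, step: int = 1) -> Dict[str, int]:
    """Similarity per PS_HP: size of the overlap between the read's k-mer
    set S and that PS_HP's fingerprint set."""
    s = kmers(read, k, step)
    return {ps_hp: len(s & fp) for ps_hp, fp in fingerprints.items()}


# Toy example with two candidate PS_HPs (labels are hypothetical).
phased = {"PS1_HP1": ["ACGTAC"], "PS1_HP2": ["ACGTTT"]}
raw = raw_kmer_sets(phased, k=3)
fingerprints = fingerprint_sets(raw)
sims = similarity("CGTACG", fingerprints, k=3)
print(sims)
```

In the toy example, the k-mers `ACG` and `CGT` are shared by both haplotypes and are therefore dropped from both fingerprint sets; only the haplotype-specific k-mers contribute to the similarity counts.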

The normalized similarity metrics for the four candidate PS_HPs are calculated as follows

VolcanoSV repeats this procedure for all unphased reads and their candidate PS_HPs, collecting all normalized similarity metrics into a normalized similarity vector χ. To finally determine which PS_HP the unphased read is drawn from, VolcanoSV uses an empirical distribution quantile-based significance test to evaluate the normalized similarity metrics between unphased reads and candidate PS_HPs. A level r (10% by default) is used, and the cut-off threshold for significance is the (1 − r) quantile of the normalized similarity vector χ, denoted $Q_{1-r}(\chi)$ (the 90th percentile by default). Metrics exceeding this threshold are considered significant, and reads are assigned accordingly.

The null hypothesis (H0) posits that the normalized similarity metric between the unphased read and the candidate PS_HP is not significantly different from what would be expected by random chance, i.e., $\mathrm{NormSim}_i \leq Q_{1-r}(\chi)$. For each unphased read, its normalized similarity metric is compared to the cut-off $Q_{1-r}(\chi)$. If the metric is higher than the cut-off, it is considered significant, suggesting a potential association with the corresponding PS_HP. Conversely, if a normalized similarity metric for an unphased read does not exceed the cut-off, the null hypothesis is not rejected for that specific read and candidate PS_HP combination, implying that there is no significant association and the observed similarity might be due to random chance. If an unphased read cannot be assigned to any candidate PS_HP based on the significance test, it is partitioned to both haplotypes of its nearest phase block.
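The quantile-based test described above can be illustrated with a short Python sketch. Several details here are assumptions rather than facts from the source: the names `quantile` and `assign_reads` are hypothetical, the quantile is estimated by linear interpolation between order statistics (the source does not state which estimator is used), and when more than one metric of a read exceeds the cut-off the largest one is chosen. The normalized similarity values in the example are made up for illustration.

```python
import math
from typing import Dict, List, Optional, Tuple


def quantile(values: List[float], q: float) -> float:
    """Empirical q-quantile with linear interpolation (an assumed estimator)."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def assign_reads(
    norm_sim: Dict[str, Dict[str, float]], r: float = 0.10
) -> Tuple[float, Dict[str, Optional[str]]]:
    """Assign each read to a PS_HP whose metric exceeds the (1 - r) quantile.

    norm_sim maps read id -> {PS_HP label -> normalized similarity}.
    A value of None means no metric was significant; per the text, such a
    read would go to both haplotypes of its nearest phase block.
    """
    # Pool every metric of every read into the vector chi.
    chi = [v for metrics in norm_sim.values() for v in metrics.values()]
    cutoff = quantile(chi, 1 - r)

    assignments: Dict[str, Optional[str]] = {}
    for read_id, metrics in norm_sim.items():
        # Assumption: if several metrics pass, keep the largest one.
        best_ps_hp, best_value = max(metrics.items(), key=lambda kv: kv[1])
        assignments[read_id] = best_ps_hp if best_value > cutoff else None
    return cutoff, assignments


# Illustrative (made-up) normalized similarities for two unphased reads.
example = {
    "read_a": {"PS1_HP1": 0.90, "PS1_HP2": 0.05, "PS2_HP1": 0.03, "PS2_HP2": 0.02},
    "read_b": {"PS1_HP1": 0.30, "PS1_HP2": 0.25, "PS2_HP1": 0.25, "PS2_HP2": 0.20},
}
cutoff, assignments = assign_reads(example, r=0.10)
print(cutoff, assignments)
```

Here `read_a` has one metric well above the pooled cut-off and is assigned to that PS_HP, while `read_b` has four similar metrics, none of which passes, so it falls back to the both-haplotypes rule.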

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
