Clustering

Jingwen Ren; Mark J. P. Chaisson; Jian Ma; Ferhat Ay

This method is extracted from research article: PLoS Comput Biol, Jun 2021

lra: A long read aligner for sequences and contigs

DOI: 10.1371/journal.pcbi.1009078

Although CG-SDP can be applied to all anchors A, for efficiency a greedy approach is first used to cluster anchors that are likely to lie together on an optimal chain. These clusters may be used to filter out spurious matches in low-accuracy reads, or may be chained directly for high-accuracy reads and contig mapping. When alignments are formed from chained clusters, a cluster refining step is needed that divides rough clusters into non-overlapping fine clusters; without it, chaining can skip over biological variation in repetitive sequences.

Rough clustering partitions anchors into clusters representing approximate intervals on the query and target that are aligned (Fig 3a), and serves to exclude noisy anchors that are unlikely to be chained in an alignment by CG-SDP. Denoting the forward diagonal of each anchor β_i as f_i = y_i − x_i, and the reverse diagonal as r_i = x_i + y_i, a sorted anchor order O = [o₁, …, o_n] is defined by ordering anchors by forward diagonal and then by x coordinate. A reverse sorted order $O^{rev} = [o_{1}^{rev}, \dots, o_{n}^{rev}]$ is defined similarly by sorting on reverse diagonal and then x coordinate. The reverse order is used to detect alignments on the reverse strand, but because the operations are the same as on the forward strand, only the steps using the forward sorted order are given here. The set of rough clusters is defined by partitioning O into non-overlapping intervals such that every anchor indexed in an interval has a diagonal within D^R of the preceding anchor in the interval. Intervals are assigned greedily: the first interval starts at the first index in O, and each subsequent interval starts after a gap of more than D^R between consecutive anchors. Intervals with few elements (defined by a minClusterSize parameter) are discarded, and the rough clusters R = {R₁, …, R_NR} are defined from the set of anchors included in each remaining interval. The value of D^R is chosen so that rough clusters are likely to contain at least minClusterSize true anchors from a read (minClusterSize defaults to 3 for CLR and ONT, and 10 for HiFi/contigs). We empirically determined that D^R = 200 is sufficient for CLR and ONT data, and D^R = 150 for HiFi and assembly contig alignment. For low-accuracy reads, chains are formed by running CG-SDP on all matches retained in rough clustering. For high-accuracy reads, the clusters must be post-processed with fine clustering before being chained.
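The greedy partition described above can be illustrated with a short Python sketch. This is not the lra implementation; the `Anchor` type, the function name `rough_clusters`, and the reading of "few elements" as "fewer than minClusterSize" are assumptions made for illustration. Only forward-strand ordering is shown; the reverse-strand case would sort on x + y instead.

```python
from typing import List, NamedTuple


class Anchor(NamedTuple):
    x: int  # anchor start on the x axis
    y: int  # anchor start on the y axis


def rough_clusters(anchors: List[Anchor],
                   max_diag_gap: int = 200,      # D^R (200 for CLR/ONT in the text)
                   min_cluster_size: int = 3) -> List[List[Anchor]]:
    """Greedily partition anchors by forward diagonal f = y - x."""
    if not anchors:
        return []
    # Sorted order O: forward diagonal first, then x coordinate.
    order = sorted(anchors, key=lambda a: (a.y - a.x, a.x))
    clusters = []
    current = [order[0]]
    for prev, cur in zip(order, order[1:]):
        # A gap of more than D^R between consecutive diagonals opens a new interval.
        if (cur.y - cur.x) - (prev.y - prev.x) > max_diag_gap:
            clusters.append(current)
            current = [cur]
        else:
            current.append(cur)
    clusters.append(current)
    # Assumption: intervals with fewer than min_cluster_size anchors are discarded.
    return [c for c in clusters if len(c) >= min_cluster_size]
```

With three anchors near diagonal 0, three near diagonal 1000, and a single stray anchor on a distant diagonal, this sketch keeps the two groups of three and drops the stray anchor as noise.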

Fig 3. a, Two rough clusters (blue and orange), which are far from each other on the reference. b, The initial four fine clusters defined from the contiguous stretches of unique anchors. c, Fine cluster-1 and fine cluster-2 are merged because their diagonal difference is smaller than D^F and the projected distance between their endpoints is smaller than G^dist. Fine cluster-3 and fine cluster-4 are not merged due to the large diagonal difference. d, Non-unique anchors in the trapezoid between fine cluster-1 and fine cluster-2 are added to the merged fine cluster-1, along with non-unique anchors in the trapezoid defined by the start of rough cluster-2 and the start of fine cluster-3. e, Three fine clusters are obtained after rough clustering and fine clustering. f-h, Splitting of overlapping fine clusters: f, Overlap of clusters. g, Boundaries of split clusters defined by a start (red dot) and an end (blue dot). h, The optimal chain of split super-fragments.

For mapping high-accuracy reads, each rough cluster is processed independently by dividing it into non-overlapping fine clusters, where each fine cluster consists of anchors whose diagonals are within D^F of each other, and the endpoints of different fine clusters do not overlap. The first step of CG-SDP is then applied to chain the fine clusters and find an approximate alignment between q and t. Each fine cluster C_j is defined by all of the anchors contained in the cluster, and by endpoints $(x_{j}^{s}, y_{j}^{s})$ and $(x_{j}^{e}, y_{j}^{e})$, where $x_{j}^{s}$, $y_{j}^{s}$ are the minimum x, y coordinates of the starting points of all the anchors in C_j, and $x_{j}^{e}$, $y_{j}^{e}$ are the maximum x, y coordinates of the ending points of all the anchors in C_j. To define the fine clusters, anchors in each rough cluster are first sorted by Cartesian coordinate. Within each rough cluster, an anchor is defined as unique when the k-mer of the match is not repeated in the cluster. Fine clusters are initialized as runs of unique anchors in the Cartesian ordering that are on a close diagonal and for which the distance between the end of one anchor and the start of the next is small (Fig 3b). Every pair of fine clusters C_j and C_k is merged if their endpoints have a diagonal difference smaller than D^F and are within Cartesian distance G^dist (Fig 3c), and all non-unique anchors within the trapezoid defined by $[(x_{j}^{e}, y_{j}^{e} - D^{F}), (x_{j}^{e}, y_{j}^{e} + D^{F}), (x_{k}^{s}, y_{k}^{s} - D^{F}), (x_{k}^{s}, y_{k}^{s} + D^{F})]$ are included in the merged fine cluster (Fig 3d and 3e). The remaining non-unique anchors that are not added to a fine cluster are discarded. For fine clustering, we empirically found that D^F = 500 was able to separate anchors in different tandem repeats while still allowing a sufficient number of repetitive anchors to be included in the fine clusters.
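The endpoint definition, the merge test, and the trapezoid membership test can be sketched in Python as follows. This is an illustrative reading of the text, not the lra code. Several details are assumptions: anchors are modelled as (x, y, length) with the ending point at (x + length, y + length); the merge test compares the end of C_j with the start of C_k; "Cartesian distance" is taken as Euclidean distance; an anchor is tested against the trapezoid by its starting point; and the default `g_dist` is a placeholder, since no value of G^dist is given here.

```python
import math
from typing import List, NamedTuple, Tuple

Point = Tuple[float, float]


class Anchor(NamedTuple):
    x: int       # anchor start on the x axis
    y: int       # anchor start on the y axis
    length: int  # assumed match length; end point is (x + length, y + length)


def endpoints(cluster: List[Anchor]) -> Tuple[Point, Point]:
    """Return ((x^s, y^s), (x^e, y^e)) for a fine cluster."""
    start = (min(a.x for a in cluster), min(a.y for a in cluster))
    end = (max(a.x + a.length for a in cluster),
           max(a.y + a.length for a in cluster))
    return start, end


def should_merge(cj: List[Anchor], ck: List[Anchor],
                 d_f: float = 500, g_dist: float = 1000) -> bool:
    """Merge test between the end of cj and the start of ck (g_dist is a placeholder)."""
    _, (xe, ye) = endpoints(cj)
    (xs, ys), _ = endpoints(ck)
    diag_diff = abs((ys - xs) - (ye - xe))
    gap = math.hypot(xs - xe, ys - ye)  # assumed Euclidean distance
    return diag_diff < d_f and gap < g_dist


def in_trapezoid(anchor: Anchor, end_j: Point, start_k: Point,
                 d_f: float = 500) -> bool:
    """True if the anchor start lies in the trapezoid spanned by end_j and start_k."""
    (xe, ye), (xs, ys) = end_j, start_k
    if not (xe <= anchor.x <= xs):
        return False
    if xs == xe:  # degenerate case: both vertical edges share one x coordinate
        return min(ye, ys) - d_f <= anchor.y <= max(ye, ys) + d_f
    # Centre line runs from end_j to start_k; the trapezoid extends d_f above and below it.
    t = (anchor.x - xe) / (xs - xe)
    centre = ye + t * (ys - ye)
    return abs(anchor.y - centre) <= d_f
```

In this sketch, a non-unique anchor would be added to the merged cluster when `in_trapezoid` returns true for the end of C_j and the start of C_k, and discarded otherwise.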

Copyright and License information: ©2021 Ren, Chaisson; ©2021 Ma, Ay.

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
