Dynamic genome warping

Saulius Lukauskas; Roberto Visintainer; Guido Sanguinetti; Gabriele B. Schweikert

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

Dynamic genome warping

SL Saulius Lukauskas

RV Roberto Visintainer

GS Guido Sanguinetti

GS Gabriele B. Schweikert

This method is extracted from research article: BMC Bioinformatics, Dec 2016

DGW: an exploratory data analysis tool for clustering and visualisation of epigenomic marks

DOI: 10.1186/s12859-016-1306-0

Request a Protocol

Ask a question

Favorite

To automatically quantify the similarities between peaks such as the ones shown in Fig. Fig.11 we use the classic Dynamic Time Warping (DTW) algorithm [11,14]. A modern review of the basic concepts of Dynamic Time Warping can be found e.g. in [12]. It was originally introduced in the speech recognition community to robustly recognize speech independently of speech speed. There, the problem was to match waveforms of similar shape but potentially different duration. Likewise, our aim is to be able to associate peaks which exhibit similar local structure (shape) regardless of their spatial extension.

Specifically, let a=(a ₁,…,a _N) and b=(b ₁,…,b _M) be two sequences with values a_i, b_i ∈ 𝒮, where 𝒮 is a metric space equipped with local distance d:𝒮 × 𝒮 → ℝ (e.g. squared Euclidean distance or Cosine distance). DTW uses dynamic programming to construct a warping path $p = (p_{1}^{0}, p_{1}^{1}), \dots, (p_{i}^{0}, p_{i}^{1}), \dots, (p_{K}^{0}, p_{K}^{1})$ , i.e. two sets of indices identifying the elements of the two sequences which are mapped to each other in order to minimise the sum of the local distances. In formulae,

subject to the following constraints

$p_{1}^{0} = p_{1}^{1} = 1$ , the first points of both sequences are mapped to each other;

$p_{K}^{0} = N$ , $p_{K}^{1} = M$ , the end points of both sequences are mapped to each other;

$0 \leq p_{i + 1}^{j} - p_{i}^{j} \leq 1$ for all i=1,…,K and j=0,1, each index set is non-decreasing with maximum step one. This ensures that every point in each sequence gets mapped to at least one point on the other sequence.

Algorithmically, DTW is very similar to the classical alignment algorithms such as Needleman-Wunsch and Smith-Waterman: it assumes an optimal alignment between subsequences, iterates by selecting the optimal next move and recovers the optimal global alignment by backtracking. As such, it entails constructing a matrix of size M×N, which determines the computational complexity of the algorithm: Computing pairwise DTW distances between all peaks is therefore the computationally most expensive step, as it involves computing $O (N_{peaks}^{2})$ DTW distances, each of which is O(M×N). In Fig. Fig.22 we show how the first two peaks in Fig. Fig.11 are aligned onto each other using DTW. Notice that the pure DTW algorithm allows arbitrarily long stretches to be compressed to a single point. This behaviour may be undesirable, and simple modifications are implemented such as an upper limit on the length of compressed regions (Sakoe-Chiba band [11]), or an exponential penalty on compressing/stretching. By applying the Sakoe-Chiba Band constraint we can also reduce the run-time to O(k×m a x(N,M)), where k is the width of the band, that can be chosen to be small. Novel strategies to reduce the computational load are however emerging [15], and it would be interesting to integrate such ideas in the epigenomic context.

DGW alignment of two H3K4me3 profiles. a Shown is the distance density matrix for two peaks (Peak 188 on the x-axis and Peak 280 on the y axis). Colour coding corresponds to local Euclidean distances from small (green) to large (red). Optimal path is shown in blue. b Mapping between the two profiles. c Dynamically aligned profiles and total distance D between the two peaks

DGW readily extends to multi-dimensional data if more than one epigenomic track is analysed: In this case a and b become sequences of vectors, e.g. (a ₁,…,a _N), that each contain the coverage of each mark at time point i. In this way, the different epigenomic marks are given equal weight, however other weighting schemes can easily be implemented.

In addition to the optimal path between two sequences we also report their total distance under the optimal warping which will subsequently be used for the clustering of peaks. Note, when using squared Euclidean distance as local distance measure, both, differences in peak shapes as well as in enrichment levels contribute to the overall DTW distance. If this is not desired the peaks can optionally be normalized by the respective peak heights, and the Cosine distance can be used as local distance. To account for potential strand specificity of epigenomic marks we compute two distances for every pairwise peak comparison: one with the two sequences unchanged, and one with one of them reversed. The smaller distance between the two is then returned as the true distance between the patterns.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol