A General Optimization Framework for Dynamic Time Warping
Dave Deriso in collaboration with Stephen Boyd · Jan 26, 2026
Almost all forms of group signal processing involve a co-registration step, where two signals are approximately aligned so that their amplitudes and phases can be compared. The classic algorithm for performing this alignment is dynamic time warping (DTW). DTW appears everywhere: speech recognition, financial time series, robotics, and, perhaps most critically, biomedical signal processing. The core idea is simple: find a way to "warp" time, i.e. speed it up or slow it down, so that two signals align with each other.
But the classical DTW algorithm has a longstanding, unsolved problem: it produces sharp irregularities in the time warp function that cause many time points of one signal to be erroneously mapped onto a single point in the other signal. Such errors are known as singularities.
In this post, we'll introduce a general optimization framework that eliminates singularities without preprocessing hacks or ad-hoc step patterns. We pose DTW as a continuous-time optimal control problem, solve it efficiently via dynamic programming with iterative refinement, and demonstrate dramatically smoother alignments on real ECG signals.
Interactive Demo
Choose a dataset and adjust the sliders to interactively explore how the smoothing and cumulative warp regularizers affect the results. The demo shows a sine wave warped by a quadratic time function φ(t) = t². Higher smoothing values yield smoother warping functions, while lower values may produce singularities; higher cumulative warp values limit the total amount of warping, preventing over-fitting.
The problem with classical DTW
Suppose we have two signals $x$ and $y$, and we want to find a time warping function $\phi$ such that the time-warped signal $x(\phi(t)) \approx y(t)$. The classical DTW algorithm uses dynamic programming to find a warping path that minimizes misalignment while satisfying monotonicity, boundary, and continuity constraints.
The monotonicity constraint ensures the path represents a monotone increasing function of time. The boundary constraint enforces that the warping begins at the origin of both signals and ends at their terminal points. The continuity constraint restricts transitions to adjacent points in time.
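To make the classical recursion concrete, here is a minimal NumPy sketch of the standard quadratic-time dynamic program (the function name and the squared pointwise cost are our choices for illustration, not code from any particular DTW library):

```python
import numpy as np

def classical_dtw(x, y):
    """Classical DTW alignment cost via the standard O(nm) dynamic program."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Continuity: steps only from (i-1,j), (i,j-1), or (i-1,j-1);
            # monotonicity is implicit in the forward recursion
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]  # boundary: the path runs corner to corner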
Despite its popularity, classical DTW has a persistent issue: the warping functions it produces are often jagged, with many time points of one signal mapping to a single point (a “singularity”) in the other.
Most approaches to reducing singularities fall into two camps:
- Preprocessing the input signals — transformations like derivatives, square-root velocity functions, or wavelet transforms that indirectly influence warping smoothness
- Step pattern variations — relaxing continuity constraints by expanding allowable transitions via “symmetric” or “asymmetric” patterns of various types
Both approaches are ad-hoc and often require hand-tuning for different signal types.
A continuous-time optimization framework
We propose handling these issues entirely within an optimization framework in continuous time. Instead of finding discrete paths, we pose DTW as choosing a time warp function $\phi$ that minimizes

$$f(\phi) = \mathcal{L}(\phi) + \lambda^{\mathrm{cuml}}\, \mathcal{R}^{\mathrm{cuml}}(\phi) + \lambda^{\mathrm{inst}}\, \mathcal{R}^{\mathrm{inst}}(\phi)$$

subject to $\phi(0) = 0$ and $\phi(1) = 1$.
Let's unpack each term.
The loss functional
The loss functional $\mathcal{L}$ measures how well the time-warped signal approximates the target:

$$\mathcal{L}(\phi) = \int_0^1 L\big(x(\phi(t)) - y(t)\big)\, dt$$

This is simply the average penalty for the difference between the time-warped first signal and the second signal. Common choices for the penalty $L$, sketched in code below, include:
- Squared loss: $L(r) = r^2$
- Absolute loss: $L(r) = |r|$
- Huber loss: squared for small deviations, linear for outliers
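As a concrete sketch (scalar signals assumed; the Huber scaling here is one common convention chosen so the branches join continuously):

```python
import numpy as np

def squared_loss(r):
    return r ** 2

def absolute_loss(r):
    return np.abs(r)

def huber_loss(r, delta=1.0):
    # Quadratic inside [-delta, delta], linear (and continuous) outside
    return np.where(np.abs(r) <= delta, r ** 2, delta * (2 * np.abs(r) - delta))
```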
Cumulative warp regularization
The cumulative warp, $\phi(t) - t$, measures how much the warped time deviates from the original time at each moment. We regularize this with:

$$\mathcal{R}^{\mathrm{cuml}}(\phi) = \int_0^1 R^{\mathrm{cuml}}\big(\phi(t) - t\big)\, dt$$
This term limits overfitting, similar to ridge or lasso regularization in machine learning. The larger $\lambda^{\mathrm{cuml}}$, the less $\phi(t)$ deviates from $t$.
Instantaneous warp regularization
The instantaneous rate of time warping, $\phi'(t)$, measures how quickly the warping is changing. We regularize this with:

$$\mathcal{R}^{\mathrm{inst}}(\phi) = \int_0^1 R^{\mathrm{inst}}\big(\phi'(t)\big)\, dt$$
This is the key to eliminating singularities. By penalizing large instantaneous warping rates, we directly encourage smooth warping functions. The larger $\lambda^{\mathrm{inst}}$, the smoother the time warping function.
Constraints via infinite penalties
A clever aspect of this formulation: we can impose hard constraints by allowing the regularization functions to take the value $+\infty$. For example, requiring that the slope $\phi'(t)$ stay between bounds $s^{\min}$ and $s^{\max}$:

$$R^{\mathrm{inst}}(s) = \begin{cases} 0 & s^{\min} \le s \le s^{\max} \\ +\infty & \text{otherwise} \end{cases}$$
This elegantly unifies soft penalties and hard constraints in a single framework.
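In code, an extended-value regularizer is just a function that returns np.inf outside the feasible set. Here is a sketch combining a quadratic slope penalty with hard slope bounds (the function name, parameter names, and defaults are ours, for illustration only):

```python
import numpy as np

def r_inst(s, s_min=0.1, s_max=10.0):
    """Quadratic penalty on the slope, +inf outside [s_min, s_max]."""
    penalty = (s - 1.0) ** 2              # prefer slope near 1 (no warping)
    feasible = (s >= s_min) & (s <= s_max)
    return np.where(feasible, penalty, np.inf)
```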
Solving via dynamic programming with refinement
The optimization problem we've formulated is a classical continuous-time optimal control problem with scalar state $\phi(t)$ and action $\phi'(t)$. While such problems are generally intractable to solve exactly, this particular problem, with state dimension one, can be solved efficiently by discretization and dynamic programming.
Discretization
We discretize time into $N$ values $t_1, \dots, t_N$ and assume $\phi$ is piecewise linear with knot points at these times. This gives us a vector $\tau \in \mathbf{R}^N$ where $\tau_i = \phi(t_i)$.
We further discretize the values each $\tau_i$ can take into $M$ possibilities within bounds determined by the slope constraints.
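For intuition, here is one way to compute those per-knot bounds and candidate values, with time normalized to $[0, 1]$ (a sketch; the function name, variable names, and defaults are ours):

```python
import numpy as np

def value_grid(t, s_min=0.1, s_max=10.0, M=50):
    """Candidate values for each tau_i, respecting slope and boundary constraints.

    t: (N,) array of knot times in [0, 1].
    Returns an (N, M) array of allowed values per knot.
    """
    # Each tau_i must be reachable from phi(0) = 0 going forward
    # and from phi(1) = 1 going backward, at slopes in [s_min, s_max]
    lo = np.maximum(s_min * t, 1.0 - s_max * (1.0 - t))
    hi = np.minimum(s_max * t, 1.0 - s_min * (1.0 - t))
    return np.linspace(lo, hi, M).T
```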
Shortest path formulation
The discretized problem becomes a shortest path problem on a graph. We form a graph with $NM$ nodes, where each node $(i, j)$ (for time index $i$ and value index $j$) has outgoing edges to nodes at the next time step.
At each node, we associate the state cost (loss plus cumulative regularization). On each edge, we associate the instantaneous regularization cost. The total cost of a path from the initial node to the terminal node is exactly our discretized objective.
Standard dynamic programming finds the globally optimal path in $O(NM^2)$ time.
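Here is a minimal NumPy sketch of that dynamic program over the trellis (the array layout and names are our own; slope and boundary constraints enter simply as np.inf entries in the cost arrays):

```python
import numpy as np

def trellis_shortest_path(node_cost, edge_cost):
    """O(N M^2) dynamic program over the discretized warping graph.

    node_cost: (N, M) array, loss + cumulative regularization at node (i, j);
               np.inf at disallowed start/end values encodes the boundary constraints.
    edge_cost: (N-1, M, M) array, instantaneous regularization on the edge
               (i, j) -> (i+1, k); np.inf entries encode the slope constraints.
    """
    N, M = node_cost.shape
    V = np.full((N, M), np.inf)       # value function
    back = np.zeros((N, M), dtype=int)  # backpointers
    V[0] = node_cost[0]
    for i in range(1, N):
        # candidate[j, k]: cost of arriving at (i, k) from (i-1, j)
        candidate = V[i - 1][:, None] + edge_cost[i - 1]
        back[i] = np.argmin(candidate, axis=0)
        V[i] = candidate[back[i], np.arange(M)] + node_cost[i]
    # Recover the optimal path by backtracking from the best terminal node
    path = [int(np.argmin(V[-1]))]
    for i in range(N - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1], float(V[-1].min())
```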
Iterative refinement
After solving the discretized problem, we can reduce discretization error by iteratively refining the bounds around the current solution. At each iteration, we shrink the range by a fixed fraction (e.g., 1/2 or 1/8), centering the new bounds around the previously found $\tau$.
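The refinement step itself is simple. A sketch (eta is the shrink fraction; clipping the new intervals to the original feasible bounds is our addition to keep the sketch self-consistent):

```python
import numpy as np

def refine_bounds(tau, lo, hi, eta=0.5):
    """Center shrunken per-knot search intervals on the current solution tau."""
    half_width = eta * (hi - lo) / 2.0
    return np.maximum(lo, tau - half_width), np.minimum(hi, tau + half_width)
```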
This is similar in spirit to FastDTW's coarse-to-fine approach, but our method is more principled: we're iteratively solving the same continuous-time problem at higher and higher resolution.
In practice, only 3-4 iterations are needed before convergence. On a standard laptop, a typical alignment takes about 0.25 seconds.
ECG signal alignment
Let's see how this works on real ECG signals. We compare our method with varying amounts of instantaneous regularization ($\lambda^{\mathrm{inst}}$) against FastDTW.
The difference is striking. FastDTW produces warping functions with numerous singularities: flat regions where many time points map to a single location. Our method, even with light regularization (small $\lambda^{\mathrm{inst}}$), produces noticeably smoother warps. As we increase $\lambda^{\mathrm{inst}}$, the warping functions become progressively smoother.
The aligned signals tell the story. With FastDTW, the P-wave, QRS complex, and T-wave of the ECG all show distortions from singularities. With our method, the morphological features of the heartbeat are preserved through the alignment process. The subtle bumps and inflections that cardiologists rely on for diagnosis remain intact.
Why smoothness matters for ECG
In cardiac signal processing, we often need to:
- Build patient-specific templates by averaging multiple heartbeats
- Detect arrhythmias by comparing new beats to these templates
- Track changes in cardiac function over time
If singularities distort the alignment, the averaged template will be smeared and distorted. Subtle but clinically important features like notched R-waves or biphasic T-waves can be lost entirely.
By controlling smoothness through $\lambda^{\mathrm{inst}}$, we can tune the alignment to preserve exactly the level of morphological detail needed for downstream analysis.
Lasso and ridge intuition
The choice of regularization function has an elegant interpretation from machine learning.
Quadratic (ridge) regularization: With $R^{\mathrm{cuml}}(u) = u^2$ or $R^{\mathrm{inst}}(s) = (s - 1)^2$, large deviations are penalized quadratically. This discourages extreme warping but allows moderate warping everywhere.
Absolute value (lasso) regularization: With absolute-value penalties, we get sparsity.
- When $R^{\mathrm{cuml}}(u) = |u|$, we expect many times when $\phi(t) = t$ (no cumulative warp)
- When $R^{\mathrm{inst}}(s) = |s - 1|$, we expect many times when $\phi'(t) = 1$ (no instantaneous warping)
This is exactly analogous to how lasso produces sparse weights in regression — the warping function will be “sparse” in its deviations from the identity.
Extensions
Group alignment and centering
Given a set of signals $x_1, \dots, x_K$, we can find a common target $y$ that all signals can be warped to by minimizing

$$\sum_{k=1}^{K} \left( \mathcal{L}(\phi_k) + \lambda^{\mathrm{cuml}}\, \mathcal{R}^{\mathrm{cuml}}(\phi_k) + \lambda^{\mathrm{inst}}\, \mathcal{R}^{\mathrm{inst}}(\phi_k) \right)$$

over both the warping functions $\phi_1, \dots, \phi_K$ and the target $y$, where $\mathcal{L}(\phi_k)$ now measures the mismatch between $x_k(\phi_k(t))$ and $y(t)$.
This can be solved by alternating two steps, sketched in code below:
- Fix $y$, solve $K$ independent DTW problems (in parallel)
- Fix the $\phi_k$, update $y$ as the pointwise mean (for squared loss) or median (for absolute loss) of the warped signals
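A minimal sketch of this alternating scheme, assuming the gdtw.align API shown in the Implementation section below (the use of .path in sample units, the interpolation step, and the fixed iteration count are all our assumptions, not the package's group-alignment utility):

```python
import numpy as np
import gdtw  # align API as shown in the Implementation section

def group_align(signals, n_iters=5):
    """Alternate between warping each signal to the target and re-averaging."""
    y = np.mean(signals, axis=0)   # initialize target as the pointwise mean
    t = np.arange(len(y))
    for _ in range(n_iters):
        # Step 1: fix y, warp each signal to it (independent, parallelizable)
        phis = [gdtw.align(x, y).path for x in signals]
        warped = [np.interp(phi, t, x) for phi, x in zip(phis, signals)]
        # Step 2: fix the warps, update y (use np.median for absolute loss)
        y = np.mean(warped, axis=0)
    return y
```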
For ECG analysis, this lets us compute a time-warped mean heartbeat that captures the common morphology while accounting for heart rate variability.
Validation via train/test splits
Because our method handles non-uniformly sampled signals naturally, we can use out-of-sample validation to select hyperparameters.
We randomly split the time points into training and test sets, solve the DTW problem using only training points, and evaluate the loss on test points. This lets us detect when a warping is “overfit” — achieving low loss on training points but poor generalization.
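A sketch of the splitting step (the function name, split fraction, and seeding are our choices): fit the warp on the training indices, which leaves a non-uniformly sampled signal, then evaluate the loss at the held-out indices.

```python
import numpy as np

def split_time_points(n, train_frac=0.8, seed=0):
    """Randomly partition n sample indices into sorted train/test index sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(train_frac * n)
    return np.sort(idx[:n_train]), np.sort(idx[n_train:])
```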
To the best of our knowledge, this is the first use of train/test validation for model selection in DTW by partitioning samples from a single signal.
Implementation
The algorithm is implemented as the open-source Python package gdtw, with the dynamic programming core written in C++ for efficiency. The API is designed to be intuitive:
```python
import numpy as np
import gdtw

# Example signals: a sine wave and a time-warped copy of it
t = np.linspace(0, 1, 500)
x = np.sin(2 * np.pi * t)
y = np.sin(2 * np.pi * t ** 2)

# Align x to y with default parameters
result = gdtw.align(x, y)

# Access the warping function
phi = result.path

# Customize regularization
result = gdtw.align(x, y, lambda_cuml=0.1, lambda_inst=0.5)
```

The package handles irregular sampling, supports custom loss and regularization functions, and includes utilities for group alignment and validation.
Conclusion
We've presented a general optimization framework for dynamic time warping that:
- Eliminates singularities through explicit regularization of instantaneous warping rates
- Unifies soft penalties and hard constraints via extended-real-valued regularization functions
- Provides efficient, globally optimal solutions via discretization and dynamic programming
- Enables principled model selection through out-of-sample validation on individual signals
For ECG signal processing and other applications where signal morphology matters, this framework offers a principled alternative to ad-hoc preprocessing and step patterns.
The gdtw package is available at github.com/dderiso/gdtw.
Citation
@article{deriso2026gdtw,
author = {Dave Deriso and Stephen Boyd},
title = {A General Optimization Framework for Dynamic Time Warping},
year = {2026}
}