Dynamic Time Warping k-Nearest Neighbors Classifier (DTWKNN)

Recall that the isolated sequences we are dealing with are represented as multivariate time series of different durations.
Suppose that our sequences are all \(D\)-dimensional. The main requirement of k-Nearest Neighbor (\(k\)-NN) classifiers is that each example must have the same number of dimensions – and hence, be in the same feature space. This is indeed the case with our \(D\)-dimensional sequences. However, we can’t use \(k\)-NN with simple distance metrics such as Euclidean distance because we are comparing sequences (which represent an ordered collection of points in \(D\)-dimensional space) rather than individual points in \(D\)-dimensional space.

One distance metric that allows us to compare multivariate sequences of different length is Dynamic Time Warping. Coupling this metric with \(k\)-NN creates a powerful classifier that assigns the class of a new observation sequence by looking at the classes of observation sequences with similar patterns.
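To make the comparison concrete, here is a minimal sketch of the classic DTW distance between two multivariate sequences of different durations, computed by dynamic programming over pairwise Euclidean frame distances. This is an illustration only — the classifier itself uses the FastDTW approximation described below.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW between two (T, D) sequences."""
    n, m = len(a), len(b)
    # Pairwise Euclidean distances between every frame of a and of b
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Accumulated cost matrix, padded with an extra row/column of infinities
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # insertion
                acc[i, j - 1],      # deletion
                acc[i - 1, j - 1],  # match
            )
    return acc[n, m]

# Two 3-dimensional sequences of different durations
a = np.random.random((10, 3))
b = np.random.random((15, 3))
print(dtw_distance(a, b))
```

Note that the warping alignment lets a 10-frame sequence be compared directly against a 15-frame one, which a pointwise Euclidean distance cannot do.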

However, \(k\)-NN classifiers are lazy learners: fitting simply stores the training data, so predicting the class of a new observation sequence requires a comparison against every observation sequence that was used to fit the model. To speed up prediction times, we have chosen to use a constrained DTW algorithm that sacrifices some accuracy by calculating an approximate distance, but saves a lot of time. This is the FastDTW implementation, whose radius parameter controls the constraint imposed on the distance calculation.
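To give a feel for what a constraint buys, below is a hedged sketch of DTW restricted to a Sakoe-Chiba band. Note that FastDTW itself works differently (it recursively refines a coarse alignment within a radius of the projected warp path), but both approaches limit which cells of the accumulated cost matrix are evaluated, trading exactness for speed.

```python
import numpy as np

def banded_dtw(a, b, radius):
    """DTW evaluated only within a Sakoe-Chiba band of the given radius.

    An illustration of constrained DTW in general -- not FastDTW's
    actual recursive coarsening scheme.
    """
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        # Only visit cells within `radius` of the (length-scaled) diagonal
        centre = int(round(i * m / n))
        lo = max(1, centre - radius)
        hi = min(m, centre + radius)
        for j in range(lo, hi + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            acc[i, j] = d + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]

a = np.random.random((12, 3))
b = np.random.random((16, 3))
print(banded_dtw(a, b, radius=3))
```

The full algorithm touches all \(n \times m\) cells, whereas the banded version touches roughly \(n \times (2 \cdot \text{radius} + 1)\), at the cost of possibly missing the optimal warp path.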

This approximate DTW \(k\)-NN classifier is implemented by the DTWKNN class.

Example

import numpy as np
from sequentia.classifiers import DTWKNN

# Create some sample data
X = [np.random.random((10 * i, 3)) for i in range(1, 4)]
y = ['class0', 'class1', 'class1']

# Create and fit the classifier
clf = DTWKNN(k=1, radius=5)
clf.fit(X, y)

# Predict labels for the training data (just as an example)
clf.predict(X)

For more elaborate examples, please have a look at the example notebooks.

API reference

class sequentia.classifiers.dtwknn.DTWKNN(k, radius, metric=<function euclidean>)[source]

A k-Nearest Neighbor classifier that compares observation sequences of differing length using the efficient FastDTW dynamic time warping algorithm.

Parameters:
k: int

Number of neighbors.

radius: int

Radius parameter for FastDTW.

See: Stan Salvador and Philip Chan. “FastDTW: Toward accurate dynamic time warping in linear time and space.” Intelligent Data Analysis 11.5 (2007), 561-580.

metric: callable

Distance metric for FastDTW.

fit(self, X, y)[source]

Fits the classifier by adding labeled training observation sequences.

Parameters:
X: List[numpy.ndarray]

A list of multiple observation sequences.

y: List[str]

A list of labels for the observation sequences.

predict(self, X, verbose=True, n_jobs=1)[source]

Predicts the label for an observation sequence (or multiple sequences).

Parameters:
X: numpy.ndarray or List[numpy.ndarray]

An individual observation sequence or a list of multiple observation sequences.

verbose: bool

Whether to display a progress bar or not.

n_jobs: int

The number of jobs to run in parallel. Setting this to -1 will use all available CPU cores.

Returns:
prediction(s): str or List[str]

The predicted label(s) for the observation sequence(s).
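To illustrate what a prediction amounts to in the simplest case, here is a sketch of 1-nearest-neighbour classification using exact DTW. This is not the library's implementation (which uses the FastDTW approximation and supports \(k > 1\)); it only shows the principle: compute the distance from the query to every stored training sequence and return the label of the closest one.

```python
import numpy as np

def dtw(a, b):
    """Exact DTW distance between two (T, D) sequences (illustration only)."""
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            acc[i, j] = d + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]

def predict_1nn(query, X_train, y_train):
    """Label of the training sequence with the smallest DTW distance."""
    distances = [dtw(query, x) for x in X_train]
    return y_train[int(np.argmin(distances))]

X = [np.random.random((10 * i, 3)) for i in range(1, 4)]
y = ['class0', 'class1', 'class1']
# A training sequence is its own nearest neighbour at distance zero
print(predict_1nn(X[0], X, y))  # prints: class0
```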

evaluate(self, X, y, labels=None, verbose=True, n_jobs=1)[source]

Evaluates the performance of the classifier on a batch of observation sequences and their labels.

Parameters:
X: List[numpy.ndarray]

A list of multiple observation sequences.

y: List[str]

A list of labels for the observation sequences.

labels: List[str]

A list of labels for ordering the axes of the confusion matrix.

verbose: bool

Whether to display a progress bar for predictions or not.

n_jobs: int

The number of jobs to run in parallel. Setting this to -1 will use all available CPU cores.

Returns:
accuracy: float

The categorical accuracy of the classifier on the observation sequences.

confusion: numpy.ndarray

The confusion matrix representing the discrepancy between predicted and actual labels.
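For reference, the two returned quantities can be sketched from lists of true and predicted labels. This is a hedged illustration of the general definitions (rows as true labels, columns as predicted labels, ordered by `labels`), not necessarily the class's internal convention.

```python
import numpy as np

def accuracy_and_confusion(y_true, y_pred, labels):
    """Categorical accuracy, plus a confusion matrix whose rows are true
    labels and columns are predicted labels, ordered as in `labels`."""
    index = {label: i for i, label in enumerate(labels)}
    confusion = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        confusion[index[t], index[p]] += 1
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return acc, confusion

acc, conf = accuracy_and_confusion(
    ['class0', 'class1', 'class1'],   # true labels
    ['class0', 'class1', 'class0'],   # predicted labels
    labels=['class0', 'class1'],
)
print(acc)   # 0.6666666666666666
print(conf)  # [[1 0]
             #  [1 1]]
```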