# KNN Classifier¶

The KNN Classifier is a classifier that uses the $$k$$-NN algorithm with DTW as a distance measure to identify a $$k$$-neighborhood of the most similar training sequences to the sequence being classified.

To classify a sequence $$O'$$, the KNNClassifier works by:

1. Calculating the DTW distance between $$O'$$ and every training sequence.
2. Forming a k-neighborhood $$\mathcal{K}'=\left\{O^{(1)},\ldots,O^{(k)}\right\}$$ of the $$k$$ nearest training sequences to $$O'$$.
3. Calculating a distance weighting for each sequence in $$\mathcal{K}'$$.
A uniform weighting of 1 is used by default, meaning that all sequences in $$\mathcal{K}'$$ have equal influence on the predicted class. However, custom functions such as $$e^{-x}$$ (where $$x$$ is the DTW distance) can be specified to increase classification weight on training sequences that are more similar to $$O'$$.
4. Calculating a score for each of the unique classes corresponding to the sequences in $$\mathcal{K}'$$.
The score for each class is calculated as the sum of the distance weightings of all sequences in $$\mathcal{K}'$$ belonging to that class.
5. Selecting the class with the highest score.
If there is a tie between classes, a class is randomly selected between the tied classes.

## API reference¶

### Class¶

 KNNClassifier A k-nearest neighbor classifier that uses DTW as a distance measure for sequence comparison.

### Methods¶

 __init__(*[, k, weighting, window, ...]) Initializes the KNNClassifier. compute_distance_matrix(X[, lengths]) Calculates a matrix of DTW distances between the sequences in X and the training sequences. dtw(A, B, **kwargs) Calculates the DTW distance between two univariate or multivariate sequences. fit(X, y[, lengths]) Fits the classifier to the sequence(s) in X. fit_predict(X, y[, lengths]) Fits the classifier to the sequence(s) in X and predicts classes for X. load(path) Loads and deserializes a fitted KNN estimator. plot_dtw_histogram(X[, lengths, ax]) Calculates DTW distances between X and training sequences, and plots a distance histogram. plot_warping_path_1d(a, b, **kwargs) Calculates the DTW matrix between two sequences and plots the optimal warping path. plot_weight_histogram(X[, lengths, ax]) Calculates DTW weights between X and training sequences, and plots a weight histogram. predict(X[, lengths]) Predicts classes for the sequence(s) in X. predict_proba(X[, lengths]) Predicts class probabilities for the sequence(s) in X. predict_scores(X[, lengths]) Predicts class scores for the sequence(s) in X. query_neighbors(X[, lengths, sort]) Queries the k-nearest training observation sequences to each sequence in X. save(path) Serializes and saves a fitted KNN estimator. score(X, y, lengths[, normalize, sample_weight]) Calculates accuracy for the sequence(s) in X.

class sequentia.models.knn.classifier.KNNClassifier[source]

A k-nearest neighbor classifier that uses DTW as a distance measure for sequence comparison.

The classifier computes the score for each class as the total of the distance weightings of every sequence belonging to that class, within the DTW k-neighborhood of the sequence being classified.

Examples

Using a KNNClassifier to classify spoken digits.

import numpy as np
from sequentia.models.knn import KNNClassifier

# Seed for reproducible pseudo-randomness
random_state = np.random.RandomState(1)

# Fetch MFCCs of spoken digits
train_data, test_data = data.split(test_size=0.2, random_state=random_state)

# Create a HMMClassifier using a class frequency prior
clf = KNNClassifier()

# Fit the classifier
X_train, y_train, lengths_train = train_data.X_y_lengths
clf.fit(X_train, y_train, lengths_train)

# Predict classes for the test observation sequences
X_test, lengths_test = test_data.X_lengths
y_pred = clf.predict(X_test, lengths_test)

__init__(*, k=1, weighting=None, window=1, independent=False, classes=None, use_c=False, n_jobs=1, random_state=None)[source]

Initializes the KNNClassifier.

Parameters:
• k (PositiveInt) – Number of neighbors.

• weighting (Callable | None) – A callable that specifies how distance weighting should be performed. The callable should accept a numpy.ndarray of DTW distances, apply an element-wise weighting transformation to the matrix of DTW distances, then return an equally-sized numpy.ndarray of weightings. If None, then a uniform weighting of 1 will be applied to all distances.

• window (ConstrainedFloatValue) –

The size of the Sakoe—Chiba band global constrant as a fraction of the length of the shortest of the two sequences being compared.

• A larger window will give more freedom to the DTW alignment, allowing more deviation but leading to potentially slower computation. A window of 1 is equivalent to full DTW computation with no global constraint applied.

• A smaller window will restrict the DTW alignment, and possibly speed up the DTW computation. A window of 0 is equivalent to Euclidean distance.

• independent (bool) – Whether or not to allow features to be warped independently from each other. See [1] for an overview of independent and dependent dynamic time warping.

• classes (Array | None) –

Set of possible class labels.

• use_c (bool) – Whether or not to use fast pure C compiled functions from dtaidistance to perform the DTW computations.

• n_jobs (NegativeInt | PositiveInt) –

Maximum number of concurrently running workers.

• If 1, no parallelism is used at all (useful for debugging).

• If -1, all CPUs are used.

• If < -1, (n_cpus + 1 + n_jobs) are used — e.g. n_jobs=-2 uses all but one.

• random_state (NonNegativeInt | RandomState | None) – Seed or numpy.random.RandomState object for reproducible pseudo-randomness.

Return type:

KNNClassifier

compute_distance_matrix(X, lengths=None, **kwargs)

Calculates a matrix of DTW distances between the sequences in X and the training sequences.

Parameters:
• X (Array) –

Univariate or multivariate observation sequence(s).

• Should be a single 1D or 2D array.

• Should have length as the 1st dimension and features as the 2nd dimension.

• Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

• lengths (Array | None) –

Lengths of the observation sequence(s) provided in X.

• If None, then X is assumed to be a single observation sequence.

• len(X) should be equal to sum(lengths).

• **kwargs

Model parameters to temporarily override (for experimentation purposes).

Note:

This method requires a trained classifier — see fit().

Returns:

DTW distance matrix.

Return type:

Array

dtw(A, B, **kwargs)

Calculates the DTW distance between two univariate or multivariate sequences.

Parameters:
• A (Array) – The first sequence.

• B (Array) – The second sequence.

• **kwargs

Model parameters to temporarily override (for experimentation purposes).

Returns:

DTW distance.

Return type:

float

fit(X, y, lengths=None)[source]

Fits the classifier to the sequence(s) in X.

Parameters:
• X (Array) –

Univariate or multivariate observation sequence(s).

• Should be a single 1D or 2D array.

• Should have length as the 1st dimension and features as the 2nd dimension.

• Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

• y (Array) – Classes corresponding to sequence(s) provided in X.

• lengths (Array | None) –

Lengths of the observation sequence(s) provided in X.

• If None, then X is assumed to be a single observation sequence.

• len(X) should be equal to sum(lengths).

Returns:

The fitted classifier.

Return type:

KNNClassifier

fit_predict(X, y, lengths=None)[source]

Fits the classifier to the sequence(s) in X and predicts classes for X.

Parameters:
• X (Array) –

Univariate or multivariate observation sequence(s).

• Should be a single 1D or 2D array.

• Should have length as the 1st dimension and features as the 2nd dimension.

• Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

• y (Array) – Classes corresponding to sequence(s) provided in X.

• lengths (Array | None) –

Lengths of the observation sequence(s) provided in X.

• If None, then X is assumed to be a single observation sequence.

• len(X) should be equal to sum(lengths).

Returns:

Class predictions.

Return type:

Array

Loads and deserializes a fitted KNN estimator.

Parameters:

path (str | Path | IO) – Location to load the serialized estimator from.

Returns:

Fitted KNN estimator.

save

Serializes and saves a fitted KNN estimator.

plot_dtw_histogram(X, lengths=None, ax=None, **kwargs)

Calculates DTW distances between X and training sequences, and plots a distance histogram.

Parameters:
• X (Array) –

Univariate or multivariate observation sequence(s).

• Should be a single 1D or 2D array.

• Should have length as the 1st dimension and features as the 2nd dimension.

• Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

• lengths (Array | None) –

Lengths of the observation sequence(s) provided in X.

• If None, then X is assumed to be a single observation sequence.

• len(X) should be equal to sum(lengths).

• ax (matplotlib.axes.Axes | None) – Plot axes. If None, new axes are created.

• **kwargs

Model parameters to temporarily override (for experimentation purposes).

Note:

This method requires a trained classifier — see fit().

Returns:

Plot axes.

Return type:

matplotlib.axes.Axes

plot_warping_path_1d(a, b, **kwargs)

Calculates the DTW matrix between two sequences and plots the optimal warping path.

Parameters:
• a (Array) – The first sequence.

• b (Array) – The second sequence.

• **kwargs

Model parameters to temporarily override (for experimentation purposes).

Note:

Only supports univariate sequences.

Returns:

Plot axes.

Return type:

matplotlib.axes.Axes

plot_weight_histogram(X, lengths=None, ax=None, **kwargs)

Calculates DTW weights between X and training sequences, and plots a weight histogram.

Parameters:
• X (Array) –

Univariate or multivariate observation sequence(s).

• Should be a single 1D or 2D array.

• Should have length as the 1st dimension and features as the 2nd dimension.

• Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

• lengths (Array | None) –

Lengths of the observation sequence(s) provided in X.

• If None, then X is assumed to be a single observation sequence.

• len(X) should be equal to sum(lengths).

• ax (matplotlib.axes.Axes | None) – Plot axes. If None, new axes are created.

• **kwargs

Model parameters to temporarily override (for experimentation purposes).

Note:

This method requires a trained classifier — see fit().

Returns:

Plot axes.

Return type:

matplotlib.axes.Axes

predict(X, lengths=None)[source]

Predicts classes for the sequence(s) in X.

Parameters:
• X (Array) –

Univariate or multivariate observation sequence(s).

• Should be a single 1D or 2D array.

• Should have length as the 1st dimension and features as the 2nd dimension.

• Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

• lengths (Array | None) –

Lengths of the observation sequence(s) provided in X.

• If None, then X is assumed to be a single observation sequence.

• len(X) should be equal to sum(lengths).

Note:

This method requires a trained classifier — see fit().

Returns:

Class predictions.

Return type:

Array

predict_proba(X, lengths=None)[source]

Predicts class probabilities for the sequence(s) in X.

Probabilities are calculated as normalized class scores.

Parameters:
• X (Array) –

Univariate or multivariate observation sequence(s).

• Should be a single 1D or 2D array.

• Should have length as the 1st dimension and features as the 2nd dimension.

• Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

• lengths (Array | None) –

Lengths of the observation sequence(s) provided in X.

• If None, then X is assumed to be a single observation sequence.

• len(X) should be equal to sum(lengths).

Note:

This method requires a trained classifier — see fit().

Returns:

Class membership probabilities.

Return type:

Array

predict_scores(X, lengths=None)[source]

Predicts class scores for the sequence(s) in X.

Scores are calculated as the class distance weighting sums of all training sequences in the k-neighborhood.

Parameters:
• X (Array) –

Univariate or multivariate observation sequence(s).

• Should be a single 1D or 2D array.

• Should have length as the 1st dimension and features as the 2nd dimension.

• Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

• lengths (Array | None) –

Lengths of the observation sequence(s) provided in X.

• If None, then X is assumed to be a single observation sequence.

• len(X) should be equal to sum(lengths).

Note:

This method requires a trained classifier — see fit().

Returns:

Class scores.

Return type:

Array

query_neighbors(X, lengths=None, sort=True, **kwargs)

Queries the k-nearest training observation sequences to each sequence in X.

Parameters:
• X (Array) –

Univariate or multivariate observation sequence(s).

• Should be a single 1D or 2D array.

• Should have length as the 1st dimension and features as the 2nd dimension.

• Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

• lengths (Array | None) –

Lengths of the observation sequence(s) provided in X.

• If None, then X is assumed to be a single observation sequence.

• len(X) should be equal to sum(lengths).

• sort (bool) – Whether to sort the neighbors in order of nearest to furthest.

• **kwargs

Model parameters to temporarily override (for experimentation purposes).

Returns:

K-nearest neighbors for each sequence in X.

• Indices of the k-nearest training sequences.

• DTW distances of the k-nearest training sequences.

• Corresponding outputs of the k-nearest training sequences.

Return type:

Tuple[Array, Array, Array]

save(path)

Serializes and saves a fitted KNN estimator.

Parameters:

path (str | Path | IO) – Location to save the serialized estimator.

Note:

This method requires a trained classifier — see fit().

load

Loads and deserializes a fitted KNN estimator.

score(X, y, lengths, normalize=True, sample_weight=None)[source]

Calculates accuracy for the sequence(s) in X.

Parameters:
• X (Array) –

Univariate or multivariate observation sequence(s).

• Should be a single 1D or 2D array.

• Should have length as the 1st dimension and features as the 2nd dimension.

• Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

• y (Array) – Classes corresponding to the observation sequence(s) in X.

• lengths (Array | None) –

Lengths of the observation sequence(s) provided in X.

• If None, then X is assumed to be a single observation sequence.

• len(X) should be equal to sum(lengths).

• normalize (bool) – See sklearn.metrics.accuracy_score().

• sample_weight (Array | None) – See sklearn.metrics.accuracy_score().

Note:

This method requires a trained classifier — see fit().

Returns:

Classification accuracy.

Return type:

float

classes

Set of possible class labels.

independent

Whether or not to allow features to be warped independently from each other.

k

Number of neighbors.

n_jobs

Maximum number of concurrently running workers.

random_state

Seed or numpy.random.RandomState object for reproducible pseudo-randomness.

use_c

Whether or not to use fast pure C compiled functions from dtaidistance to perform the DTW computations.

weighting

A callable that specifies how distance weighting should be performed.

window

The size of the Sakoe—Chiba band global constrant as a fraction of the length of the shortest of the two sequences being compared.

References