KNN Classifier

The KNN Classifier uses the \(k\)-NN algorithm with DTW as a distance measure to identify a \(k\)-neighborhood of the training sequences most similar to the sequence being classified.

To classify a sequence \(O'\), the KNNClassifier proceeds as follows (a minimal NumPy sketch of these steps is given after the list):

  1. Calculating the DTW distance between \(O'\) and every training sequence.
  2. Forming a k-neighborhood \(\mathcal{K}'=\left\{O^{(1)},\ldots,O^{(k)}\right\}\) of the \(k\) nearest training sequences to \(O'\).
  3. Calculating a distance weighting for each sequence in \(\mathcal{K}'\).
    A uniform weighting of 1 is used by default, meaning that all sequences in \(\mathcal{K}'\) have equal influence on the predicted class. However, a custom function such as \(e^{-x}\) (where \(x\) is the DTW distance) can be specified to give greater weight to training sequences that are more similar to \(O'\).
  4. Calculating a score for each of the unique classes corresponding to the sequences in \(\mathcal{K}'\).
    The score for each class is calculated as the sum of the distance weightings of all sequences in \(\mathcal{K}'\) belonging to that class.
  5. Selecting the class with the highest score.
    If multiple classes are tied for the highest score, one of the tied classes is selected at random.
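
The class scoring in steps 2–5 can be sketched with plain NumPy. This is an illustrative sketch only (not part of the Sequentia API); it assumes the DTW distances from step 1 have already been computed as a 1D array distances, with corresponding training labels labels.

import numpy as np

def knn_vote(distances, labels, k, weighting=None, random_state=None):
    # Step 2: indices of the k nearest training sequences
    neighbor_idx = np.argsort(distances)[:k]
    # Step 3: distance weightings (uniform by default)
    if weighting is None:
        weights = np.ones(k)
    else:
        weights = weighting(distances[neighbor_idx])
    # Step 4: sum the weightings of the neighbors belonging to each class
    classes = np.unique(labels[neighbor_idx])
    scores = np.array([weights[labels[neighbor_idx] == c].sum() for c in classes])
    # Step 5: pick the highest-scoring class, breaking ties at random
    rng = np.random.default_rng(random_state)
    tied = np.flatnonzero(scores == scores.max())
    return classes[rng.choice(tied)]

With weighting=lambda x: np.exp(-x), closer neighbors contribute more to their class score, as described in step 3.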

API reference

Class

KNNClassifier

A k-nearest neighbor classifier that uses DTW as a distance measure for sequence comparison.

Methods

__init__(*[, k, weighting, window, ...])

Initializes the KNNClassifier.

compute_distance_matrix(X, *[, lengths])

Calculate a matrix of DTW distances between the sequences in X and the training sequences.

dtw(A, B)

Calculate the DTW distance between two observation sequences.

fit(X, y, *[, lengths])

Fit the classifier to the sequence(s) in X.

fit_predict(X, y, *[, lengths])

Fit the model to the sequence(s) in X and predict outputs for X.

load(path, /)

Load and deserialize a fitted KNN estimator.

predict(X, *[, lengths])

Predict classes for the sequence(s) in X.

predict_log_proba(X, *[, lengths])

Predict log class probabilities for the sequence(s) in X.

predict_proba(X, *[, lengths])

Predict class probabilities for the sequence(s) in X.

predict_scores(X, *[, lengths])

Predict class scores for the sequence(s) in X.

query_neighbors(X, *[, lengths, sort])

Query the k-nearest training observation sequences to each sequence in X.

save(path, /)

Serialize and save a fitted KNN estimator.

score(X, y, *[, lengths, normalize, ...])

Calculate the predictive accuracy for the sequence(s) in X.


class sequentia.models.knn.classifier.KNNClassifier

A k-nearest neighbor classifier that uses DTW as a distance measure for sequence comparison.

The score for each class is computed as the sum of the distance weightings of all sequences belonging to that class within the DTW k-neighborhood of the sequence being classified.

Examples

Using a KNNClassifier to classify spoken digits.

import numpy as np
from sequentia.datasets import load_digits
from sequentia.models.knn import KNNClassifier

# Seed for reproducible pseudo-randomness
random_state = np.random.RandomState(1)

# Fetch MFCCs of spoken digits
data = load_digits()
train_data, test_data = data.split(test_size=0.2, random_state=random_state)

# Create a KNNClassifier with the default number of neighbors (k=1)
clf = KNNClassifier()

# Fit the classifier
clf.fit(train_data.X, train_data.y, lengths=train_data.lengths)

# Predict classes for the test observation sequences
y_pred = clf.predict(test_data.X, lengths=test_data.lengths)
__init__(*, k=1, weighting=None, window=1.0, independent=False, use_c=False, n_jobs=1, random_state=None, classes=None)

Initializes the KNNClassifier.

Parameters:
  • self (KNNClassifier) –

  • k (int) – Number of neighbors.

  • weighting (Callable[[ndarray[Any, dtype[float64]]], ndarray[Any, dtype[float64]]] | None) –

    A callable that specifies how distance weighting should be performed.

    The callable should accept a numpy.ndarray of DTW distances, apply an element-wise weighting transformation to the matrix of DTW distances, then return an equally-sized numpy.ndarray of weightings.

    If None, then a uniform weighting of 1 will be applied to all distances.

  • window (float) –

    The size of the Sakoe-Chiba band global constraint as a fraction of the length of the shorter of the two sequences being compared.

    • A larger window will give more freedom to the DTW alignment, allowing more deviation but leading to potentially slower computation. A window of 1 is equivalent to full DTW computation with no global constraint applied.

    • A smaller window will restrict the DTW alignment, and possibly speed up the DTW computation. A window of 0 is equivalent to Euclidean distance.

  • independent (bool) – Whether or not to allow features to be warped independently from each other. See [1] for an overview of independent and dependent dynamic time warping.

  • use_c (bool) – Whether or not to use fast pure C compiled functions from dtaidistance to perform the DTW computations.

  • n_jobs (int) –

    Maximum number of concurrently running workers.

    • If 1, no parallelism is used at all (useful for debugging).

    • If -1, all CPUs are used.

    • If < -1, (n_cpus + 1 + n_jobs) are used — e.g. n_jobs=-2 uses all but one.

  • random_state (int | RandomState | None) – Seed or numpy.random.RandomState object for reproducible pseudo-randomness.

  • classes (list[int] | None) –

    Set of possible class labels.

    • If not provided, these will be determined from the training data labels.

    • If provided, output from methods such as predict_proba() and predict_scores() will follow the ordering of the class labels provided here.

Return type:

KNNClassifier
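
For example, the \(e^{-x}\) weighting described earlier and a narrower window can be supplied at construction (the parameter values below are arbitrary choices for illustration):

import numpy as np
from sequentia.models.knn import KNNClassifier

clf = KNNClassifier(
    k=5,                             # use the 5 nearest training sequences
    weighting=lambda x: np.exp(-x),  # weight closer neighbors more heavily
    window=0.1,                      # Sakoe-Chiba band of 10% of the shorter sequence length
    n_jobs=-1,                       # use all CPUs for the DTW computations
)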

compute_distance_matrix(X, *, lengths=None)

Calculate a matrix of DTW distances between the sequences in X and the training sequences.

Parameters:
  • self (KNNMixin) –

  • X (ndarray[Any, dtype[float64]]) – Sequence(s).

  • lengths (ndarray[Any, dtype[int64]] | None) –

    Lengths of the sequence(s) provided in X.

    • If None, then X is assumed to be a single sequence.

    • len(X) should be equal to sum(lengths).

Returns:

DTW distance matrix.

Return type:

numpy.ndarray

Notes

This method requires a trained model — see fit().
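
Continuing from the fitted clf and test_data in the example above, an illustrative usage sketch:

# Matrix of DTW distances between the test sequences and the training sequences
distances = clf.compute_distance_matrix(test_data.X, lengths=test_data.lengths)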

dtw(A, B)

Calculate the DTW distance between two observation sequences.

Parameters:
  • self (KNNMixin) –

  • A (ndarray[Any, dtype[float64]]) – The first observation sequence.

  • B (ndarray[Any, dtype[float64]]) – The second observation sequence.

Returns:

DTW distance.

Return type:

numpy.ndarray

Notes

This method requires a trained model — see fit().
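
An illustrative sketch, continuing from the example above and relying on the convention that X stores concatenated sequences delimited by lengths:

# Extract the first two test observation sequences from the concatenated array
A = test_data.X[:test_data.lengths[0]]
B = test_data.X[test_data.lengths[0]:test_data.lengths[0] + test_data.lengths[1]]

# DTW distance between the two sequences
distance = clf.dtw(A, B)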

fit(X, y, *, lengths=None)

Fit the classifier to the sequence(s) in X.

Parameters:
  • self (KNNClassifier) –

  • X (ndarray[Any, dtype[float64]]) – Sequence(s).

  • y (ndarray[Any, dtype[int64]]) – Classes corresponding to sequence(s) in X.

  • lengths (ndarray[Any, dtype[int64]] | None) –

    Lengths of the sequence(s) provided in X.

    • If None, then X is assumed to be a single sequence.

    • len(X) should be equal to sum(lengths).

Returns:

The fitted classifier.

Return type:

KNNClassifier

fit_predict(X, y, *, lengths=None)

Fit the model to the sequence(s) in X and predict outputs for X.

Parameters:
  • self (ClassifierMixin) –

  • X (ndarray[Any, dtype[float64]] | ndarray[Any, dtype[int64]]) – Sequence(s).

  • y (ndarray[Any, dtype[int64]]) – Outputs corresponding to sequence(s) in X.

  • lengths (ndarray[Any, dtype[int64]] | None) –

    Lengths of the sequence(s) provided in X.

    • If None, then X is assumed to be a single sequence.

    • len(X) should be equal to sum(lengths).

Returns:

Output predictions.

Return type:

numpy.ndarray
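
As a usage sketch (assuming train_data from the example above), fitting and predicting on the training sequences can be combined:

# Fit to the training sequences and predict their classes in one call
y_train_pred = clf.fit_predict(train_data.X, train_data.y, lengths=train_data.lengths)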

classmethod load(path, /)

Load and deserialize a fitted KNN estimator.

Parameters:
  • cls (type[KNNMixin]) –

  • path (str | Path | IO) – Location to load the serialized estimator from.

Returns:

Fitted KNN estimator.

Return type:

KNNMixin

See also

save

Serialize and save a fitted KNN estimator.

predict(X, *, lengths=None)

Predict classes for the sequence(s) in X.

Parameters:
  • self (KNNClassifier) –

  • X (ndarray[Any, dtype[float64]]) – Sequence(s).

  • lengths (ndarray[Any, dtype[int64]] | None) –

    Lengths of the sequence(s) provided in X.

    • If None, then X is assumed to be a single sequence.

    • len(X) should be equal to sum(lengths).

Returns:

Class predictions.

Return type:

numpy.ndarray

Notes

This method requires a trained classifier — see fit().

predict_log_proba(X, *, lengths=None)

Predict log class probabilities for the sequence(s) in X.

Probabilities are calculated as normalized class scores.

Parameters:
  • self (KNNClassifier) –

  • X (ndarray[Any, dtype[float64]]) – Sequence(s).

  • lengths (ndarray[Any, dtype[int64]] | None) –

    Lengths of the sequence(s) provided in X.

    • If None, then X is assumed to be a single sequence.

    • len(X) should be equal to sum(lengths).

Returns:

Class membership log-probabilities.

Return type:

numpy.ndarray

Notes

This method requires a trained classifier — see fit().

predict_proba(X, *, lengths=None)

Predict class probabilities for the sequence(s) in X.

Probabilities are calculated as normalized class scores.

Parameters:
  • self (KNNClassifier) –

  • X (ndarray[Any, dtype[float64]]) – Sequence(s).

  • lengths (ndarray[Any, dtype[int64]] | None) –

    Lengths of the sequence(s) provided in X.

    • If None, then X is assumed to be a single sequence.

    • len(X) should be equal to sum(lengths).

Returns:

Class membership probabilities.

Return type:

numpy.ndarray

Notes

This method requires a trained classifier — see fit().
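
Continuing from the fitted clf in the example above (illustrative sketch):

# Class membership probabilities for each test sequence,
# obtained by normalizing the class scores
proba = clf.predict_proba(test_data.X, lengths=test_data.lengths)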

predict_scores(X, *, lengths=None)

Predict class scores for the sequence(s) in X.

Scores are calculated as the per-class sums of distance weightings over all training sequences in the k-neighborhood.

Parameters:
  • self (KNNClassifier) –

  • X (ndarray[Any, dtype[float64]]) – Sequence(s).

  • lengths (ndarray[Any, dtype[int64]] | None) –

    Lengths of the sequence(s) provided in X.

    • If None, then X is assumed to be a single sequence.

    • len(X) should be equal to sum(lengths).

Returns:

Class scores.

Return type:

numpy.ndarray

Notes

This method requires a trained classifier — see fit().
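
Similarly, an illustrative sketch continuing from the fitted clf above:

# Unnormalized class scores: per-class sums of distance weightings
# within each test sequence's k-neighborhood
scores = clf.predict_scores(test_data.X, lengths=test_data.lengths)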

query_neighbors(X, *, lengths=None, sort=True)

Query the k-nearest training observation sequences to each sequence in X.

Parameters:
  • self (KNNMixin) –

  • X (ndarray[Any, dtype[float64]]) – Sequence(s).

  • lengths (ndarray[Any, dtype[int64]] | None) –

    Lengths of the sequence(s) provided in X.

    • If None, then X is assumed to be a single sequence.

    • len(X) should be equal to sum(lengths).

  • sort (bool) – Whether to sort the neighbors in order of nearest to furthest.

Returns:

K-nearest neighbors for each sequence in X.

  • Indices of the k-nearest training sequences.

  • DTW distances of the k-nearest training sequences.

  • Corresponding outputs of the k-nearest training sequences.

Return type:

tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

Notes

This method requires a trained model — see fit().
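
An illustrative usage sketch, again continuing from the fitted clf above:

# Indices, DTW distances and classes of the k nearest training sequences
# for each test sequence, sorted from nearest to furthest
indices, distances, outputs = clf.query_neighbors(
    test_data.X, lengths=test_data.lengths, sort=True
)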

save(path, /)

Serialize and save a fitted KNN estimator.

Parameters:
  • self (KNNMixin) –

  • path (str | Path | IO) – Location to save the serialized estimator.

Return type:

None

Notes

This method requires a trained model — see fit().

See also

load

Load and deserialize a fitted KNN estimator.
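
A round-trip sketch continuing from the fitted clf above (the file name is an arbitrary choice):

# Serialize the fitted classifier to disk, then restore it later
clf.save("knn_digits.model")
clf = KNNClassifier.load("knn_digits.model")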

score(X, y, *, lengths=None, normalize=True, sample_weight=None)

Calculate the predictive accuracy for the sequence(s) in X.

Parameters:
  • self (ClassifierMixin) –

  • X (ndarray[Any, dtype[float64]]) – Sequence(s).

  • y (ndarray[Any, dtype[int64]]) – Classes corresponding to sequence(s) in X.

  • lengths (ndarray[Any, dtype[int64]] | None) –

    Lengths of the sequence(s) provided in X.

    • If None, then X is assumed to be a single sequence.

    • len(X) should be equal to sum(lengths).

  • normalize (bool) – Whether to return the fraction of correctly classified sequences rather than the count.

  • sample_weight (ndarray[Any, dtype[float64]] | None) – Weights for the sequence(s) in X.

Returns:

Predictive accuracy.

Return type:

float

Notes

This method requires a trained classifier — see fit().
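
For example (illustrative sketch, continuing from the fitted clf above):

# Predictive accuracy over the test sequences
accuracy = clf.score(test_data.X, test_data.y, lengths=test_data.lengths)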

classes: list[int] | None

Set of possible class labels.

independent: bool

Whether or not to allow features to be warped independently from each other.

k: int

Number of neighbors.

n_jobs: int

Maximum number of concurrently running workers.

random_state: int | RandomState | None

Seed or numpy.random.RandomState object for reproducible pseudo-randomness.

use_c: bool

Whether or not to use fast pure C compiled functions from dtaidistance to perform the DTW computations.

weighting: Callable[[ndarray], ndarray] | None

A callable that specifies how distance weighting should be performed.

window: float

The size of the Sakoe-Chiba band global constraint as a fraction of the length of the shorter of the two sequences being compared.

References