Cross-validation splitting methods

During cross-validation, a dataset is divided into splits for training and validation.

This can be done either with a single basic split, or with successive folds that re-use parts of the dataset for different splits.

sklearn.model_selection provides such cross-validation splitting methods, but does not support sequence data. Sequentia provides modified versions of these methods to support sequence data.
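
To illustrate why sequence data needs special handling, the following is a minimal NumPy sketch. It assumes, as the `lengths` argument in the example further down suggests, that sequences are stored concatenated in one array alongside their lengths. The helper `sequence_rows` is hypothetical and not part of Sequentia; it only shows that a split has to select whole sequences rather than individual rows.

```python
import numpy as np

# Three sequences of lengths 2, 3 and 1, concatenated row-wise into one array
X = np.arange(12).reshape(6, 2)
lengths = np.array([2, 3, 1])


def sequence_rows(lengths, seq_idxs):
    """Map sequence indices to row indices of the concatenated array.

    Hypothetical helper for illustration only.
    """
    ends = np.cumsum(lengths)
    starts = ends - lengths
    return np.concatenate([np.arange(starts[i], ends[i]) for i in seq_idxs])


# A split is expressed over sequences (train on 0 and 2, validate on 1) ...
train_seqs, val_seqs = [0, 2], [1]

# ... and then expanded to rows, so that no sequence is cut in half
train_rows = sequence_rows(lengths, train_seqs)
val_rows = sequence_rows(lengths, val_seqs)
X_train, X_val = X[train_rows], X[val_rows]
```

A row-level splitter would be free to place rows 2 and 3 in different sets even though they belong to the same sequence, which is the problem the modified classes address.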

API reference

Classes

KFold

K-Fold cross-validator.

StratifiedKFold

Stratified K-Fold cross-validator.

ShuffleSplit

Random permutation cross-validator.

StratifiedShuffleSplit

Stratified ShuffleSplit cross-validator.

RepeatedKFold

Repeated K-Fold cross-validator.

RepeatedStratifiedKFold

Repeated Stratified K-Fold cross-validator.

Example

Using GridSearchCV with StratifiedKFold to cross-validate a KNNClassifier training pipeline.

import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import minmax_scale

from sequentia.datasets import load_digits
from sequentia.models import KNNClassifier
from sequentia.preprocessing import IndependentFunctionTransformer
from sequentia.model_selection import StratifiedKFold, GridSearchCV

EPS: np.float32 = np.finfo(np.float32).eps

# Define model and hyper-parameter search space
search = GridSearchCV(
    # Create a basic pipeline with a KNNClassifier to be optimized
    estimator=Pipeline(
        [
            ("scale", IndependentFunctionTransformer(minmax_scale)),
            ("clf", KNNClassifier(use_c=True, n_jobs=-1))
        ]
    ),
    # Optimize over k, weighting function and window size
    param_grid={
        "clf__k": [1, 2, 3, 4, 5],
        "clf__weighting": [
            None, lambda x: 1 / (x + EPS), lambda x: np.exp(-x)
        ],
        "clf__window": [1.0, 0.75, 0.5, 0.25, 0.1],
    },
    # Use StratifiedKFold cross-validation
    cv=StratifiedKFold(),
    n_jobs=-1,
)

# Load the spoken digit dataset with a train/test set split
data = load_digits()
train_data, test_data = data.split(test_size=0.2, stratify=True)

# Perform cross-validation over accuracy and retrieve the best model
search.fit(train_data.X, train_data.y, lengths=train_data.lengths)
clf = search.best_estimator_

# Calculate accuracy on the test set split
acc = clf.score(test_data.X, test_data.y, lengths=test_data.lengths)

Definitions

class sequentia.model_selection.KFold

K-Fold cross-validator.

Provides train/test indices to split data into train/test sets. Splits the dataset into k consecutive folds (without shuffling by default).

Each fold is then used once as a validation set while the k - 1 remaining folds form the training set.
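
The fold logic described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation under simplifying assumptions (no shuffling, a hypothetical function name `k_fold_indices`), not the code used by Sequentia or scikit-learn. In the sequence setting, each index would refer to a whole sequence.

```python
import numpy as np


def k_fold_indices(n_samples, n_splits=5):
    """Yield (train, validation) index arrays for consecutive, unshuffled folds."""
    indices = np.arange(n_samples)
    # np.array_split spreads any remainder over the first folds
    folds = np.array_split(indices, n_splits)
    for i, val_idx in enumerate(folds):
        # The k - 1 remaining folds form the training set
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, val_idx


splits = list(k_fold_indices(n_samples=10, n_splits=5))
```

Every index appears in exactly one validation fold, so each sample is validated on once and trained on k - 1 times.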

See also

sklearn.model_selection.KFold

KFold is a modified version of this class that supports sequences.

__init__(n_splits=5, *, shuffle=False, random_state=None)

class sequentia.model_selection.StratifiedKFold

Stratified K-Fold cross-validator.

Provides train/test indices to split data into train/test sets.

This cross-validation object is a variation of KFold that returns stratified folds.

The folds are made by preserving the percentage of samples for each class.
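
As a rough illustration of what preserving the class percentages means, the sketch below deals the samples of each class out to folds in round-robin order. This is a simplified stand-in with a hypothetical function name, not the allocation algorithm that scikit-learn or Sequentia actually uses.

```python
from collections import Counter, defaultdict


def stratified_fold_assignment(y, n_splits=5):
    """Assign each sample to a fold, dealing out each class round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    fold_of = [None] * len(y)
    for idxs in by_class.values():
        for position, idx in enumerate(idxs):
            fold_of[idx] = position % n_splits
    return fold_of


# Eight samples of class "a" and four of class "b" (a 2:1 ratio)
y = ["a"] * 8 + ["b"] * 4
fold_of = stratified_fold_assignment(y, n_splits=4)

# Class counts inside each fold
fold_counts = [
    Counter(y[i] for i in range(len(y)) if fold_of[i] == f) for f in range(4)
]
```

Each of the four folds ends up with two "a" samples and one "b" sample, matching the 2:1 ratio of the full label set.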

See also

sklearn.model_selection.StratifiedKFold

StratifiedKFold is a modified version of this class that supports sequences.

__init__(n_splits=5, *, shuffle=False, random_state=None)

class sequentia.model_selection.StratifiedKFold

Random permutation cross-validator.

Yields indices to split data into training and test sets.

Note: contrary to other cross-validation strategies, random splits do not guarantee that the test sets across all folds are mutually exclusive, so the same sample may appear in more than one test set. The splits themselves are nevertheless very likely to differ from one another for sizeable datasets.
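
The overlap mentioned in the note can be seen in a small standard-library sketch. The function name, the default `test_size` of 0.1 and the use of `random.Random` are assumptions made for illustration; this is not the library's implementation.

```python
import random


def shuffle_split_indices(n_samples, n_splits=10, test_size=0.1, seed=0):
    """Yield (train, test) index lists from independent random permutations."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_size))
    for _ in range(n_splits):
        perm = list(range(n_samples))
        rng.shuffle(perm)  # every split draws a fresh permutation
        yield perm[n_test:], perm[:n_test]


splits = list(shuffle_split_indices(n_samples=20, n_splits=10, test_size=0.25))
test_sets = [set(test) for _, test in splits]

# Count how often each sample was drawn into a test set
times_tested = [sum(i in t for t in test_sets) for i in range(20)]
```

With 10 splits of 5 test samples each drawn from only 20 samples, some samples must land in several test sets, while others may never be tested at all. K-Fold, by contrast, tests every sample exactly once.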

See also

sklearn.model_selection.ShuffleSplit

ShuffleSplit is a modified version of this class that supports sequences.

__init__(n_splits=10, *, test_size=None, train_size=None, random_state=None)

class sequentia.model_selection.StratifiedShuffleSplit

Stratified ShuffleSplit cross-validator.

Provides train/test indices to split data into train/test sets.

This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.
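
One way to picture a single stratified randomized split is to shuffle each class separately and take the same fraction of every class for the test set. The sketch below does that using only the standard library. The function name and rounding rule are illustrative assumptions and do not reflect the library's internals.

```python
import random
from collections import defaultdict


def stratified_shuffle_split(y, test_size=0.25, seed=0):
    """Return one (train, test) split that keeps class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    train, test = [], []
    for idxs in by_class.values():
        idxs = idxs[:]  # copy so the grouping itself is left untouched
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


# Twelve samples of class 0 and four of class 1 (a 3:1 ratio)
y = [0] * 12 + [1] * 4
train, test = stratified_shuffle_split(y, test_size=0.25)
```

The test set here contains three samples of class 0 and one of class 1, so both sets keep the 3:1 ratio of the full dataset. Which samples are chosen depends on the seed.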

See also

sklearn.model_selection.StratifiedShuffleSplit

StratifiedShuffleSplit is a modified version of this class that supports sequences.

__init__(n_splits=10, *, test_size=None, train_size=None, random_state=None)

class sequentia.model_selection.RepeatedKFold

Repeated K-Fold cross-validator.

Repeats KFold n_repeats times with different randomization in each repetition, yielding n_splits * n_repeats splits in total.
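
The repetition scheme can be sketched as an outer loop that reshuffles the indices before each round of K-Fold. This is an illustrative standard-library version with a hypothetical function name. It uses strided folds for brevity, whereas KFold itself uses consecutive ones.

```python
import random


def repeated_k_fold(n_samples, n_splits=5, n_repeats=10, seed=0):
    """Yield (train, validation) index lists for repeated, reshuffled K-Fold."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        perm = list(range(n_samples))
        rng.shuffle(perm)  # new randomization for every repetition
        for f in range(n_splits):
            val = perm[f::n_splits]  # strided fold over the shuffled order
            val_set = set(val)
            train = [i for i in perm if i not in val_set]
            yield train, val


splits = list(repeated_k_fold(n_samples=10, n_splits=5, n_repeats=3))
```

Three repetitions of five folds give 15 splits. Within each repetition, every sample is used for validation exactly once.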

See also

sklearn.model_selection.RepeatedKFold

RepeatedKFold is a modified version of this class that supports sequences.

__init__(*, n_splits=5, n_repeats=10, random_state=None)

class sequentia.model_selection.RepeatedStratifiedKFold

Repeated Stratified K-Fold cross-validator.

Repeats StratifiedKFold n_repeats times with different randomization in each repetition.

See also

sklearn.model_selection.RepeatedStratifiedKFold

RepeatedStratifiedKFold is a modified version of this class that supports sequences.

__init__(*, n_splits=5, n_repeats=10, random_state=None)