Cross-validation splitting methods
During cross-validation, a dataset is divided into splits for training and validation.
This can be done either with a single basic split, or via successive folds that re-use parts of the dataset across different splits.
sklearn.model_selection provides such cross-validation splitting methods, but does not support sequence data.
Sequentia provides modified versions of these methods that support sequence data.
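To illustrate why sequence data needs separate handling, the sketch below assumes that sequences are stored concatenated row-wise alongside a lengths array, as the X and lengths arguments in the example on this page suggest. Under that assumption a split has to be chosen over whole sequences and only then expanded to rows. The helper names sequence_bounds and rows_for are hypothetical, and this is not Sequentia's implementation.

```python
from itertools import accumulate


def sequence_bounds(lengths):
    """Return (start, end) row offsets for each sequence in a concatenated array."""
    ends = list(accumulate(lengths))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def rows_for(sequence_indices, lengths):
    """Expand a list of sequence indices into the row indices they cover."""
    bounds = sequence_bounds(lengths)
    rows = []
    for i in sequence_indices:
        start, end = bounds[i]
        rows.extend(range(start, end))
    return rows


# Three sequences of 3, 2 and 4 frames, stored as 9 concatenated rows
lengths = [3, 2, 4]
bounds = sequence_bounds(lengths)

# A split chosen over sequences (train on 0 and 2, validate on 1),
# expanded afterwards so that no sequence is cut in half
train_rows = rows_for([0, 2], lengths)
valid_rows = rows_for([1], lengths)
```

Splitting the 9 rows directly could place frames of the same sequence on both sides of the split, which is what splitting over sequence indices avoids.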
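API reference
Classes
KFold | K-Fold cross-validator.
StratifiedKFold | Stratified K-Fold cross-validator.
ShuffleSplit | Random permutation cross-validator.
StratifiedShuffleSplit | Stratified ShuffleSplit cross-validator.
RepeatedKFold | Repeated KFold cross-validator.
RepeatedStratifiedKFold | Repeated StratifiedKFold cross-validator.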
Example
Using GridSearchCV with StratifiedKFold to cross-validate a KNNClassifier training pipeline.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import minmax_scale

from sequentia.datasets import load_digits
from sequentia.models import KNNClassifier
from sequentia.preprocessing import IndependentFunctionTransformer
from sequentia.model_selection import StratifiedKFold, GridSearchCV

EPS: np.float32 = np.finfo(np.float32).eps

# Define model and hyper-parameter search space
search = GridSearchCV(
    # Create a basic pipeline with a KNNClassifier to be optimized
    estimator=Pipeline(
        [
            ("scale", IndependentFunctionTransformer(minmax_scale)),
            ("clf", KNNClassifier(use_c=True, n_jobs=-1))
        ]
    ),
    # Optimize over k, weighting function and window size
    param_grid={
        "clf__k": [1, 2, 3, 4, 5],
        "clf__weighting": [
            None, lambda x: 1 / (x + EPS), lambda x: np.exp(-x)
        ],
        "clf__window": [1.0, 0.75, 0.5, 0.25, 0.1],
    },
    # Use StratifiedKFold cross-validation
    cv=StratifiedKFold(),
    n_jobs=-1,
)

# Load the spoken digit dataset with a train/test set split
data = load_digits()
train_data, test_data = data.split(test_size=0.2, stratify=True)

# Perform cross-validation over accuracy and retrieve the best model
search.fit(train_data.X, train_data.y, lengths=train_data.lengths)
clf = search.best_estimator_

# Calculate accuracy on the test set split
acc = clf.score(test_data.X, test_data.y, lengths=test_data.lengths)
Definitions
- class sequentia.model_selection.KFold
K-Fold cross-validator.
Provides train/test indices to split data into train/test sets. Splits the dataset into k consecutive folds (without shuffling by default).
Each fold is then used once as a validation set while the k - 1 remaining folds form the training set.
See also
sklearn.model_selection.KFold: KFold is a modified version of this class that supports sequences.
- __init__(n_splits=5, *, shuffle=False, random_state=None)
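The consecutive-fold scheme described above can be sketched in plain Python. This is an illustrative reimplementation rather than the code of either library: the function name k_fold_indices is hypothetical, and each index stands for one sample (for sequence data, presumably one whole sequence). It follows the convention documented for sklearn.model_selection.KFold, in which the first n_samples % n_splits folds receive one extra sample.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train, validation) index lists for consecutive, unshuffled folds."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    # The first n_samples % n_splits folds receive one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


# 7 samples in 3 folds: validation folds of sizes 3, 2 and 2
folds = list(k_fold_indices(n_samples=7, n_splits=3))
```

Every sample appears in exactly one validation fold, and in the training set of the other k - 1 splits.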
- class sequentia.model_selection.StratifiedKFold
Stratified K-Fold cross-validator.
Provides train/test indices to split data into train/test sets.
This cross-validation object is a variation of KFold that returns stratified folds.
The folds are made by preserving the percentage of samples for each class.
See also
sklearn.model_selection.StratifiedKFold: StratifiedKFold is a modified version of this class that supports sequences.
- __init__(n_splits=5, *, shuffle=False, random_state=None)
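A simplified way to see what preserving the percentage of samples for each class means is to deal the samples of every class round-robin across the folds. The sketch below does exactly that; it is a deliberately simple scheme with a hypothetical function name, not the allocation algorithm used by scikit-learn or Sequentia.

```python
from collections import Counter, defaultdict


def stratified_k_fold_indices(labels, n_splits=5):
    """Yield (train, validation) index lists with class proportions roughly preserved."""
    # Group sample indices by class label, keeping their original order
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    # Deal each class's samples round-robin across the folds
    fold_members = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            fold_members[position % n_splits].append(index)
    for validation in fold_members:
        held_out = set(validation)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, sorted(validation)


# Six samples of class "a" and three of class "b", split into 3 folds
labels = ["a"] * 6 + ["b"] * 3
folds = list(stratified_k_fold_indices(labels, n_splits=3))
class_counts = [Counter(labels[i] for i in validation) for _, validation in folds]
```

Each validation fold here contains two "a" samples and one "b" sample, matching the 2:1 class ratio of the full label list.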
- class sequentia.model_selection.ShuffleSplit
Random permutation cross-validator.
Yields indices to split data into training and test sets.
Note: contrary to other cross-validation strategies, random splits do not guarantee that test sets across all folds will be mutually exclusive, and might include overlapping samples. However, the splits themselves are still very likely to differ from one another for sizeable datasets.
See also
sklearn.model_selection.ShuffleSplit: ShuffleSplit is a modified version of this class that supports sequences.
- __init__(n_splits=10, *, test_size=None, train_size=None, random_state=None)
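The note about overlapping test sets can be checked with a small stdlib sketch. It draws an independent random permutation for every split and takes the first test_size indices as the test set. The function name is hypothetical, test_size is taken as an absolute sample count for simplicity, and Python's random module is used in place of whatever random number generator the libraries rely on.

```python
import random
from collections import Counter


def shuffle_split_indices(n_samples, n_splits=10, test_size=1, seed=None):
    """Yield (train, test) index lists drawn from independent random permutations."""
    rng = random.Random(seed)
    for _ in range(n_splits):
        permutation = list(range(n_samples))
        rng.shuffle(permutation)
        # The first test_size shuffled indices form the test set, the rest the training set
        yield permutation[test_size:], permutation[:test_size]


# 5 splits with 3 test samples each, drawn from only 10 samples
splits = list(shuffle_split_indices(n_samples=10, n_splits=5, test_size=3, seed=0))

# Count how often each sample lands in a test set across all splits
test_counts = Counter(i for _, test in splits for i in test)
```

With 15 test slots spread over 10 samples, at least one sample must appear in more than one test set, whereas within a single split the training and test indices never overlap.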
- class sequentia.model_selection.StratifiedShuffleSplit
Stratified ShuffleSplit cross-validator.
Provides train/test indices to split data into train/test sets.
This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.
See also
sklearn.model_selection.StratifiedShuffleSplit: StratifiedShuffleSplit is a modified version of this class that supports sequences.
- __init__(n_splits=10, *, test_size=None, train_size=None, random_state=None)
- class sequentia.model_selection.RepeatedKFold
Repeated KFold cross-validator.
Repeats KFold n times with different randomization in each repetition.
See also
sklearn.model_selection.RepeatedKFold: RepeatedKFold is a modified version of this class that supports sequences.
- __init__(*, n_splits=5, n_repeats=10, random_state=None)
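As a rough sketch of the repetition described above, the following stdlib code reshuffles the sample order before each repetition and then cuts it into k folds, so that n_splits * n_repeats splits are produced in total. The function name is hypothetical and the shuffling uses Python's random module, so the exact folds will not match those produced by scikit-learn or Sequentia for the same seed.

```python
import random


def repeated_k_fold_indices(n_samples, n_splits=5, n_repeats=10, seed=None):
    """Yield (train, validation) index lists for n_repeats shuffled k-fold passes."""
    rng = random.Random(seed)
    base, extra = divmod(n_samples, n_splits)
    for _ in range(n_repeats):
        # A fresh random ordering for every repetition
        order = list(range(n_samples))
        rng.shuffle(order)
        start = 0
        for fold in range(n_splits):
            stop = start + base + (1 if fold < extra else 0)
            validation = sorted(order[start:stop])
            train = sorted(order[:start] + order[stop:])
            yield train, validation
            start = stop


# 3 folds repeated twice over 6 samples gives 6 splits in total
splits = list(repeated_k_fold_indices(n_samples=6, n_splits=3, n_repeats=2, seed=0))
```

Within each repetition the validation folds still partition the dataset; only the assignment of samples to folds changes between repetitions.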
- class sequentia.model_selection.RepeatedStratifiedKFold
Repeated StratifiedKFold cross-validator.
Repeats StratifiedKFold n times with different randomization in each repetition.
See also
sklearn.model_selection.RepeatedStratifiedKFold: RepeatedStratifiedKFold is a modified version of this class that supports sequences.
- __init__(*, n_splits=5, n_repeats=10, random_state=None)