Datasets


Sequentia provides a selection of sample sequential datasets for quick experimentation.

Each dataset follows the interface described below.

API reference

Class

SequentialDataset

Utility wrapper for a generic sequential dataset.

Methods

__init__(X[, y, lengths, classes])

Initialize a SequentialDataset.

copy()

Create a copy of the dataset.

iter_by_class()

Subset the observation sequences by class.

load(path, /)

Load a stored dataset in .npz format.

save(path, /, *[, compress])

Store the dataset in .npz format.

split(*[, test_size, train_size, ...])

Split the dataset into two partitions (train/test).

Properties

X

Observation sequences.

X_lengths

Observation sequences and corresponding lengths.

X_y

Observation sequences and corresponding outputs.

X_y_lengths

Observation sequences and corresponding outputs and lengths.

classes

Set of unique classes in y.

idxs

Observation sequence start and end indices.

lengths

Lengths corresponding to X.

y

Outputs corresponding to X.


class sequentia.datasets.base.SequentialDataset

Utility wrapper for a generic sequential dataset.

__init__(X, y=None, *, lengths=None, classes=None)

Initialize a SequentialDataset.

Parameters:
  • X (ndarray[Any, dtype[float64]] | ndarray[Any, dtype[int64]]) – Sequence(s).

  • y (ndarray[Any, dtype[float64]] | ndarray[Any, dtype[int64]] | None) – Outputs corresponding to sequence(s) in X.

  • lengths (ndarray[Any, dtype[int64]] | None) –

    Lengths of the sequence(s) provided in X.

    • If None, then X is assumed to be a single sequence.

    • len(X) should be equal to sum(lengths).

  • classes (list[int] | None) –

    Set of possible class labels (only if y was provided with categorical values).

    If not provided, these will be determined from the labels in y.

Return type:

SequentialDataset
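The layout described for X and lengths can be illustrated with a minimal numpy-only sketch. This is not Sequentia code; the sequences and variable names are made up for illustration. It concatenates variable-length sequences into one array and checks the stated condition len(X) == sum(lengths).

```python
import numpy as np

# Three hypothetical univariate sequences with 3, 2 and 4 observations
# (each row is one observation, each column one feature).
sequences = [
    np.array([[0.1], [0.2], [0.3]]),
    np.array([[1.0], [1.1]]),
    np.array([[2.0], [2.1], [2.2], [2.3]]),
]

# Stack all observations into one array and record each sequence's length.
X = np.concatenate(sequences)
lengths = np.array([len(seq) for seq in sequences])

# The condition stated for the lengths parameter.
assert len(X) == lengths.sum()
```

Arrays shaped like these are the kind of input that the X and lengths parameters describe.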

copy()

Create a copy of the dataset.

Returns:

Dataset copy.

Return type:

SequentialDataset

iter_by_class()

Subset the observation sequences by class.

Returns:

Generator iterating over classes, yielding:

  • X subset of sequences belonging to the class.

  • Lengths corresponding to the X subset.

  • Class used to subset X.

Return type:

Generator[tuple[numpy.ndarray, numpy.ndarray, int]]

Raises:

AttributeError – If y was not provided to __init__().
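The yielded triples can be pictured with a rough numpy-only sketch. It is not the library's implementation: the helper name iter_by_class_sketch is hypothetical, and it assumes that y holds one integer label per sequence and that X stores the sequences concatenated in order.

```python
import numpy as np

def iter_by_class_sketch(X, y, lengths):
    """Yield (X subset, lengths subset, class) for each unique label in y."""
    # Start offset of each sequence within the concatenated array.
    starts = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    for c in np.unique(y):
        # Indices of the sequences labelled with class c.
        seq_idxs = np.flatnonzero(y == c)
        X_c = np.concatenate(
            [X[starts[i]:starts[i] + lengths[i]] for i in seq_idxs]
        )
        yield X_c, lengths[seq_idxs], int(c)

# Made-up data: three sequences of lengths 3, 2 and 4 with labels 0, 1, 0.
X = np.arange(9, dtype=float).reshape(-1, 1)
lengths = np.array([3, 2, 4])
y = np.array([0, 1, 0])

subsets = list(iter_by_class_sketch(X, y, lengths))
```

Each element of subsets is an (X subset, lengths, class) triple, mirroring the three items listed under Returns.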

classmethod load(path, /)

Load a stored dataset in .npz format.

See numpy.load().

Parameters:

path (str | Path | IO) – Location of the stored dataset to load.

Returns:

The loaded dataset.

Return type:

SequentialDataset

See also

save

Stores the dataset in .npz format.

save(path, /, *, compress=True)

Store the dataset in .npz format.

See numpy.savez() and numpy.savez_compressed().

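Parameters:
  • path (str | Path | IO) – Location to store the dataset.

  • compress (bool) – Whether to compress the stored dataset.

Return type: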

None

See also

load

Loads a stored dataset in .npz format.
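Since save and load refer to numpy.savez(), numpy.savez_compressed() and numpy.load(), the sketch below shows a plain-numpy .npz round trip. The archive keys "X", "y" and "lengths" are assumptions made for illustration and may not match the keys Sequentia writes. An in-memory buffer stands in for a file path.

```python
import io

import numpy as np

X = np.arange(9, dtype=float).reshape(-1, 1)
lengths = np.array([3, 2, 4])
y = np.array([0, 1, 0])

# An in-memory buffer is used here in place of a path on disk.
buffer = io.BytesIO()

# Compressed variant; np.savez takes the same arguments without compression.
np.savez_compressed(buffer, X=X, y=y, lengths=lengths)

# Rewind and read the arrays back by key.
buffer.seek(0)
with np.load(buffer) as stored:
    X_loaded = stored["X"]
    y_loaded = stored["y"]
    lengths_loaded = stored["lengths"]
```

The compress flag of save presumably selects between the two numpy writers named above.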

split(*, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=False)

Split the dataset into two partitions (train/test).

See sklearn.model_selection.train_test_split().

Parameters:
  • test_size (int | float | None) – Size of the test partition.

  • train_size (int | float | None) – Size of the training partition.

  • random_state (int | RandomState | None) – Seed or numpy.random.RandomState object for reproducible pseudo-randomness.

  • shuffle (bool) – Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be False.

  • stratify (bool) – Whether or not to stratify the partitions by class label.

Returns:

Dataset partitions.

Return type:

tuple[SequentialDataset, SequentialDataset]
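The following numpy-only sketch illustrates the idea of partitioning at the level of whole sequences rather than individual observations. It is an illustration under stated assumptions, not the library's method (which, per the reference above, relates to sklearn.model_selection.train_test_split()). The helper name split_sequences_sketch is hypothetical, test_size is treated as a fraction only, and stratification is not handled.

```python
import numpy as np

def split_sequences_sketch(X, y, lengths, test_size, random_state=None):
    """Split whole sequences (not single observations) into two partitions."""
    rng = np.random.RandomState(random_state)
    n_seqs = len(lengths)
    order = rng.permutation(n_seqs)  # shuffled sequence indices
    n_test = int(round(test_size * n_seqs))
    test_idxs = np.sort(order[:n_test])
    train_idxs = np.sort(order[n_test:])

    # Start offset of each sequence within the concatenated array.
    starts = np.concatenate(([0], np.cumsum(lengths)[:-1]))

    def take(seq_idxs):
        # Gather the observations, labels and lengths of the chosen sequences.
        X_part = np.concatenate(
            [X[starts[i]:starts[i] + lengths[i]] for i in seq_idxs]
        )
        return X_part, y[seq_idxs], lengths[seq_idxs]

    return take(train_idxs), take(test_idxs)

# Made-up data: four sequences of lengths 3, 2, 4 and 1.
X = np.arange(10, dtype=float).reshape(-1, 1)
lengths = np.array([3, 2, 4, 1])
y = np.array([0, 1, 0, 1])

train, test = split_sequences_sketch(
    X, y, lengths, test_size=0.5, random_state=0
)
```

Each partition is an (X, y, lengths) triple whose observation count equals the sum of its lengths, so no sequence is cut in two.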

property X: ndarray[Any, dtype[float64]] | ndarray[Any, dtype[int64]]

Observation sequences.

Returns:

Observation sequences.

Return type:

numpy.ndarray

property X_lengths: dict[str, ndarray[Any, dtype[float64]] | ndarray[Any, dtype[int64]]]

Observation sequences and corresponding lengths.

Returns:

Mapping with keys:

  • "X" for observation sequences,

  • "lengths" for lengths.

Return type:

dict[str, numpy.ndarray]
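A mapping with these keys can be unpacked directly into any function that takes X and lengths as keyword arguments. The sketch below builds such a mapping by hand and passes it to a hypothetical consumer named describe, which is not part of Sequentia.

```python
import numpy as np

def describe(X, lengths):
    """Hypothetical consumer accepting X and lengths as keyword arguments."""
    return f"{len(lengths)} sequences, {len(X)} observations"

# Stand-in for the mapping described above, built by hand for illustration.
X_lengths = {
    "X": np.zeros((9, 1)),
    "lengths": np.array([3, 2, 4]),
}

# Keyword unpacking passes X=... and lengths=... in one step.
summary = describe(**X_lengths)
```

The same unpacking pattern would apply to mappings that also carry a "y" key, given a consumer that accepts y.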

property X_y: dict[str, ndarray[Any, dtype[float64]] | ndarray[Any, dtype[int64]]]

Observation sequences and corresponding outputs.

Returns:

Mapping with keys:

  • "X" for observation sequences,

  • "y" for outputs.

Return type:

dict[str, numpy.ndarray]

Raises:

AttributeError – If y was not provided to __init__().

property X_y_lengths: dict[str, ndarray[Any, dtype[float64]] | ndarray[Any, dtype[int64]]]

Observation sequences and corresponding outputs and lengths.

Returns:

Mapping with keys:

  • "X" for observation sequences,

  • "y" for outputs,

  • "lengths" for lengths.

Return type:

dict[str, numpy.ndarray]

Raises:

AttributeError – If y was not provided to __init__().

property classes: ndarray[Any, dtype[int64]] | None

Set of unique classes in y.

Returns:

Unique classes if y is categorical.

Return type:

numpy.ndarray | None

property idxs: ndarray[Any, dtype[int64]]

Observation sequence start and end indices.

Returns:

Start and end indices for each sequence in X.

Return type:

numpy.ndarray
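Start and end indices of this kind can be derived from lengths with a cumulative sum. The sketch below assumes one (start, end) row per sequence with an exclusive end index; the exact shape and convention of the array returned by the property are not stated here, so treat both as assumptions.

```python
import numpy as np

lengths = np.array([3, 2, 4])

# End offsets are the running total of lengths; starts are ends minus lengths.
ends = np.cumsum(lengths)
starts = ends - lengths
idxs = np.column_stack((starts, ends))

# Recover each individual sequence from a concatenated array.
X = np.arange(9, dtype=float).reshape(-1, 1)
sequences = [X[start:end] for start, end in idxs]
```

With these made-up lengths, the rows of idxs are (0, 3), (3, 5) and (5, 9).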

property lengths: ndarray[Any, dtype[int64]]

Lengths corresponding to X.

Returns:

Lengths for each sequence in X.

Return type:

numpy.ndarray

property y: ndarray[Any, dtype[float64]] | ndarray[Any, dtype[int64]]

Outputs corresponding to X.

Returns:

Sequence outputs.

Return type:

numpy.ndarray

Raises:

AttributeError – If y was not provided to __init__().