Datasets


Sequentia provides a selection of sample sequential datasets for quick experimentation.

Each dataset follows the interface described below.

API reference

Class

SequentialDataset

Utility wrapper for a generic sequential dataset.

Methods

__init__(X[, y, lengths, classes])

Initializes a SequentialDataset.

copy()

Creates a copy of the dataset.

iter_by_class()

Subsets the observation sequences by class.

load(path)

Loads a stored dataset in .npz format.

save(path[, compress])

Stores the dataset in .npz format.

split([test_size, train_size, random_state, ...])

Splits the dataset into two partitions (train/test).

Properties

X

Observation sequences.

X_lengths

Observation sequences and corresponding lengths.

X_y

Observation sequences and corresponding outputs.

X_y_lengths

Observation sequences and corresponding outputs and lengths.

classes

Set of unique classes in y.

idxs

Observation sequence start and end indices.

lengths

Lengths corresponding to X.

y

Outputs corresponding to X.


class sequentia.utils.SequentialDataset

Utility wrapper for a generic sequential dataset.

__init__(X, y=None, lengths=None, classes=None)

Initializes a SequentialDataset.

Parameters:
  • X (Array) –

    Univariate or multivariate observation sequence(s).

    • Should be a single 1D or 2D array.

    • Should have length as the 1st dimension and features as the 2nd dimension.

    • Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

  • y (Optional[Array]) – Outputs corresponding to sequence(s) provided in X.

  • lengths (Optional[Array]) –

    Lengths of the observation sequence(s) provided in X.

    • If None, then X is assumed to be a single observation sequence.

    • len(X) should be equal to sum(lengths).

  • classes (Optional[Array]) –

    Set of possible class labels (only if y was provided with categorical values).

    • If not provided, these will be determined from the labels in y.

Return type:

SequentialDataset
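The concatenated layout expected for X and lengths can be illustrated with NumPy alone. This is a minimal sketch of the data format, not the library's own code; the sample sequences are made up for illustration.

```python
import numpy as np

# Three hypothetical multivariate sequences with 2 features each
# and different lengths (3, 2 and 4 time steps).
sequences = [
    np.arange(6, dtype=float).reshape(3, 2),
    np.arange(4, dtype=float).reshape(2, 2),
    np.arange(8, dtype=float).reshape(4, 2),
]

# Concatenate along the time axis to get a single 2D array, and record
# each sequence's length so the originals can be recovered later.
X = np.concatenate(sequences, axis=0)
lengths = np.array([len(seq) for seq in sequences])

# The documented invariant: len(X) equals sum(lengths).
assert len(X) == lengths.sum()

# Decode the original sequences by splitting at the cumulative lengths.
decoded = np.split(X, np.cumsum(lengths)[:-1])
```

Arrays shaped like X and lengths above are the kind of input that the X and lengths parameters describe.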

copy()

Creates a copy of the dataset.

Returns:

Dataset copy.

Return type:

SequentialDataset

iter_by_class()

Subsets the observation sequences by class.

Raises:

AttributeError - If y was not provided to __init__(), or is not categorical.

Returns:

Generator iterating over classes, yielding:

  • X subset of sequences belonging to the class.

  • Lengths corresponding to the X subset.

  • Class used to subset X.

Return type:

Iterator[Tuple[Array, Array, int]]
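To make the yielded triple concrete, here is a rough NumPy sketch of subsetting concatenated sequences by class. The standalone function below is hypothetical and only mirrors the documented return values; it is not the library's implementation.

```python
import numpy as np

def iter_by_class(X, y, lengths):
    """Yield (X subset, lengths subset, class) for each unique class in y."""
    # Start offset of each sequence within the concatenated array.
    starts = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    for c in np.unique(y):
        # Positions of the sequences labelled with class c.
        seq_idx = np.flatnonzero(y == c)
        # Gather the rows of every selected sequence, in order.
        rows = np.concatenate(
            [np.arange(starts[i], starts[i] + lengths[i]) for i in seq_idx]
        )
        yield X[rows], lengths[seq_idx], int(c)

# Four univariate sequences of lengths 2, 3, 1 and 2, with one label each.
X = np.arange(8).reshape(-1, 1)
lengths = np.array([2, 3, 1, 2])
y = np.array([0, 1, 0, 1])

subsets = {c: (X_c, lengths_c) for X_c, lengths_c, c in iter_by_class(X, y, lengths)}
```

Each yielded X subset is itself a concatenated array, so it stays paired with its own lengths.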

classmethod load(path)

Loads a stored dataset in .npz format.

See numpy.load().

Parameters:

path (Union[str, Path, IO]) – Location of the stored dataset to load.

Returns:

The loaded dataset.

Return type:

SequentialDataset

See also

save

Stores the dataset in .npz format.

save(path, compress=True)

Stores the dataset in .npz format.

See numpy.savez() and numpy.savez_compressed().

Parameters:
  • path (Union[str, Path, IO]) – Location to store the dataset.

  • compress (bool) – Whether or not to compress the dataset.

See also

load

Loads a stored dataset in .npz format.
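Since save and load are documented in terms of numpy.savez(), numpy.savez_compressed() and numpy.load(), a round trip can be sketched directly with those functions. The helper names and the archive keys "X", "lengths" and "y" are assumptions made for this example; the keys the library actually writes may differ.

```python
import io
import numpy as np

def save(file, X, lengths, y, compress=True):
    # Choose the NumPy writer based on the compress flag.
    writer = np.savez_compressed if compress else np.savez
    # The key names "X", "lengths" and "y" are hypothetical.
    writer(file, X=X, lengths=lengths, y=y)

def load(file):
    # np.load returns a lazy, dict-like NpzFile; read the arrays out of it.
    with np.load(file) as data:
        return data["X"], data["lengths"], data["y"]

X = np.arange(10, dtype=float).reshape(5, 2)
lengths = np.array([2, 3])
y = np.array([0, 1])

# An in-memory buffer stands in for a file path here.
buffer = io.BytesIO()
save(buffer, X, lengths, y, compress=True)
buffer.seek(0)
X2, lengths2, y2 = load(buffer)
```

Passing a str or Path instead of the buffer would write the archive to disk in the same way.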

split(test_size=None, train_size=None, random_state=None, shuffle=True, stratify=False)

Splits the dataset into two partitions (train/test).

See sklearn.model_selection.train_test_split().

Parameters:
  • test_size (Optional[Union[NonNegativeInt, ConstrainedFloatValue]]) – Size of the test partition.

  • train_size (Optional[Union[NonNegativeInt, ConstrainedFloatValue]]) – Size of the train partition.

  • random_state (Optional[Union[NonNegativeInt, RandomState]]) – Seed or numpy.random.RandomState object for reproducible pseudo-randomness.

  • shuffle (bool) – Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be False.

  • stratify (bool) – Whether or not to stratify the partitions by class labels.

Returns:

Dataset partitions.

Return type:

Tuple[SequentialDataset, SequentialDataset]
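The sketch below illustrates one way a train/test split over concatenated sequences might work, assuming partitions are formed from whole sequences rather than individual rows. It is a simplified NumPy stand-in (no stratification, and no call to sklearn.model_selection.train_test_split()), and the function name and rounding rule are hypothetical.

```python
import numpy as np

def split_sequences(X, lengths, test_size=0.25, random_state=None):
    """Split a concatenated dataset into train and test parts by sequence."""
    n = len(lengths)
    # Shuffle sequence positions reproducibly from the given seed.
    order = np.random.RandomState(random_state).permutation(n)
    n_test = max(1, int(round(test_size * n)))
    test_idx, train_idx = np.sort(order[:n_test]), np.sort(order[n_test:])

    # Decode the individual sequences, then re-concatenate each partition.
    sequences = np.split(X, np.cumsum(lengths)[:-1])

    def gather(idx):
        return np.concatenate([sequences[i] for i in idx]), lengths[idx]

    return gather(train_idx), gather(test_idx)

# Four sequences of lengths 2, 3, 1 and 4 with 2 features each.
X = np.arange(20.0).reshape(10, 2)
lengths = np.array([2, 3, 1, 4])

(X_train, len_train), (X_test, len_test) = split_sequences(
    X, lengths, test_size=0.25, random_state=0
)
```

Splitting by sequence keeps every observation of a sequence in the same partition, so each partition remains a valid concatenated array with matching lengths.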

property X: Array

Observation sequences.

property X_lengths: Tuple[Array, Array]

Observation sequences and corresponding lengths.

property X_y: Tuple[Array, Array]

Observation sequences and corresponding outputs.

Raises:

AttributeError - If y was not provided to __init__().

property X_y_lengths: Tuple[Array, Array, Array]

Observation sequences and corresponding outputs and lengths.

Raises:

AttributeError - If y was not provided to __init__().

property classes: Optional[Array]

Set of unique classes in y. If y is not categorical, then None.

property idxs: Array

Observation sequence start and end indices.
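Start and end indices of this kind can be derived from the lengths alone. The sketch below assumes one (start, end) row per sequence with an exclusive end; the exact array layout returned by the property is not specified here.

```python
import numpy as np

lengths = np.array([3, 2, 4])

# End offsets are the cumulative lengths; each start offset is
# its end offset minus the sequence's own length.
ends = np.cumsum(lengths)
starts = ends - lengths

# One (start, end) row per sequence, with end exclusive (an assumption).
idxs = np.stack([starts, ends], axis=1)
```

With this layout, slicing a concatenated array X as X[start:end] for each row recovers one original sequence.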

property lengths: Array

Lengths corresponding to X.

property y: Array

Outputs corresponding to X.

Raises:

AttributeError - If y was not provided to __init__().