Random Sequences (load_random_sequences)

API reference

sequentia.datasets.load_random_sequences(n_sequences, n_features, n_classes, length_range, variance_range=(2, 5), lengthscale_range=(0.2, 0.5), random_state=None, tslearn_kwargs={})

Generates random sequences by sampling from a Gaussian process (GP) prior, then clusters the sequences to obtain labels.

Generating sequences

The GP prior is parameterised by a kernel function \(k_\theta(x, x')\). In this case, we use the squared exponential kernel which is parameterised by \(\theta=(\sigma^2, l)\).

\[k_\theta(x, x') = \sigma^2 \exp \left( - \frac{(x - x')^2}{2l^2} \right)\]
  • \(\sigma^2\) is the variance of the kernel, which controls the height of the generated functions.

  • \(l\) is the lengthscale of the kernel, which controls the distance between the peaks and troughs of the generated functions.

For \(x\) values \(\mathbf{x}=(0,\ldots, n-1)\) where \(n\) is the specified length of the generated sequence, we compute a kernel matrix where \(K_\theta(\mathbf{x}, \mathbf{x})_{ij}=k_\theta(x_i, x_j)\).
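As a minimal sketch of the kernel matrix described above, the following NumPy snippet evaluates the squared exponential kernel on \(\mathbf{x}=(0,\ldots,n-1)\). The helper name squared_exponential_kernel and the example hyperparameter values are illustrative assumptions, not part of the sequentia API.

```python
import numpy as np

def squared_exponential_kernel(x, variance, lengthscale):
    """Return K with K[i, j] = variance * exp(-(x_i - x_j)^2 / (2 * lengthscale^2))."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None] - x[None, :]  # pairwise differences x_i - x_j
    return variance * np.exp(-diff**2 / (2.0 * lengthscale**2))

n = 5                # illustrative sequence length
x = np.arange(n)     # x = (0, ..., n-1)
K = squared_exponential_kernel(x, variance=2.0, lengthscale=0.5)
```

The diagonal of K equals the variance \(\sigma^2\), and entries decay as \(|x_i - x_j|\) grows relative to the lengthscale \(l\).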

We obtain function values over \(\mathbf{x}\) by drawing a sample from a multivariate normal distribution with zero mean and the kernel matrix as its covariance.

\[\mathbf{y} \sim \mathcal{N} \big( \mathbf{0}, K_\theta(\mathbf{x}, \mathbf{x}) \big)\]
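The draw above might be sketched with NumPy as follows. The hyperparameter values are illustrative, and the small diagonal jitter is an assumption added here for numerical stability; it is not described in this reference.

```python
import numpy as np

rng = np.random.RandomState(0)

n = 50
x = np.arange(n)                       # x = (0, ..., n-1)
variance, lengthscale = 3.0, 0.4       # illustrative values of sigma^2 and l

# Squared exponential kernel matrix K(x, x)
diff = x[:, None] - x[None, :]
K = variance * np.exp(-diff**2 / (2.0 * lengthscale**2))
K += 1e-8 * np.eye(n)                  # jitter (assumption): keeps K numerically positive definite

# One sample y ~ N(0, K), giving one function value per element of x
y = rng.multivariate_normal(mean=np.zeros(n), cov=K)
```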

Each sequence is drawn from its own independent GP prior, with a variance and lengthscale sampled from uniform hyperpriors.

\[\sigma^2 \sim U(a_{\sigma^2},b_{\sigma^2}) \qquad l \sim U(a_l,b_l)\]

Here, \((a_{\sigma^2},b_{\sigma^2})\) and \((a_l,b_l)\) are the bounds given by variance_range and lengthscale_range respectively.
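A minimal sketch of sampling from the hyperpriors, assuming a hypothetical helper named sample_hyperparameter (not part of sequentia). It also covers the single-value case described under Parameters, where a fixed value is used instead of a hyperprior.

```python
import numpy as np

def sample_hyperparameter(value_range, rng):
    """Draw from U(a, b) when given an (a, b) pair; otherwise return the fixed value."""
    if isinstance(value_range, (tuple, list)):
        low, high = value_range
        return rng.uniform(low, high)
    return float(value_range)

rng = np.random.RandomState(0)
variance = sample_hyperparameter((2, 5), rng)         # default variance_range
lengthscale = sample_hyperparameter((0.2, 0.5), rng)  # default lengthscale_range
fixed = sample_hyperparameter(1.5, rng)               # single value: no hyperprior
```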

If n_features is greater than one, the same \(\sigma^2\) and \(l\) values are shared across all features of a single sequence, which means that features are often very highly correlated.

Generating labels

So that sequences sharing a label are somewhat similar, we cluster the generated sequences using tslearn.clustering.TimeSeriesKMeans and use the resulting cluster assignment as the label for each sequence.

Parameters
n_sequences: int

Number of sequences to generate.

n_features: int

Number of features in each sequence.

n_classes: int

Number of classes.

length_range: tuple(int, int) or int

Range of values to uniformly sample sequence lengths from.

If a single value is specified, then all sequences will have equal length.
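The two accepted forms of length_range could be handled as in this sketch. The helper name sample_lengths is hypothetical, and treating the upper bound as inclusive is an assumption; this reference does not say whether the upper bound can be sampled.

```python
import numpy as np

def sample_lengths(length_range, n_sequences, rng):
    """Return one length per sequence from an int or a (low, high) pair."""
    if isinstance(length_range, (int, np.integer)):
        # Single value: every sequence has the same length
        return np.full(n_sequences, length_range, dtype=int)
    low, high = length_range
    # Assumption: upper bound inclusive, hence high + 1
    return rng.randint(low, high + 1, size=n_sequences)

rng = np.random.RandomState(0)
varied = sample_lengths((20, 30), 100, rng)
equal = sample_lengths(25, 4, rng)
```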

variance_range: tuple(float, float) or float

Lower and upper bounds of the uniform hyperprior on the variance hyperparameter of the GP prior’s kernel.

If a single value is specified, then this value is used as the variance rather than placing a hyperprior.

lengthscale_range: tuple(float, float) or float

Lower and upper bounds of the uniform hyperprior on the lengthscale hyperparameter of the GP prior’s kernel.

If a single value is specified, then this value is used as the lengthscale rather than placing a hyperprior.

random_state: numpy.random.RandomState, int, optional

A random state object or seed for reproducible randomness.
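One common way to accept either a seed or a random state object is sketched below. The helper as_random_state is hypothetical and is not necessarily how sequentia handles this argument.

```python
import numpy as np

def as_random_state(random_state=None):
    """Normalise None, an int seed, or a RandomState into a RandomState."""
    if random_state is None:
        return np.random.RandomState()  # unseeded: results differ between runs
    if isinstance(random_state, (int, np.integer)):
        return np.random.RandomState(random_state)
    if isinstance(random_state, np.random.RandomState):
        return random_state
    raise TypeError("random_state must be None, an int, or a numpy.random.RandomState")

# The same seed yields the same draws
a = as_random_state(42).rand(3)
b = as_random_state(42).rand(3)
```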

tslearn_kwargs: dict

Additional keyword arguments for the tslearn.clustering.TimeSeriesKMeans constructor.

The defaults are:

  • 'metric': 'dtw'

  • 'max_iter': 5

  • 'max_iter_barycenter': 5

  • 'n_jobs': -1

'n_clusters' and 'random_state' are overwritten by the n_classes and random_state arguments given to this function.
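The precedence described above (defaults, then user-supplied values, then the forced 'n_clusters' and 'random_state') can be illustrated with a plain dictionary merge. The function build_kmeans_kwargs is a hypothetical name for this sketch, not the library's internal code.

```python
def build_kmeans_kwargs(n_classes, random_state, tslearn_kwargs=None):
    """Merge the documented defaults with user kwargs, then force the two overrides."""
    defaults = {
        "metric": "dtw",
        "max_iter": 5,
        "max_iter_barycenter": 5,
        "n_jobs": -1,
    }
    kwargs = {**defaults, **(tslearn_kwargs or {})}  # user values win over defaults
    kwargs["n_clusters"] = n_classes                 # always taken from n_classes
    kwargs["random_state"] = random_state            # always taken from random_state
    return kwargs

kwargs = build_kmeans_kwargs(3, 0, {"max_iter": 10, "n_clusters": 99})
```

Here the user-supplied 'max_iter' replaces the default, while the user-supplied 'n_clusters' is discarded in favour of n_classes.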

Returns
dataset: sequentia.datasets.Dataset

A dataset object representing the generated random sequences and their labels.