Random Sequences (load_random_sequences)
API reference
- sequentia.datasets.load_random_sequences(n_sequences, n_features, n_classes, length_range, variance_range=(2, 5), lengthscale_range=(0.2, 0.5), random_state=None, tslearn_kwargs={})
Generates random sequences by sampling from a Gaussian process (GP) prior, then clusters the sequences to obtain labels.
Generating sequences
The GP prior is parameterised by a kernel function \(k_\theta(x, x')\). In this case, we use the squared exponential kernel which is parameterised by \(\theta=(\sigma^2, l)\).
\[k_\theta(x, x') = \sigma^2 \exp \left( - \frac{(x - x')^2}{2l^2} \right)\]
\(\sigma^2\) is the variance of the kernel, which controls the height of the generated functions.
\(l\) is the lengthscale of the kernel, which controls the distance between the peaks and troughs of the generated functions.
For \(x\) values \(\mathbf{x}=(0,\ldots, n-1)\) where \(n\) is the specified length of the generated sequence, we compute a kernel matrix where \(K_\theta(\mathbf{x}, \mathbf{x})_{ij}=k_\theta(x_i, x_j)\).
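As a minimal sketch of the kernel matrix described above, the following NumPy snippet evaluates the squared exponential kernel on the integer grid \(\mathbf{x}=(0,\ldots, n-1)\). The function name squared_exponential_kernel and the example values of \(\sigma^2\) and \(l\) are illustrative assumptions, not part of the sequentia API.

```python
import numpy as np

def squared_exponential_kernel(x, variance, lengthscale):
    """Hypothetical helper: K[i, j] = variance * exp(-(x_i - x_j)^2 / (2 * lengthscale^2))."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None] - x[None, :]  # pairwise differences x_i - x_j
    return variance * np.exp(-diff ** 2 / (2.0 * lengthscale ** 2))

# x = (0, ..., n-1) for a sequence of length n = 5; parameter values are arbitrary
K = squared_exponential_kernel(np.arange(5), variance=3.0, lengthscale=0.4)
```

The diagonal of K equals the variance, and the off-diagonal entries decay faster as the lengthscale shrinks.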
We obtain function values over \(\mathbf{x}\) by drawing samples from a multivariate normal distribution.
\[\mathbf{y} \sim \mathcal{N} \big( \mathbf{0}, K_\theta(\mathbf{x}, \mathbf{x}) \big)\]
Each sequence is drawn from its own independent GP prior, with a variance and lengthscale sampled from uniform hyperpriors.
\[\sigma^2 \sim U(a_{\sigma^2},b_{\sigma^2}) \qquad l \sim U(a_l,b_l)\]
where \((a_{\sigma^2},b_{\sigma^2})\) and \((a_l,b_l)\) are given by variance_range and lengthscale_range.
If n_features is more than one, then the same \(\sigma^2\) and \(l\) values are shared across all features of a single sequence, which means that features are often very highly correlated.
Generating labels
To assign labels so that similar sequences share the same label, we cluster the generated sequences using tslearn.clustering.TimeSeriesKMeans and use the resulting cluster labels as the label for each sequence.
- Parameters
- n_sequences: int
Number of sequences to generate.
- n_features: int
Number of features in each sequence.
- n_classes: int
Number of classes.
- length_range: tuple(int, int) or int
Range of values to uniformly sample sequence lengths from.
If a single value is specified, then all sequences will have equal length.
- variance_range: tuple(float, float) or float
Lower and upper bounds of the uniform hyperprior on the variance hyperparameter of the GP prior’s kernel.
If a single value is specified, then this value is used as the variance rather than placing a hyperprior.
- lengthscale_range: tuple(float, float) or float
Lower and upper bounds of the uniform hyperprior on the lengthscale hyperparameter of the GP prior’s kernel.
If a single value is specified, then this value is used as the lengthscale rather than placing a hyperprior.
- random_state: numpy.random.RandomState, int, optional
A random state object or seed for reproducible randomness.
- tslearn_kwargs: dict
Additional keyword arguments for the tslearn.clustering.TimeSeriesKMeans constructor.
The defaults are:
'metric': 'dtw'
'max_iter': 5
'max_iter_barycenter': 5
'n_jobs': -1
'n_clusters' and 'random_state' are overwritten by the n_classes and random_state arguments given to this function.
- Returns
- dataset: sequentia.datasets.Dataset
A dataset object representing the generated random sequences.
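The procedure documented above can be illustrated with a rough NumPy sketch. This is not the sequentia implementation: the function names sample_sequences and resolve_tslearn_kwargs are hypothetical, only tuple-valued variance and lengthscale ranges are handled, the length range is assumed to be inclusive, and each feature is assumed to be an independent draw from the same kernel. The clustering step itself is omitted; the second function only shows one plausible way the documented defaults, the user-supplied tslearn_kwargs, and the overridden keys could be combined.

```python
import numpy as np

def sample_sequences(n_sequences, n_features, length_range,
                     variance_range=(2, 5), lengthscale_range=(0.2, 0.5), seed=None):
    """Hypothetical sketch: draw sequences from per-sequence GP priors."""
    rng = np.random.RandomState(seed)
    # A single integer means every sequence has the same length.
    lo, hi = (length_range, length_range) if isinstance(length_range, int) else length_range
    sequences = []
    for _ in range(n_sequences):
        n = rng.randint(lo, hi + 1)  # assumption: both endpoints are included
        # Uniform hyperpriors on the kernel variance and lengthscale.
        variance = rng.uniform(*variance_range)
        lengthscale = rng.uniform(*lengthscale_range)
        # Squared exponential kernel matrix over x = (0, ..., n-1).
        x = np.arange(n, dtype=float)
        diff = x[:, None] - x[None, :]
        K = variance * np.exp(-diff ** 2 / (2.0 * lengthscale ** 2))
        K += 1e-8 * np.eye(n)  # small jitter for numerical stability
        # One draw per feature, all sharing the same variance and lengthscale.
        y = rng.multivariate_normal(np.zeros(n), K, size=n_features).T
        sequences.append(y)  # shape (n, n_features)
    return sequences

def resolve_tslearn_kwargs(n_classes, random_state=None, tslearn_kwargs=None):
    """Hypothetical sketch: merge documented defaults with user keyword arguments."""
    defaults = {'metric': 'dtw', 'max_iter': 5, 'max_iter_barycenter': 5, 'n_jobs': -1}
    kwargs = {**defaults, **(tslearn_kwargs or {})}
    # These two keys are always taken from the function arguments.
    kwargs['n_clusters'] = n_classes
    kwargs['random_state'] = random_state
    return kwargs

sequences = sample_sequences(n_sequences=4, n_features=3, length_range=(10, 20), seed=0)
kwargs = resolve_tslearn_kwargs(n_classes=5, random_state=0, tslearn_kwargs={'max_iter': 10})
```

Under these assumptions, the resulting kwargs dictionary could be passed to the TimeSeriesKMeans constructor, and the fitted cluster assignments would serve as the class labels.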