Pipeline

Before fitting and using a model, it is common to apply a sequence of preprocessing steps to data.

Pipelines can be used to wrap preprocessing transformations as well as a model into a single estimator, making it more convenient to reapply the transformations and make predictions on new data.

The Pipeline class implements this feature and is based on sklearn.pipeline.Pipeline.

API reference

Class

Pipeline

Pipeline of transforms with a final estimator.

Methods

__init__(steps, *[, memory, verbose])

Initializes the Pipeline.

fit(X[, y, lengths])

Fit the model.

fit_predict(X, y[, lengths])

Transform the data, and apply fit_predict with the final estimator.

fit_transform(X[, lengths])

Fit the model and transform with the final estimator.

inverse_transform(X[, lengths])

Apply inverse_transform for each step in reverse order.

predict(X[, lengths])

Transform the data, and apply predict with the final estimator.

predict_proba(X[, lengths])

Transform the data, and apply predict_proba with the final estimator.

score(X[, y, lengths, sample_weight])

Transform the data, and apply score with the final estimator.

transform(X[, lengths])

Transform the data, and apply transform with the final estimator.


class sequentia.pipeline.Pipeline

Pipeline of transforms with a final estimator.

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using the memory argument.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. To support this, the parameters of each step can be set using the step name and the parameter name separated by a double underscore (__). A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer may be removed by setting it to 'passthrough' or None.
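The chaining described above can be pictured with a minimal sketch in plain Python. This is not sequentia's implementation: the TinyPipeline, Center and SignClassifier names are hypothetical and exist only to show how intermediate steps are fitted and applied in order before the final estimator is fitted.

```python
class Center:
    """Hypothetical transform: subtracts the mean learned during fit."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class SignClassifier:
    """Hypothetical final estimator: predicts 1 for positive inputs, else 0."""

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return [1 if x > 0 else 0 for x in X]


class TinyPipeline:
    """Illustrative only: chains (name, step) pairs as the text describes."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None):
        # Fit and apply every intermediate transform in order.
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        # The final estimator only needs to implement fit.
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Re-apply the fitted transforms, then delegate to the final estimator.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


pipe = TinyPipeline([("center", Center()), ("clf", SignClassifier())])
pipe.fit([1.0, 2.0, 3.0, 6.0])
predictions = pipe.predict([0.0, 3.0, 10.0])
```

The mean learned during fit is reused at prediction time, which is why wrapping the transforms and the model in one object makes it convenient to handle new data.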

See also

sklearn.pipeline.Pipeline

Pipeline is based on sklearn.pipeline.Pipeline, but adapted to accept and work with sequences. Read more in the User Guide.

Examples

Creating a Pipeline consisting of two transforms and a KNNClassifier, and fitting it to sequences in the spoken digits dataset.

from sequentia.models import KNNClassifier
from sequentia.preprocessing import IndependentFunctionTransformer
from sequentia.pipeline import Pipeline
from sequentia.datasets import load_digits

from sklearn.preprocessing import scale
from sklearn.decomposition import PCA

# Fetch MFCCs of spoken digits
digits = load_digits()
train, test = digits.split(test_size=0.2)

# Create a pipeline with two transforms and a classifier
pipeline = Pipeline([
    ('standardize', IndependentFunctionTransformer(scale)),
    ('pca', PCA(n_components=5)),
    ('clf', KNNClassifier(k=1))
])

# Fit the pipeline transforms and classifier to training data
pipeline.fit(train.X, train.y, lengths=train.lengths)

# Apply the transforms to training sequences and make predictions
y_train_pred = pipeline.predict(train.X, lengths=train.lengths)

# Calculate accuracy on test data
acc = pipeline.score(test.X, test.y, lengths=test.lengths)

__init__(steps, *, memory=None, verbose=False)

Initializes the Pipeline.

Parameters:
  • steps (List[Tuple[str, BaseEstimator]]) – Collection of transforms implementing fit/transform that are chained, with the last object being an estimator.

  • memory (str | Memory | None) – Used to cache the fitted transformers of the pipeline. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting, so the transformer instances given to the pipeline cannot be inspected directly. Use the named_steps or steps attribute to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time-consuming.

  • verbose (bool) – If True, the time elapsed while fitting each step will be printed as it is completed.

Return type:

Pipeline

fit(X, y=None, lengths=None, **fit_params)

Fit the model.

Fit all the transformers one after the other and transform the data. Finally, fit the transformed data using the final estimator.

Parameters:
  • X (Array) –

    Univariate or multivariate observation sequence(s).

    • Should be a single 1D or 2D array.

    • Should have length as the 1st dimension and features as the 2nd dimension.

    • Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

  • y (Array | None) – Outputs corresponding to sequence(s) provided in X. Only required if the final estimator is a supervised model.

  • lengths (Array | None) –

    Lengths of the observation sequence(s) provided in X.

    • If None, then X is assumed to be a single observation sequence.

    • len(X) should be equal to sum(lengths).

  • fit_params – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.

Returns:

The fitted pipeline.

Return type:

Pipeline
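The relationship between a concatenated X and its lengths can be illustrated with a small sketch. The split_sequences helper is hypothetical (it is not part of sequentia) and plain lists stand in for arrays; it recovers the original sequences from the offsets implied by lengths.

```python
from itertools import accumulate


def split_sequences(X, lengths=None):
    """Hypothetical helper: recover individual sequences from a concatenated X."""
    if lengths is None:
        # No lengths given: X is treated as one observation sequence.
        return [X]
    if sum(lengths) != len(X):
        raise ValueError("len(X) should be equal to sum(lengths)")
    ends = list(accumulate(lengths))
    starts = [0] + ends[:-1]
    return [X[start:end] for start, end in zip(starts, ends)]


# Three sequences of lengths 2, 3 and 1, each observation having 2 features.
X = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]
sequences = split_sequences(X, lengths=[2, 3, 1])
```

Here sequences holds three lists with 2, 3 and 1 observations respectively, in the order they were concatenated.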

fit_predict(X, y, lengths=None, **fit_params)

Transform the data, and apply fit_predict with the final estimator.

Calls fit_transform on each transformer in the pipeline, then passes the transformed data to the fit_predict method of the final estimator. Only valid if the final estimator implements fit_predict.

Parameters:
  • X (Array) –

    Univariate or multivariate observation sequence(s).

    • Should be a single 1D or 2D array.

    • Should have length as the 1st dimension and features as the 2nd dimension.

    • Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

  • y (Array) – Outputs corresponding to sequence(s) provided in X.

  • lengths (Array | None) –

    Lengths of the observation sequence(s) provided in X.

    • If None, then X is assumed to be a single observation sequence.

    • len(X) should be equal to sum(lengths).

  • fit_params – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.

Returns:

Output predictions.

Return type:

Array
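The s__p naming convention for fit_params can be sketched as follows. The route_fit_params helper is hypothetical and only illustrates how prefixed keys might be grouped per step; the parameter names p and q are placeholders, not real parameters of any estimator.

```python
def route_fit_params(step_names, fit_params):
    """Hypothetical helper: group s__p keys into per-step keyword arguments."""
    routed = {name: {} for name in step_names}
    for key, value in fit_params.items():
        if "__" not in key:
            raise ValueError(f"Expected a key of the form step__param, got {key!r}")
        # Split on the first double underscore only, so parameter names
        # that themselves contain '__' stay intact.
        step, param = key.split("__", 1)
        if step not in routed:
            raise ValueError(f"Unknown step {step!r}")
        routed[step][param] = value
    return routed


# 'p' and 'q' are placeholder parameter names used purely for illustration.
routed = route_fit_params(
    ["standardize", "pca", "clf"],
    {"clf__p": 1, "pca__q": 2},
)
```

Each step then receives only the keyword arguments addressed to it, and steps with no matching keys receive none.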

fit_transform(X, lengths=None, **fit_params)

Fit the model and transform with the final estimator.

Fits all the transformers one after the other and transforms the data, then calls fit_transform on the transformed data with the final estimator.

Parameters:
  • X (Array) –

    Univariate or multivariate observation sequence(s).

    • Should be a single 1D or 2D array.

    • Should have length as the 1st dimension and features as the 2nd dimension.

    • Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

  • lengths (Array | None) –

    Lengths of the observation sequence(s) provided in X.

    • If None, then X is assumed to be a single observation sequence.

    • len(X) should be equal to sum(lengths).

  • fit_params – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.

Returns:

The transformed data.

Return type:

Array

inverse_transform(X, lengths=None)

Apply inverse_transform for each step in reverse order.

All estimators in the pipeline must support inverse_transform.

Parameters:
  • X (Array) –

    Univariate or multivariate observation sequence(s).

    • Should be a single 1D or 2D array.

    • Should have length as the 1st dimension and features as the 2nd dimension.

    • Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

  • lengths (Array | None) –

    Lengths of the observation sequence(s) provided in X.

    • If None, then X is assumed to be a single observation sequence.

    • len(X) should be equal to sum(lengths).

Returns:

The inverse transformed data.

Return type:

Array
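Why the inverse must run in reverse order can be seen in a short sketch. The Shift and Scale classes are hypothetical invertible transforms written for this illustration only; they are not sequentia or scikit-learn classes.

```python
class Shift:
    """Hypothetical invertible transform: adds a constant offset."""

    def __init__(self, offset):
        self.offset = offset

    def transform(self, X):
        return [x + self.offset for x in X]

    def inverse_transform(self, X):
        return [x - self.offset for x in X]


class Scale:
    """Hypothetical invertible transform: multiplies by a constant factor."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, X):
        return [x * self.factor for x in X]

    def inverse_transform(self, X):
        return [x / self.factor for x in X]


steps = [Shift(1.0), Scale(4.0)]

# Forward pass: apply the steps first to last.
X = [0.0, 1.0, 2.0]
Xt = X
for step in steps:
    Xt = step.transform(Xt)

# Inverse pass: undo the steps last to first.
X_back = Xt
for step in reversed(steps):
    X_back = step.inverse_transform(X_back)

# Undoing the steps in the forward order does not recover X.
X_wrong = Xt
for step in steps:
    X_wrong = step.inverse_transform(X_wrong)
```

Only the reversed loop returns the original values, since the last transform applied has to be the first one undone.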

predict(X, lengths=None)

Transform the data, and apply predict with the final estimator.

Calls transform on each transformer in the pipeline, then passes the transformed data to the predict method of the final estimator. Only valid if the final estimator implements predict.

Parameters:
  • X (Array) –

    Univariate or multivariate observation sequence(s).

    • Should be a single 1D or 2D array.

    • Should have length as the 1st dimension and features as the 2nd dimension.

    • Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

  • lengths (Array | None) –

    Lengths of the observation sequence(s) provided in X.

    • If None, then X is assumed to be a single observation sequence.

    • len(X) should be equal to sum(lengths).

Returns:

Output predictions.

Return type:

Array

predict_proba(X, lengths=None)

Transform the data, and apply predict_proba with the final estimator.

Calls transform on each transformer in the pipeline, then passes the transformed data to the predict_proba method of the final estimator. Only valid if the final estimator implements predict_proba.

Parameters:
  • X (Array) –

    Univariate or multivariate observation sequence(s).

    • Should be a single 1D or 2D array.

    • Should have length as the 1st dimension and features as the 2nd dimension.

    • Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

  • lengths (Array | None) –

    Lengths of the observation sequence(s) provided in X.

    • If None, then X is assumed to be a single observation sequence.

    • len(X) should be equal to sum(lengths).

Returns:

Output probabilities.

Return type:

Array

score(X, y=None, lengths=None, sample_weight=None)

Transform the data, and apply score with the final estimator.

Calls transform on each transformer in the pipeline, then passes the transformed data to the score method of the final estimator. Only valid if the final estimator implements score.

Parameters:
  • X (Array) –

    Univariate or multivariate observation sequence(s).

    • Should be a single 1D or 2D array.

    • Should have length as the 1st dimension and features as the 2nd dimension.

    • Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

  • y (Array | None) – Outputs corresponding to sequence(s) provided in X. Must be provided if the final estimator is a model, i.e. not a transform.

  • lengths (Array | None) –

    Lengths of the observation sequence(s) provided in X.

    • If None, then X is assumed to be a single observation sequence.

    • len(X) should be equal to sum(lengths).

  • sample_weight (Any | None) – If not None, this argument is passed as sample_weight keyword argument to the score method of the final estimator.

Returns:

Result of calling score on the final estimator.

Return type:

float
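What score returns depends entirely on the final estimator. As a sketch under the assumption that the final estimator is a classifier following the scikit-learn convention (mean accuracy, optionally weighted by sample_weight), the computation might look like the following. The weighted_accuracy helper is hypothetical and is not taken from sequentia.

```python
def weighted_accuracy(y_true, y_pred, sample_weight=None):
    """Hypothetical helper: accuracy, optionally weighted per sequence."""
    if sample_weight is None:
        # Without weights, every sequence counts equally.
        sample_weight = [1.0] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


# One label per sequence; the third prediction is wrong.
y_true = [0, 1, 1, 2]
y_pred = [0, 1, 2, 2]

acc = weighted_accuracy(y_true, y_pred)
acc_weighted = weighted_accuracy(y_true, y_pred, sample_weight=[1.0, 1.0, 2.0, 0.0])
```

Giving the misclassified sequence a larger weight lowers the score, while a weight of zero removes a sequence from the calculation altogether.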

transform(X, lengths=None)

Transform the data, and apply transform with the final estimator.

Calls transform on each transformer in the pipeline, then passes the transformed data to the transform method of the final estimator. Only valid if the final estimator implements transform.

This also works when the final estimator is None, in which case only the prior transformations are applied.

Parameters:
  • X (Array) –

    Univariate or multivariate observation sequence(s).

    • Should be a single 1D or 2D array.

    • Should have length as the 1st dimension and features as the 2nd dimension.

    • Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

  • lengths (Array | None) –

    Lengths of the observation sequence(s) provided in X.

    • If None, then X is assumed to be a single observation sequence.

    • len(X) should be equal to sum(lengths).

Returns:

The transformed data.

Return type:

Array
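The behaviour with disabled steps can be sketched in a few lines. The apply_transforms helper and the Double class are hypothetical and only illustrate the idea that a step set to None or 'passthrough' leaves the data unchanged; they do not reproduce sequentia's internals.

```python
def apply_transforms(steps, X):
    """Hypothetical helper: apply each step's transform, skipping disabled steps."""
    for name, step in steps:
        if step is None or step == "passthrough":
            # A disabled step leaves the data unchanged.
            continue
        X = step.transform(X)
    return X


class Double:
    """Hypothetical transform used only for this illustration."""

    def transform(self, X):
        return [2 * x for x in X]


# The final step is None, so only the prior transformations are applied.
steps = [("double", Double()), ("skipped", "passthrough"), ("final", None)]
Xt = apply_transforms(steps, [1, 2, 3])
```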

property classes_

The class labels. Only exists if the last step is a classifier.

property n_features_in_

Number of features seen during the fit method of the first step.

property named_steps

Access the steps by name.

Read-only attribute to access any step by its name. Keys are step names and values are the step objects.