Categorical HMM

The Categorical HMM is a variant of HMM that uses a discrete probability distribution over a finite set of symbols as the emission distribution for each state.

This HMM variant can be used to recognize categorical univariate sequences.

Emissions

The emission distribution \(b_m\) of an observation \(o^{(t)}\) at time \(t\) for state \(m\) is given by a probability vector:

\[\bigg[\underbrace{\mathbb{P}\big(o^{(t)}=s_0\ |\ q^{(t)}=m\big)}_{p_{m,0}}, \ldots, \underbrace{\mathbb{P}\big(o^{(t)}=s_K\ |\ q^{(t)}=m\big)}_{p_{m,K}}\bigg]\]

Where:

  • \(\mathcal{S}=\{s_0,s_1,\ldots,s_K\}\) is a finite set of observation symbols.
  • \(o^{(t)}\in\mathcal{S}\) is a single observation at time \(t\).
  • \(q^{(t)}\) is a discrete random variable representing the hidden state at time \(t\).
  • \(p_{m,k}=\mathbb{P}\big(o^{(t)}=s_k\ |\ q^{(t)}=m\big)\) is the probability of observing \(s_k\) while in state \(m\).

The emission distributions for all states can be represented by a single \(M\times K\) emission matrix:

\[\begin{split}\begin{bmatrix} p_{0,0} & \cdots & p_{0,K} \\ \vdots & \ddots & \vdots \\ p_{M,0} & \cdots & p_{M,K} \end{bmatrix}\end{split}\]

Note

Observation symbols must be encoded as integers. Consider performing label encoding using sklearn.preprocessing.LabelEncoder.

API reference

Class

CategoricalHMM

A hidden Markov model with univariate categorical emissions.

Methods

__init__(*[, n_states, topology, ...])

Initializes the CategoricalHMM.

aic(X[, lengths])

The Akaike information criterion of the model, evaluated with the maximum likelihood of X.

bic(X[, lengths])

The Bayesian information criterion of the model, evaluated with the maximum likelihood of X.

fit(X[, lengths])

Fits the HMM to the sequences in X, using the Baum—Welch algorithm.

freeze([params])

Freezes the trainable parameters of the HMM, preventing them from being updated during the Baum—Welch algorithm.

n_params()

Retrieves the number of trainable parameters.

score(x)

Calculates the log-likelihood of the HMM generating a single observation sequence.

set_start_probs([values])

Sets the initial state probabilities.

set_state_emissions(values)

Sets the state emission distribution of the HMM's emission model.

set_transitions([values])

Sets the transition probability matrix.

unfreeze([params])

Unfreezes the trainable parameters of the HMM, allowing them to be updated during the Baum—Welch algorithm.


class sequentia.models.hmm.variants.CategoricalHMM[source]

A hidden Markov model with univariate categorical emissions.

Examples

Using a CategoricalHMM to learn how to recognize DNA sequences from the synthetase gene family.

See load_gene_families() for more information on the sample dataset used in this example.

import numpy as np
from sequentia.datasets import load_gene_families
from sequentia.models.hmm import CategoricalHMM

# Seed for reproducible pseudo-randomness
random_state = np.random.RandomState(1)

# Fetch DNA sequences for the synthetase gene family (no. 4)
data, enc = load_gene_families(families=[4])
train_data, test_data = data.split(test_size=0.2, random_state=random_state)

# Create and train a CategoricalHMM to recognize the synthetase DNA sequences
model = CategoricalHMM(random_state=random_state)
X_train, lengths_train = train_data.X_lengths
model.fit(X_train, lengths_train)

# Calculate the log-likelihood of the first test sample being generated by this model
x, y = test_data[0]
model.score(x)
__init__(*, n_states=5, topology='left-right', random_state=None, hmmlearn_kwargs={'init_params': 'ste', 'params': 'ste'})[source]

Initializes the CategoricalHMM.

Parameters:
  • n_states (PositiveInt) – Number of states in the Markov chain.

  • topology (Literal['ergodic', 'left-right', 'linear'] | None) –

    Transition topology of the Markov chain — see Topologies.

    • If None, behaves the same as 'ergodic' but with hmmlearn initialization.

  • random_state (NonNegativeInt | RandomState | None) – Seed or numpy.random.RandomState object for reproducible pseudo-randomness.

  • hmmlearn_kwargs (Dict[str, Any]) – Additional key-word arguments provided to the hmmlearn HMM constructor.

Return type:

CategoricalHMM

aic(X, lengths=None)[source]

The Akaike information criterion of the model, evaluated with the maximum likelihood of X.

Parameters:
  • X (Array) –

    Univariate observation sequence(s).

    • Should be a single 1D array.

    • Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

  • lengths (Array | None) –

    Lengths of the observation sequence(s) provided in X.

    • If None, then X is assumed to be a single observation sequence.

    • len(X) should be equal to sum(lengths).

Note:

This method requires a trained model — see fit().

Returns:

The Akaike information criterion.

Return type:

float

bic(X, lengths=None)[source]

The Bayesian information criterion of the model, evaluated with the maximum likelihood of X.

Parameters:
  • X (Array) –

    Univariate observation sequence(s).

    • Should be a single 1D array.

    • Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

  • lengths (Array | None) –

    Lengths of the observation sequence(s) provided in X.

    • If None, then X is assumed to be a single observation sequence.

    • len(X) should be equal to sum(lengths).

Note:

This method requires a trained model — see fit().

Returns:

The Bayesian information criterion.

Return type:

float

fit(X, lengths=None)[source]

Fits the HMM to the sequences in X, using the Baum—Welch algorithm.

Parameters:
  • X (Array) –

    Univariate observation sequence(s).

    • Should be a single 1D array.

    • Should be a concatenated sequence if multiple sequences are provided, with respective sequence lengths being provided in the lengths argument for decoding the original sequences.

  • lengths (Array | None) –

    Lengths of the observation sequence(s) provided in X.

    • If None, then X is assumed to be a single observation sequence.

    • len(X) should be equal to sum(lengths).

Returns:

The fitted HMM.

Return type:

CategoricalHMM

freeze(params='ste')[source]

Freezes the trainable parameters of the HMM, preventing them from being updated during the Baum—Welch algorithm.

Parameters:

params (str) –

A string specifying which parameters to freeze. Can contain a combination of:

  • 's' for initial state probabilities,

  • 't' for transition probabilities,

  • 'e' for emission probailities.

Note:

If used, this method should normally be called before fit().

See also

unfreeze

Unfreezes the trainable parameters of the HMM, allowing them to be updated during the Baum—Welch algorithm.

n_params()[source]

Retrieves the number of trainable parameters.

Note:

This method requires a trained model — see fit().

Returns:

Number of trainable parameters.

Return type:

NonNegativeInt

score(x)[source]

Calculates the log-likelihood of the HMM generating a single observation sequence.

Parameters:

x (Array) –

Univariate observation sequence.

  • Should be a single 1D array.

Note:

This method requires a trained model — see fit().

Returns:

The log-likelihood.

Return type:

float

set_start_probs(values='random')

Sets the initial state probabilities.

If this method is not called, initial state probabilities are initialized depending on the value of topology provided to __init__().

  • If topology was set to 'ergodic', 'left-right' or 'linear', then random probabilities will be assigned according to the topology by calling set_start_probs() with value='random'.

  • If topology was set to None, then initial state probabilities will be initialized by hmmlearn.

Parameters:

values (Array | Literal['uniform', 'random']) –

Probabilities or probability type to assign as initial state probabilities.

  • If an Array, should be a vector of starting probabilities for each state.

  • If 'uniform', there is an equal probability of starting in any state.

  • If 'random', the vector of initial state probabilities is sampled from a Dirichlet distribution with unit concentration parameters.

Note:

If used, this method should normally be called before fit().

set_state_emissions(values)[source]

Sets the state emission distribution of the HMM’s emission model.

If this method is not called, emission probabilities will be initialized by hmmlearn.

Parameters:

values (Array) – Array of emission probabilities.

Note:

If used, this method should normally be called before fit().

set_transitions(values='random')

Sets the transition probability matrix.

If this method is not called, transition probabilities are initialized depending on the value of topology provided to __init__():

  • If topology was set to 'ergodic', 'left-right' or 'linear', then random probabilities will be assigned according to the topology by calling set_transitions() with value='random'.

  • If topology was set to None, then initial state probabilities will be initialized by hmmlearn.

Parameters:

values (Array | Literal['uniform', 'random']) –

Probabilities or probability type to assign as state transition probabilities.

  • If an Array, should be a matrix of probabilities where each row must some to one and represents the probabilities of transitioning out of a state.

  • If 'uniform', for each state there is an equal probability of transitioning to any state permitted by the topology.

  • If 'random', the vector of transition probabilities for each row is sampled from a Dirichlet distribution with unit concentration parameters, according to the shape of the topology.

Note:

If used, this method should normally be called before fit().

unfreeze(params='ste')[source]

Unfreezes the trainable parameters of the HMM, allowing them to be updated during the Baum—Welch algorithm.

Parameters:

params (str) –

A string specifying which parameters to unfreeze. Can contain a combination of:

  • 's' for initial state probabilities,

  • 't' for transition probabilities,

  • 'e' for emission probailities.

See also

freeze

Freezes the trainable parameters of the HMM, preventing them from being updated during the Baum—Welch algorithm.

hmmlearn_kwargs

Additional key-word arguments provided to the hmmlearn HMM constructor.

model

Underlying HMM object from hmmlearn — only set after fit().

n_states

Number of states in the Markov chain.

random_state

Seed or numpy.random.RandomState object for reproducible pseudo-randomness.

topology

Transition topology of the Markov chain — see Topologies.