Gene Families

API reference

sequentia.datasets.load_gene_families(*, families={0, 1, 2, 3, 4, 5, 6})

Load a dataset of human DNA sequences grouped by gene family.

The Human DNA Sequences dataset consists of 4380 DNA sequences belonging to 7 gene families.

The classes in this dataset are imbalanced. The original symbols that form the DNA sequences (A, T, C, G, N) are encoded as integers using an sklearn.preprocessing.LabelEncoder.
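As a rough illustration of the integer encoding, the pure-Python sketch below mimics how a LabelEncoder numbers its classes: the distinct symbols are sorted and each one is assigned its index. It assumes the encoder saw exactly the five symbols listed above. The names SYMBOLS, encode and decode are hypothetical helpers, not part of sequentia, and the mapping actually used by the loader should be confirmed from the returned encoder's classes_ attribute.

```python
# Pure-Python stand-in for the sorted-class numbering that LabelEncoder applies.
# Assumption: the encoder was fitted on exactly these five symbols.
SYMBOLS = ["A", "T", "C", "G", "N"]

# LabelEncoder stores its classes in sorted order and uses the index as the code.
classes = sorted(set(SYMBOLS))
symbol_to_int = {symbol: index for index, symbol in enumerate(classes)}


def encode(sequence):
    """Map a string of DNA symbols to a list of integer codes."""
    return [symbol_to_int[symbol] for symbol in sequence]


def decode(codes):
    """Map a list of integer codes back to a string of DNA symbols."""
    return "".join(classes[code] for code in codes)


encoded = encode("ATGCN")
restored = decode(encoded)
```

With the real encoder, the equivalent round trip would go through its transform and inverse_transform methods.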

The gene families have the following class labels:

  • G protein coupled receptors: 0

  • Tyrosine kinase: 1

  • Tyrosine phosphatase: 2

  • Synthetase: 3

  • Synthase: 4

  • Ion channel: 5

  • Transcription: 6
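The label table above can be kept as a small lookup so that a subset of families can be selected by name rather than by number. This is only a convenience sketch: FAMILY_LABELS and families_from_names are hypothetical names that sequentia does not provide.

```python
# Class labels copied from the list above.
FAMILY_LABELS = {
    "G protein coupled receptors": 0,
    "Tyrosine kinase": 1,
    "Tyrosine phosphatase": 2,
    "Synthetase": 3,
    "Synthase": 4,
    "Ion channel": 5,
    "Transcription": 6,
}


def families_from_names(names):
    """Translate family names into the set of integer labels."""
    unknown = [name for name in names if name not in FAMILY_LABELS]
    if unknown:
        raise KeyError(f"Unknown gene families: {unknown}")
    return {FAMILY_LABELS[name] for name in names}


# A set like this is the kind of value the `families` parameter expects.
selected = families_from_names(["Tyrosine kinase", "Tyrosine phosphatase"])
```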

Parameters:

families (set[int]) – Subset of gene families to include in the dataset, given as class labels from 0 to 6. Defaults to all seven families.

Returns:

  • A dataset object representing the loaded genetic data.

  • Label encoder used to encode the observation symbols into integers.

Return type:

tuple[SequentialDataset, sklearn.preprocessing.LabelEncoder]
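A minimal usage sketch, assuming sequentia is installed and exposes the signature documented above. The wrapper load_families and the validation helper check_families are hypothetical additions for illustration. The import is deferred into the wrapper so that the validation logic can be used on its own.

```python
# Labels 0 to 6, as listed in the class label table.
VALID_FAMILIES = frozenset(range(7))


def check_families(families):
    """Return the requested labels as a set, rejecting unknown or empty input."""
    families = set(families)
    invalid = families - VALID_FAMILIES
    if not families or invalid:
        raise ValueError(f"Expected a non-empty subset of 0 to 6, got {sorted(families)}")
    return families


def load_families(families=VALID_FAMILIES):
    """Load the chosen gene families and return (dataset, label encoder)."""
    # Deferred import: requires the sequentia package to be installed.
    from sequentia.datasets import load_gene_families

    # `families` is keyword-only in the documented signature.
    data, encoder = load_gene_families(families=check_families(families))
    return data, encoder
```

Calling load_families({1, 2}) would then load only the tyrosine kinase and tyrosine phosphatase sequences. The second element of the returned tuple is a fitted LabelEncoder, so its classes_ attribute lists the original symbols and inverse_transform maps integer codes back to them.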