Gene Families
API reference
- sequentia.datasets.load_gene_families(*, families={0, 1, 2, 3, 4, 5, 6})
Load a dataset of human DNA sequences grouped by gene family.
The Human DNA Sequences dataset consists of 4380 DNA sequences belonging to 7 gene families.
This dataset has imbalanced classes, and uses an sklearn.preprocessing.LabelEncoder to encode the original symbols (A, T, C, G, N) that form the DNA sequences into integers.
The gene families have the following class labels:
- G protein coupled receptors: 0
- Tyrosine kinase: 1
- Tyrosine phosphatase: 2
- Synthetase: 3
- Synthase: 4
- Ion channel: 5
- Transcription: 6
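The class labels above can be kept in a small lookup table, which is handy for validating a `families` subset before loading and for printing readable names afterwards. The following is only a sketch: `GENE_FAMILY_LABELS` and `family_names` are hypothetical helper names chosen for illustration and are not part of sequentia.

```python
# Hypothetical lookup table mirroring the class labels listed above.
GENE_FAMILY_LABELS = {
    0: "G protein coupled receptors",
    1: "Tyrosine kinase",
    2: "Tyrosine phosphatase",
    3: "Synthetase",
    4: "Synthase",
    5: "Ion channel",
    6: "Transcription",
}


def family_names(families):
    """Return the names of the requested gene families, ordered by class label.

    Raises ValueError if any label falls outside the documented range 0-6.
    """
    unknown = set(families) - set(GENE_FAMILY_LABELS)
    if unknown:
        raise ValueError(f"Unknown gene family labels: {sorted(unknown)}")
    return [GENE_FAMILY_LABELS[label] for label in sorted(families)]
```

For example, `family_names({0, 5})` returns the names for the G protein coupled receptor and ion channel families, while `family_names({7})` raises a `ValueError`.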
- Parameters:
families (set[int]) – Subset of gene families to include in the dataset.
- Returns:
A dataset object representing the loaded genetic data, followed by the label encoder used to encode the observation symbols into integers.
- Return type:
tuple[SequentialDataset, sklearn.preprocessing.LabelEncoder]
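A minimal usage sketch, assuming sequentia and scikit-learn are installed. It unpacks the returned tuple and uses the encoder to map integer codes back to DNA symbols. The wrapper names `load_families` and `decode`, and the `loader` parameter (which exists only so the function can be exercised without downloading the dataset), are hypothetical and not part of the library. Only `load_gene_families(families=...)` and `LabelEncoder.inverse_transform` are real calls.

```python
def load_families(families, loader=None):
    """Load a subset of gene families and return (dataset, encoder)."""
    if loader is None:
        # Imported lazily; requires the sequentia package.
        from sequentia.datasets import load_gene_families as loader
    # `families` is keyword-only in the documented signature.
    dataset, encoder = loader(families=set(families))
    return dataset, encoder


def decode(encoder, encoded):
    """Map a sequence of integer codes back to a string of DNA symbols."""
    # LabelEncoder.inverse_transform reverses the symbol-to-integer encoding.
    return "".join(encoder.inverse_transform(encoded))
```

Calling `load_families({0, 1})` would then load only the G protein coupled receptor and tyrosine kinase sequences, and `decode(encoder, codes)` recovers the original symbols for any encoded sequence taken from the dataset.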