Gene Families
API reference
Definitions
- sequentia.datasets.load_gene_families(*, families={0, 1, 2, 3, 4, 5, 6})
Load a dataset of human DNA sequences grouped by gene family.
The Human DNA Sequences dataset consists of 4380 DNA sequences belonging to 7 gene families.
This dataset has imbalanced classes, and uses an sklearn.preprocessing.LabelEncoder to encode the original symbols (A, T, C, G, N) that form the DNA sequences into integers.
The gene families have the following class labels:
- G protein coupled receptors: 0
- Tyrosine kinase: 1
- Tyrosine phosphatase: 2
- Synthetase: 3
- Synthase: 4
- Ion channel: 5
- Transcription: 6
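As an illustration of the symbol encoding mentioned above, the following pure-Python sketch mimics how a LabelEncoder maps symbols to integers: it sorts the classes it was fitted on and numbers them in that order. It assumes the encoder was fitted on exactly the five symbols A, T, C, G and N; the helper names `encode` and `decode` are hypothetical and are not part of sequentia or scikit-learn.

```python
# Sketch only: imitates the sorted-class integer encoding of
# sklearn.preprocessing.LabelEncoder without importing scikit-learn.
# Assumption: the encoder was fitted on exactly these five symbols.
SYMBOLS = ["A", "T", "C", "G", "N"]

# LabelEncoder stores its classes in sorted order, so the integer
# codes follow that order rather than the order listed above.
CLASSES = sorted(SYMBOLS)
SYMBOL_TO_CODE = {symbol: code for code, symbol in enumerate(CLASSES)}


def encode(sequence):
    """Map a DNA string to a list of integer codes (hypothetical helper)."""
    return [SYMBOL_TO_CODE[symbol] for symbol in sequence]


def decode(codes):
    """Map a list of integer codes back to a DNA string (hypothetical helper)."""
    return "".join(CLASSES[code] for code in codes)


encoded = encode("GATTACAN")
print(SYMBOL_TO_CODE)
print(encoded)
print(decode(encoded))
```

With the real dataset, the mapping actually in use can be read from the `classes_` attribute of the returned encoder rather than assumed.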
- Parameters:
families (set[int]) – Subset of gene families to include in the dataset. Each value must be an integer between 0 and 6 inclusive; all seven families are included by default.
- Returns:
A dataset object representing the loaded genetic data.
Label encoder used to encode the observation symbols into integers.
- Return type:
tuple[SequentialDataset, sklearn.preprocessing.LabelEncoder]
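A minimal usage sketch follows, assuming sequentia is installed. The wrapper `load_named_families` and the `FAMILY_NAMES` table are hypothetical conveniences built from the class labels listed above; only `load_gene_families` and its `families` keyword come from the documented signature. The library import is deferred until the call so that the name lookup can be checked on its own.

```python
# Hypothetical lookup table built from the class labels in this reference.
FAMILY_NAMES = {
    0: "G protein coupled receptors",
    1: "Tyrosine kinase",
    2: "Tyrosine phosphatase",
    3: "Synthetase",
    4: "Synthase",
    5: "Ion channel",
    6: "Transcription",
}


def load_named_families(names):
    """Load a subset of gene families selected by name (hypothetical helper).

    Returns the (dataset, encoder) tuple produced by load_gene_families.
    """
    label_of = {name: label for label, name in FAMILY_NAMES.items()}

    # Reject unknown names before touching the library.
    unknown = sorted(set(names) - set(label_of))
    if unknown:
        raise ValueError(f"Unknown gene families: {unknown}")

    # Deferred import: sequentia is only required when data is actually loaded.
    from sequentia.datasets import load_gene_families

    families = {label_of[name] for name in names}
    return load_gene_families(families=families)
```

A call such as `data, enc = load_named_families(["Synthetase", "Synthase"])` would then be equivalent to `load_gene_families(families={3, 4})`, returning the dataset together with the label encoder for the sequence symbols.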