nemo_microservices.data_designer.config.seed#
Module Contents#
Classes#
SeedConfig | Configuration for sampling data from a seed dataset.
API#
- class nemo_microservices.data_designer.config.seed.DatastoreSeedDatasetReference(/, **data: typing.Any)#
Bases:
nemo_microservices.data_designer.config.seed.SeedDatasetReference
- datastore_settings: nemo_microservices.data_designer.config.datastore.DatastoreSettings#
None
- property filename: str#
- property repo_id: str#
- class nemo_microservices.data_designer.config.seed.IndexRange(/, **data: typing.Any)#
Bases:
nemo_microservices.data_designer.config.base.ConfigBase
- end: int#
‘Field(…)’
- property size: int#
- start: int#
‘Field(…)’
- class nemo_microservices.data_designer.config.seed.LocalSeedDatasetReference(/, **data: typing.Any)#
Bases:
nemo_microservices.data_designer.config.seed.SeedDatasetReference
- validate_dataset_is_file(v: str) -> str#
- class nemo_microservices.data_designer.config.seed.PartitionBlock(/, **data: typing.Any)#
Bases:
nemo_microservices.data_designer.config.base.ConfigBase
- index: int#
‘Field(…)’
- num_partitions: int#
‘Field(…)’
- to_index_range(
- dataset_size: int,
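The arithmetic behind turning a partition into a row range can be sketched as follows. This is a minimal stand-alone illustration, not the library's implementation: the function name partition_to_index_range is hypothetical, the returned pair is assumed to be inclusive at both ends (matching the "rows 100-199" convention used for IndexRange(start=100, end=199) in the SeedConfig examples), and the handling of a remainder (the last partition absorbs it) is an assumption.

```python
from typing import Tuple


def partition_to_index_range(
    index: int, num_partitions: int, dataset_size: int
) -> Tuple[int, int]:
    """Return an inclusive (start, end) pair for one zero-based partition.

    Assumption: partitions are contiguous floor-sized blocks and the last
    partition absorbs any remainder. The library may split differently.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if not 0 <= index < num_partitions:
        raise ValueError("index must satisfy 0 <= index < num_partitions")
    if dataset_size < num_partitions:
        raise ValueError("dataset_size must be at least num_partitions")
    block = dataset_size // num_partitions
    start = index * block
    # The last partition runs to the final row so that no rows are dropped.
    end = dataset_size - 1 if index == num_partitions - 1 else start + block - 1
    return start, end
```

Under these assumptions, partition_to_index_range(2, 5, 1000) gives (400, 599), which is the third 20% block of a 1,000-row dataset.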
- class nemo_microservices.data_designer.config.seed.SamplingStrategy#
Bases:
str, enum.Enum
- ORDERED#
‘ordered’
- SHUFFLE#
‘shuffle’
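Because SamplingStrategy derives from both str and enum.Enum, its members behave like their string values. The sketch below uses a stand-in enum with the same two members rather than importing nemo_microservices, and parse_strategy is a hypothetical helper added only for illustration.

```python
import enum


class SamplingStrategy(str, enum.Enum):
    """Stand-in mirroring the documented members; not the library class."""

    ORDERED = "ordered"
    SHUFFLE = "shuffle"


def parse_strategy(value: str) -> SamplingStrategy:
    """Look up a member by its string value (hypothetical helper)."""
    return SamplingStrategy(value)


# Members compare equal to their raw string values.
assert SamplingStrategy.ORDERED == "ordered"
# Lookup by value returns the existing member, not a new object.
assert parse_strategy("shuffle") is SamplingStrategy.SHUFFLE
```

An unknown value such as "random" raises ValueError in this stand-in, which is standard enum.Enum behavior.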
- class nemo_microservices.data_designer.config.seed.SeedConfig(/, **data: typing.Any)#
Bases:
nemo_microservices.data_designer.config.base.ConfigBase
Configuration for sampling data from a seed dataset.
Args:
    dataset: Path or identifier for the seed dataset.
    sampling_strategy: Strategy for how to sample rows from the dataset.
        - ORDERED: Read rows sequentially in their original order.
        - SHUFFLE: Randomly shuffle rows before sampling. When used with selection_strategy, shuffling occurs within the selected range/partition.
    selection_strategy: Optional strategy to select a subset of the dataset.
        - IndexRange: Select a specific range of indices (e.g., rows 100-200).
        - PartitionBlock: Select a partition by splitting the dataset into N equal parts. Partition indices are zero-based (index=0 is the first partition, index=1 is the second, etc.).
Examples:
Read rows sequentially from start to end:
    SeedConfig(dataset="my_data.parquet", sampling_strategy=SamplingStrategy.ORDERED)
Read rows in random order:
    SeedConfig(dataset="my_data.parquet", sampling_strategy=SamplingStrategy.SHUFFLE)
Read specific index range (rows 100-199):
    SeedConfig(
        dataset="my_data.parquet",
        sampling_strategy=SamplingStrategy.ORDERED,
        selection_strategy=IndexRange(start=100, end=199)
    )
Read random rows from a specific index range (shuffles within rows 100-199):
    SeedConfig(
        dataset="my_data.parquet",
        sampling_strategy=SamplingStrategy.SHUFFLE,
        selection_strategy=IndexRange(start=100, end=199)
    )
Read from partition 2 (3rd partition, zero-based) of 5 partitions (20% of dataset):
    SeedConfig(
        dataset="my_data.parquet",
        sampling_strategy=SamplingStrategy.ORDERED,
        selection_strategy=PartitionBlock(index=2, num_partitions=5)
    )
Read shuffled rows from partition 0 of 10 partitions (shuffles within the partition):
    SeedConfig(
        dataset="my_data.parquet",
        sampling_strategy=SamplingStrategy.SHUFFLE,
        selection_strategy=PartitionBlock(index=0, num_partitions=10)
    )
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- dataset: str#
None
- sampling_strategy: nemo_microservices.data_designer.config.seed.SamplingStrategy#
None
- selection_strategy: Optional[Union[nemo_microservices.data_designer.config.seed.IndexRange, nemo_microservices.data_designer.config.seed.PartitionBlock]]#
None
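The way selection_strategy and sampling_strategy combine (select a subset first, then order or shuffle only within it) can be illustrated with plain Python lists. This is a sketch of the documented semantics, not the library's code: select_and_sample is a hypothetical function, the index range is assumed to be an inclusive (start, end) pair, and the seed parameter is an addition made here only so the shuffle is reproducible.

```python
import random
from typing import List, Optional, Sequence, Tuple


def select_and_sample(
    rows: Sequence[str],
    index_range: Optional[Tuple[int, int]] = None,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> List[str]:
    """Apply the selection first, then the sampling order.

    Assumptions: index_range is an inclusive (start, end) pair, and seed is a
    hypothetical parameter used only to make the shuffle reproducible.
    """
    if index_range is None:
        selected = list(rows)
    else:
        start, end = index_range
        selected = list(rows[start : end + 1])
    if shuffle:
        # Shuffle only the selected rows, so nothing outside the range appears.
        random.Random(seed).shuffle(selected)
    return selected
```

With ten rows named row0 to row9, select_and_sample(rows, (2, 5)) returns row2 through row5 in order, and passing shuffle=True returns the same four rows in a permuted order.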
- class nemo_microservices.data_designer.config.seed.SeedDatasetReference(/, **data: typing.Any)#
Bases:
abc.ABC, nemo_microservices.data_designer.config.base.ConfigBase
- dataset: str#
None