Important
You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.
Data
- class nemo.collections.common.data.dataset.ConcatDataset(*args: Any, **kwargs: Any)
Bases: IterableDataset
An iterable dataset that takes multiple datasets and samples from them according to the specified sampling technique.
- Parameters:
datasets (list) – A list of datasets to sample from.
shuffle (bool) – Whether to shuffle individual datasets. Only applies to map-style (non-iterable) datasets. Defaults to True.
sampling_technique (str) – Sampling technique to choose which dataset to draw a sample from. Defaults to ‘temperature’. Currently supports ‘temperature’, ‘random’ and ‘round-robin’.
sampling_temperature (int) – Temperature value for sampling. Only used when sampling_technique = ‘temperature’. Defaults to 5.
sampling_scale – Scale factor for upsampling or downsampling the dataset. Defaults to 1.
sampling_probabilities (list) – Probability values for sampling. Only used when sampling_technique = ‘random’.
seed – Optional value to seed the numpy RNG.
global_rank (int) – Worker rank, used for partitioning map-style datasets. Defaults to 0.
world_size (int) – Total number of processes, used for partitioning map-style datasets. Defaults to 1.
- get_iterable(dataset)
- static random_generator(datasets, **kwargs)
- static round_robin_generator(datasets, **kwargs)
- static temperature_generator(datasets, **kwargs)
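To illustrate how sampling_temperature could shape the choice of dataset, here is a minimal sketch in plain NumPy. It assumes the common formulation in which each dataset's share of the total samples is raised to the power 1 / temperature and then renormalized. The helper name temperature_weights is hypothetical and not part of NeMo, and the library's actual computation may differ.

```python
import numpy as np


def temperature_weights(lengths, temperature=5):
    """Hypothetical helper (not a NeMo API).

    Assumes weights proportional to (n_i / N) ** (1 / temperature),
    where n_i is the size of dataset i and N is the total size.
    """
    p = np.asarray(lengths, dtype=float)
    p = p / p.sum()                      # size-proportional probabilities
    p = np.power(p, 1.0 / temperature)   # temperature > 1 flattens the distribution
    return p / p.sum()                   # renormalize so the weights sum to 1


lengths = [9000, 1000]
w1 = temperature_weights(lengths, temperature=1)  # plain proportional sampling
w5 = temperature_weights(lengths, temperature=5)  # closer to uniform

# Draw a dataset index for each of 10 samples using the flattened weights.
rng = np.random.default_rng(0)
picks = rng.choice(len(lengths), size=10, p=w5)
```

Under these assumptions, a temperature of 1 reproduces size-proportional sampling, while larger values move the weights toward uniform and so upsample the smaller datasets.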
- class nemo.collections.common.data.dataset.ConcatMapDataset(*args: Any, **kwargs: Any)
Bases: Dataset
A map-style dataset that takes multiple datasets and samples from them according to the specified sampling technique.
- Parameters:
datasets (list) – A list of datasets to sample from.
sampling_technique (str) – Sampling technique to choose which dataset to draw a sample from. Defaults to ‘temperature’. Currently supports ‘temperature’, ‘random’ and ‘round-robin’.
sampling_temperature (int) – Temperature value for sampling. Only used when sampling_technique = ‘temperature’. Defaults to 5.
sampling_probabilities (list) – Probability values for sampling. Only used when sampling_technique = ‘random’.
seed – Optional value to seed the numpy RNG.
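The idea of a map-style concatenation driven by sampling_probabilities and seed can be sketched without any NeMo dependency. The class below is a hypothetical stand-in, not the NeMo implementation: it assumes the sample order is fixed once at construction, that the epoch length equals the total number of samples, and that a dataset wraps around when it runs out. It also uses numpy.random.default_rng for seeding, which may not match how the library seeds NumPy.

```python
import numpy as np


class TinyConcatMap:
    """Hypothetical stand-in (not the NeMo class): a map-style view over
    several sequences whose sample order is fixed at construction time."""

    def __init__(self, datasets, sampling_probabilities, seed=None):
        rng = np.random.default_rng(seed)
        total = sum(len(d) for d in datasets)  # assumed epoch length
        cursors = [0] * len(datasets)          # next position in each dataset
        self.datasets = datasets
        self.indices = []
        chosen = rng.choice(len(datasets), size=total, p=sampling_probabilities)
        for dataset_id in chosen:
            dataset_id = int(dataset_id)
            # Assumption: wrap around when a dataset runs out of samples.
            sample_id = cursors[dataset_id] % len(datasets[dataset_id])
            self.indices.append((dataset_id, sample_id))
            cursors[dataset_id] += 1

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        dataset_id, sample_id = self.indices[i]
        return self.datasets[dataset_id][sample_id]


sources = [["a0", "a1", "a2"], ["b0"]]
a = TinyConcatMap(sources, [0.5, 0.5], seed=0)
b = TinyConcatMap(sources, [0.5, 0.5], seed=0)
```

Because the index is precomputed from a seeded generator, two instances built with the same seed return the same item for the same position, which is the property a map-style dataset needs for random access.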