Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.
Data
- class nemo.collections.common.data.dataset.ConcatDataset(*args: Any, **kwargs: Any)
Bases: torch.utils.data.IterableDataset
An iterable-style dataset that takes multiple datasets as input and draws samples from them according to the specified sampling technique.
- Parameters
datasets (list) – A list of datasets to sample from.
shuffle (bool) – Whether to shuffle the individual datasets. Applies only to map-style (non-iterable) datasets. Defaults to True.
sampling_technique (str) – Sampling technique to choose which dataset to draw a sample from. Defaults to ‘temperature’. Currently supports ‘temperature’, ‘random’ and ‘round-robin’.
sampling_temperature (int) – Temperature value for sampling. Only used when sampling_technique = ‘temperature’. Defaults to 5.
sampling_scale – Scale factor for upsampling or downsampling the dataset. Defaults to 1.
sampling_probabilities (list) – Probability values for sampling. Only used when sampling_technique = ‘random’.
seed – Optional value to seed the numpy RNG.
global_rank (int) – Worker rank, used for partitioning map style datasets. Defaults to 0.
world_size (int) – Total number of processes, used for partitioning map style datasets. Defaults to 1.
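The effect of sampling_temperature can be illustrated with a minimal sketch. It assumes the common temperature-sampling formulation, in which each dataset's probability is proportional to (size / total) ** (1 / temperature). The function name temperature_probabilities is hypothetical and is not part of the NeMo API, and the formula used internally by ConcatDataset may differ.

```python
def temperature_probabilities(sizes, temperature=5):
    """Return one sampling probability per dataset.

    Assumption (not taken from the NeMo source): probabilities are
    proportional to (size / total) ** (1 / temperature).
    """
    total = sum(sizes)
    scaled = [(n / total) ** (1.0 / temperature) for n in sizes]
    norm = sum(scaled)
    return [s / norm for s in scaled]


# Three hypothetical datasets of very different sizes.
sizes = [1000, 100, 10]

# temperature=1 samples in proportion to dataset size.
probs_t1 = temperature_probabilities(sizes, temperature=1)

# A higher temperature flattens the distribution, so small datasets
# are drawn more often than their size alone would suggest.
probs_t5 = temperature_probabilities(sizes, temperature=5)

print(probs_t1)
print(probs_t5)
```

Under this assumption, raising the temperature moves the probabilities toward uniform, which upweights small datasets relative to size-proportional sampling.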
- get_iterable(dataset)
- static random_generator(datasets, **kwargs)
- static round_robin_generator(datasets, **kwargs)
- static temperature_generator(datasets, **kwargs)
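As a rough illustration of the ‘round-robin’ technique, the sketch below interleaves several iterables by taking one item from each in turn. It is an independent example, not NeMo's round_robin_generator: the function name round_robin and the choice to drop an iterable once it is exhausted are assumptions made here, and the library may handle exhaustion differently.

```python
def round_robin(iterables):
    """Yield one item from each iterable in turn.

    Assumption of this sketch: an iterable is dropped once it runs out,
    and iteration stops when all of them are exhausted.
    """
    iterators = [iter(it) for it in iterables]
    while iterators:
        still_active = []
        for it in iterators:
            try:
                yield next(it)
            except StopIteration:
                # This source is exhausted; do not keep it for the next round.
                continue
            still_active.append(it)
        iterators = still_active


# Hypothetical toy datasets of different lengths.
merged = list(round_robin([["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]))
print(merged)
```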
- class nemo.collections.common.data.dataset.ConcatMapDataset(*args: Any, **kwargs: Any)
Bases: torch.utils.data.Dataset
A map-style dataset that takes multiple datasets as input and draws samples from them according to the specified sampling technique.
- Parameters
datasets (list) – A list of datasets to sample from.
sampling_technique (str) – Sampling technique to choose which dataset to draw a sample from. Defaults to ‘temperature’. Currently supports ‘temperature’, ‘random’ and ‘round-robin’.
sampling_temperature (int) – Temperature value for sampling. Only used when sampling_technique = ‘temperature’. Defaults to 5.
sampling_probabilities (list) – Probability values for sampling. Only used when sampling_technique = ‘random’.
seed – Optional value to seed the numpy RNG.
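One way to picture how a map-style concatenation can combine sampling_probabilities with a seed is to precompute a fixed index of (dataset, sample) pairs, so that item i always resolves to the same sample. The sketch below is a simplified illustration under stated assumptions, not the ConcatMapDataset implementation: build_index is a hypothetical helper, it uses Python's random module rather than the NumPy RNG mentioned above, it assumes all probabilities are positive, and it visits every sample exactly once, which may not match the library's behavior.

```python
import random


def build_index(sizes, probabilities, seed=None):
    """Return a list of (dataset_id, sample_id) pairs covering every sample once.

    At each step, a dataset is drawn according to `probabilities`, restricted
    to the datasets that still have unused samples.
    """
    rng = random.Random(seed)  # seeded so the index is reproducible
    cursors = [0] * len(sizes)  # next unused sample in each dataset
    total = sum(sizes)
    index = []
    while len(index) < total:
        active = [i for i, n in enumerate(sizes) if cursors[i] < n]
        weights = [probabilities[i] for i in active]
        d = rng.choices(active, weights=weights, k=1)[0]
        index.append((d, cursors[d]))
        cursors[d] += 1
    return index


# Two hypothetical datasets with 3 and 2 samples.
index = build_index(sizes=[3, 2], probabilities=[0.7, 0.3], seed=0)
print(index)
```

Because the index is built once from a seeded generator, repeated runs with the same seed produce the same ordering, which is the property a map-style dataset needs for stable integer indexing.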