Data

class nemo.collections.common.data.dataset.ConcatDataset(*args: Any, **kwargs: Any)

Bases: IterableDataset

A dataset that accepts multiple datasets as an argument and samples from them according to the specified sampling technique.

Parameters
  • datasets (list) – A list of datasets to sample from.

  • shuffle – Whether to shuffle individual datasets. Only works with non-iterable datasets. Defaults to True.

  • sampling_technique (str) – Sampling technique to choose which dataset to draw a sample from. Defaults to ‘temperature’. Currently supports ‘temperature’, ‘random’ and ‘round-robin’.

  • sampling_temperature (int) – Temperature value for sampling. Only used when sampling_technique = ‘temperature’. Defaults to 5.

  • sampling_scale – Scale factor used to upsample or downsample the dataset. Defaults to 1.

  • sampling_probabilities (list) – Probability values for sampling. Only used when sampling_technique = ‘random’.

  • seed – Optional value to seed the numpy RNG.

  • global_rank (int) – Worker rank, used for partitioning map style datasets. Defaults to 0.

  • world_size (int) – Total number of processes, used for partitioning map style datasets. Defaults to 1.
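
The effect of sampling_temperature can be illustrated with a small sketch. It assumes the common formulation in which each dataset's share of the total size is raised to the power 1/T and then renormalized; the parameter descriptions above do not spell out the formula, so treat it as an assumption. The helper name `temperature_probabilities` is hypothetical and is not part of NeMo.

```python
def temperature_probabilities(lengths, temperature=5):
    """Per-dataset sampling probabilities.

    Assumed formula (not confirmed by the docs above):
    p_i is proportional to (n_i / N) ** (1 / T).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(lengths)
    # Start from size-proportional probabilities.
    proportional = [n / total for n in lengths]
    # Raising to 1/T flattens the distribution when T > 1.
    scaled = [p ** (1.0 / temperature) for p in proportional]
    norm = sum(scaled)
    return [s / norm for s in scaled]


# Two hypothetical datasets, one nine times larger than the other.
lengths = [9000, 1000]
probs_t1 = temperature_probabilities(lengths, temperature=1)
probs_t5 = temperature_probabilities(lengths, temperature=5)
```

Under this assumption, a temperature of 1 reproduces size-proportional sampling, while larger temperatures move the probabilities toward uniform, so smaller datasets are drawn from more often than their size alone would suggest.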

get_iterable(dataset)

static random_generator(datasets, **kwargs)

static round_robin_generator(datasets, **kwargs)

static temperature_generator(datasets, **kwargs)

class nemo.collections.common.data.dataset.ConcatMapDataset(*args: Any, **kwargs: Any)

Bases: Dataset

A dataset that accepts multiple datasets as an argument and samples from them according to the specified sampling technique.

Parameters
  • datasets (list) – A list of datasets to sample from.

  • sampling_technique (str) – Sampling technique to choose which dataset to draw a sample from. Defaults to ‘temperature’. Currently supports ‘temperature’, ‘random’ and ‘round-robin’.

  • sampling_temperature (int) – Temperature value for sampling. Only used when sampling_technique = ‘temperature’. Defaults to 5.

  • sampling_probabilities (list) – Probability values for sampling. Only used when sampling_technique = ‘random’.

  • seed – Optional value to seed the numpy RNG.
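
To illustrate how the ‘round-robin’ and ‘random’ techniques could decide which dataset each sample comes from, the sketch below builds a sequence of dataset indices. It is not NeMo's implementation: it uses Python's standard `random` module instead of the numpy RNG mentioned above, and the function names `round_robin_indices` and `random_indices` are hypothetical.

```python
import itertools
import random


def round_robin_indices(num_datasets, num_samples):
    """Cycle through dataset indices 0, 1, ..., num_datasets - 1 in order."""
    cycle = itertools.cycle(range(num_datasets))
    return list(itertools.islice(cycle, num_samples))


def random_indices(probabilities, num_samples, seed=None):
    """Draw dataset indices according to the given probabilities."""
    if abs(sum(probabilities) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    # A local RNG keeps the sequence repeatable for a fixed seed
    # without touching global random state.
    rng = random.Random(seed)
    population = range(len(probabilities))
    return rng.choices(population, weights=probabilities, k=num_samples)


# Three hypothetical datasets.
rr = round_robin_indices(num_datasets=3, num_samples=7)
first = random_indices([0.7, 0.2, 0.1], num_samples=10, seed=42)
second = random_indices([0.7, 0.2, 0.1], num_samples=10, seed=42)
```

In this sketch, `rr` visits the datasets in a fixed repeating order, and `first` and `second` are identical because both draws use the same seed, which is the role the seed parameter plays for reproducibility.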