Data

class nemo.collections.common.data.dataset.ConcatDataset(*args: Any, **kwargs: Any)

Bases: torch.utils.data.IterableDataset

An iterable-style dataset that wraps multiple datasets and draws samples from them according to the specified sampling technique.

Parameters
  • datasets (list) – A list of datasets to sample from.

  • shuffle (bool) – Whether to shuffle individual datasets. Only applies to map-style (non-iterable) datasets. Defaults to True.

  • sampling_technique (str) – Sampling technique to choose which dataset to draw a sample from. Defaults to ‘temperature’. Currently supports ‘temperature’, ‘random’ and ‘round-robin’.

  • sampling_temperature (int) – Temperature value for sampling. Only used when sampling_technique = ‘temperature’. Defaults to 5.

  • sampling_scale – Scale factor for upsampling or downsampling the dataset. Defaults to 1.

  • sampling_probabilities (list) – Probability values for sampling. Only used when sampling_technique = ‘random’.

  • seed – Optional value to seed the NumPy RNG.

  • global_rank (int) – Worker rank, used for partitioning map style datasets. Defaults to 0.

  • world_size (int) – Total number of processes, used for partitioning map style datasets. Defaults to 1.
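
The parameter list does not state how sampling_temperature enters the computation. The sketch below shows one common formulation of temperature sampling, in which each dataset's size-proportional probability is raised to the power 1/T and the result is renormalized. The helper name temperature_probabilities is hypothetical and is not part of NeMo; treat the formula as an assumption rather than a description of the library's internals.

```python
from typing import List, Sequence


def temperature_probabilities(lengths: Sequence[int], temperature: float = 5.0) -> List[float]:
    """Turn dataset sizes into sampling probabilities, flattened by `temperature`.

    Hypothetical helper for illustration only; not a NeMo function.
    """
    total = sum(lengths)
    # Start from size-proportional probabilities and raise each to the power 1/T.
    scaled = [(n / total) ** (1.0 / temperature) for n in lengths]
    # Renormalize so the probabilities sum to 1.
    norm = sum(scaled)
    return [s / norm for s in scaled]


# Two datasets with a 9:1 size ratio.
probs_t1 = temperature_probabilities([9000, 1000], temperature=1)  # size-proportional
probs_t5 = temperature_probabilities([9000, 1000], temperature=5)  # flatter distribution
```

Under this formulation, a temperature of 1 reproduces the size-proportional split, while larger temperatures move the split toward uniform, so smaller datasets are drawn from more often than their size alone would suggest.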

get_iterable(dataset)

static random_generator(datasets, **kwargs)

static round_robin_generator(datasets, **kwargs)

static temperature_generator(datasets, **kwargs)
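
The generator methods above are listed without descriptions. One plausible reading is that each yields a stream of dataset indices, which then decides which dataset supplies the next sample. The sketch below illustrates that idea with the standard library only. The names round_robin_indices, random_indices, and interleave are hypothetical, and restarting an exhausted dataset is an assumption made to keep the example short; none of this is taken from NeMo's source.

```python
import itertools
import random
from typing import Any, Iterable, Iterator, List, Optional, Sequence


def round_robin_indices(num_datasets: int) -> Iterator[int]:
    """Yield 0, 1, ..., num_datasets - 1 over and over (hypothetical helper)."""
    return itertools.cycle(range(num_datasets))


def random_indices(probabilities: Sequence[float], seed: Optional[int] = None) -> Iterator[int]:
    """Yield dataset indices drawn according to `probabilities` (hypothetical helper)."""
    rng = random.Random(seed)  # private RNG so the stream is reproducible per seed
    population = range(len(probabilities))
    while True:
        yield rng.choices(population, weights=probabilities, k=1)[0]


def interleave(datasets: Sequence[Iterable[Any]], index_stream: Iterator[int], limit: int) -> List[Any]:
    """Draw `limit` samples, taking each from the dataset the index stream names."""
    # Assumption: an exhausted dataset simply restarts from its beginning.
    iterators = [itertools.cycle(d) for d in datasets]
    return [next(iterators[idx]) for idx in itertools.islice(index_stream, limit)]


mixed = interleave([["a1", "a2"], ["b1", "b2"]], round_robin_indices(2), limit=4)
```

Separating the index stream from the sample lookup keeps the sampling policy swappable: the same interleave loop works with either generator.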

class nemo.collections.common.data.dataset.ConcatMapDataset(*args: Any, **kwargs: Any)

Bases: torch.utils.data.Dataset

A map-style dataset that wraps multiple datasets and draws samples from them according to the specified sampling technique.

Parameters
  • datasets (list) – A list of datasets to sample from.

  • sampling_technique (str) – Sampling technique to choose which dataset to draw a sample from. Defaults to ‘temperature’. Currently supports ‘temperature’, ‘random’ and ‘round-robin’.

  • sampling_temperature (int) – Temperature value for sampling. Only used when sampling_technique = ‘temperature’. Defaults to 5.

  • sampling_probabilities (list) – Probability values for sampling. Only used when sampling_technique = ‘random’.

  • seed – Optional value to seed the NumPy RNG.
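
Because this class derives from torch.utils.data.Dataset, it is accessed by integer index rather than by iteration. A minimal sketch of how a map-style wrapper could support that is shown below: it precomputes a list of (dataset index, sample index) pairs in round-robin order and resolves each lookup through that list. The class RoundRobinMapSketch is hypothetical and does not depend on PyTorch or NeMo; it is an illustration under stated assumptions, not the library's implementation.

```python
from typing import Any, List, Sequence, Tuple


class RoundRobinMapSketch:
    """Hypothetical map-style wrapper; illustration only, not NeMo code."""

    def __init__(self, datasets: Sequence[Sequence[Any]]) -> None:
        self.datasets = datasets
        self.indices: List[Tuple[int, int]] = []
        longest = max(len(d) for d in datasets)
        # Visit the datasets in turn, skipping any that have run out of samples.
        for sample_idx in range(longest):
            for dataset_idx, dataset in enumerate(datasets):
                if sample_idx < len(dataset):
                    self.indices.append((dataset_idx, sample_idx))

    def __len__(self) -> int:
        return len(self.indices)

    def __getitem__(self, idx: int) -> Any:
        # Resolve the flat index to a (dataset, sample) pair, then fetch the sample.
        dataset_idx, sample_idx = self.indices[idx]
        return self.datasets[dataset_idx][sample_idx]


combined = RoundRobinMapSketch([["a1", "a2", "a3"], ["b1"]])
samples = [combined[i] for i in range(len(combined))]
```

Precomputing the index list fixes the dataset length up front, which is what lets a standard sampler or DataLoader treat the combination as a single indexable dataset.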

© Copyright 2023-2024, NVIDIA. Last updated on Apr 12, 2024.