core.datasets.blended_dataset#

Module Contents#

Classes#

BlendedDataset

Class that combines (blends) a set of MegatronDataset instances into a single dataset

Data#

API#

core.datasets.blended_dataset.logger#

‘getLogger(…)’

core.datasets.blended_dataset._VERBOSE#

False

class core.datasets.blended_dataset.BlendedDataset(
datasets: List[megatron.core.datasets.megatron_dataset.MegatronDataset],
weights: List[Union[int, float]],
size: Optional[int],
config: megatron.core.datasets.blended_megatron_dataset_config.BlendedMegatronDatasetConfig,
)#

Bases: torch.utils.data.Dataset

Class that combines (blends) a set of MegatronDataset instances into a single dataset

Parameters:
  • datasets (List[MegatronDataset]) – The MegatronDataset instances to blend

  • weights (List[Union[int, float]]) – The weights that determine the dataset blend ratios

  • size (Optional[int]) – The number of samples to draw from the blend. If None, for each dataset index idx draw exactly weights[idx] samples from datasets[idx].

  • config (BlendedMegatronDatasetConfig) – The config

Raises:

RuntimeError – When the dataset has fewer or more samples than ‘size’ post-initialization
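As a rough illustration of how weights and size interact, the sketch below computes how many samples one would expect to come from each dataset. The helper name `expected_draws` is hypothetical and not part of the library. The `size is None` branch follows the parameter description above; the normalization in the other branch is an assumption about how blend ratios are typically applied, and the library's exact rounding is not shown here.

```python
from typing import List, Optional, Sequence, Union


def expected_draws(
    weights: Sequence[Union[int, float]], size: Optional[int]
) -> List[float]:
    """Hypothetical helper: expected number of samples per dataset."""
    if size is None:
        # Per the docs: draw exactly weights[idx] samples from datasets[idx].
        return [float(w) for w in weights]
    # Assumption: weights act as ratios, so normalize them and scale by size.
    total = float(sum(weights))
    return [size * w / total for w in weights]
```

For example, `expected_draws([1, 3], 8)` splits eight samples in a 1:3 ratio, while `expected_draws([100, 300], None)` treats the weights as literal sample counts.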

Initialization

__len__() → int#
__getitem__(
idx: int,
) → Dict[str, Union[int, numpy.ndarray]]#
_build_indices() → Tuple[numpy.ndarray, numpy.ndarray]#

Build and optionally cache the dataset index and the dataset sample index

The dataset index is a 1-D mapping which determines the dataset to query. The dataset sample index is a 1-D mapping which determines the sample to request from the queried dataset.

Returns:

The dataset index and the dataset sample index

Return type:

Tuple[numpy.ndarray, numpy.ndarray]
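The two mappings described above can be sketched in plain NumPy. This is a minimal illustration, not the library's implementation: `build_blend_indices_sketch` and `lookup` are hypothetical names, the greedy "pick the dataset furthest behind its target share" rule is one plausible way to honor the weights, and caching is omitted entirely.

```python
from typing import Any, Sequence, Tuple

import numpy as np


def build_blend_indices_sketch(
    weights: Sequence[float], size: int
) -> Tuple[np.ndarray, np.ndarray]:
    """Hypothetical greedy construction of the two 1-D index arrays."""
    ratios = np.asarray(weights, dtype=np.float64)
    ratios = ratios / ratios.sum()  # assumption: weights are treated as ratios

    dataset_index = np.zeros(size, dtype=np.int16)
    dataset_sample_index = np.zeros(size, dtype=np.int64)
    drawn = np.zeros(len(ratios), dtype=np.int64)  # samples taken so far

    for i in range(size):
        # Choose the dataset that lags furthest behind its target share.
        lag = ratios * max(i, 1) - drawn
        d = int(np.argmax(lag))
        dataset_index[i] = d                # which dataset to query
        dataset_sample_index[i] = drawn[d]  # which sample within that dataset
        drawn[d] += 1
    return dataset_index, dataset_sample_index


def lookup(
    datasets: Sequence[Sequence[Any]],
    dataset_index: np.ndarray,
    dataset_sample_index: np.ndarray,
    idx: int,
) -> Any:
    """Resolve a blended index to a sample, in the spirit of __getitem__."""
    d = dataset_index[idx]
    s = dataset_sample_index[idx]
    return datasets[d][s]
```

With this layout a lookup costs two array reads and one call into the underlying dataset, and each dataset's samples are requested in increasing order as the blended index advances.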