core.datasets.megatron_dataset#

Module Contents#

Classes#

MegatronDataset

The highest level wrapper class from which all dataset classes should inherit

Data#

API#

core.datasets.megatron_dataset.LowLevelDataset#

A type alias for the underlying low level dataset around which a MegatronDataset is built.

core.datasets.megatron_dataset._PAD_TOKEN_ID#

A module-level constant holding the token id used for padding.

class core.datasets.megatron_dataset.MegatronDataset(
dataset: core.datasets.megatron_dataset.LowLevelDataset,
dataset_path: Optional[str],
indices: numpy.ndarray,
num_samples: Optional[int],
index_split: megatron.core.datasets.utils.Split,
config: megatron.core.datasets.blended_megatron_dataset_config.BlendedMegatronDatasetConfig,
)#

Bases: abc.ABC, torch.utils.data.Dataset

The highest level wrapper class from which all dataset classes should inherit

Parameters:
  • dataset (LowLevelDataset) – The dataset around which to build the MegatronDataset

  • dataset_path (Optional[str]) – The real path on disk to the dataset, for bookkeeping

  • indices (numpy.ndarray) – The set of document indices to expose

  • num_samples (Optional[int]) – The minimum number of samples to build from the indexed dataset. When None, build as many samples as correspond to one epoch.

  • index_split (Split) – The indices Split

  • config (BlendedMegatronDatasetConfig) – The config

Initialization
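
The sketch below shows a minimal concrete subclass, assuming the base __init__ stores its arguments as self.dataset, self.indices, and so on. ToyDataset and the .npy token file it loads are hypothetical; real subclasses such as GPTDataset are normally built through BlendedMegatronDatasetBuilder rather than instantiated directly.

```python
from typing import Dict, Union

import numpy
import torch

from megatron.core.datasets.blended_megatron_dataset_config import (
    BlendedMegatronDatasetConfig,
)
from megatron.core.datasets.megatron_dataset import LowLevelDataset, MegatronDataset


class ToyDataset(MegatronDataset):
    """Hypothetical subclass exposing one sample per document index."""

    @staticmethod
    def numel_low_level_dataset(low_level_dataset: LowLevelDataset) -> int:
        # One element per document; used only to compute split boundaries
        return len(low_level_dataset)

    @staticmethod
    def build_low_level_dataset(
        dataset_path: str, config: BlendedMegatronDatasetConfig
    ) -> LowLevelDataset:
        # Assumption: dataset_path points at a .npy file of token-id rows
        return numpy.load(dataset_path, allow_pickle=True)

    def __len__(self) -> int:
        # One sample per exposed document index
        return len(self.indices)

    def __getitem__(self, idx: int) -> Dict[str, Union[torch.Tensor, numpy.ndarray]]:
        # Fetch one document and return it under an illustrative key
        tokens = numpy.asarray(self.dataset[self.indices[idx]], dtype=numpy.int64)
        return {"tokens": torch.from_numpy(tokens)}
```

Roughly, the builder calls build_low_level_dataset once per dataset path, uses numel_low_level_dataset together with the configured split to derive per-split indices, and then passes the low level dataset, indices, and config to the constructor above for each Split.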

abstractmethod static numel_low_level_dataset(
low_level_dataset: core.datasets.megatron_dataset.LowLevelDataset,
) → int#

Return the number of elements in the underlying low level dataset for the purpose of segregating the train/valid/test split indices

It may be that the low level dataset can be split any number of ways, depending on the mid level dataset it supports, which is why we define the “number of elements” function separately from the __len__ function here in the mid level dataset class

Parameters:

low_level_dataset (LowLevelDataset) – The underlying low level dataset

Returns:

The number of elements in the underlying low level dataset

Return type:

int
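
As an illustration, a subclass whose low level dataset is an IndexedDataset might count one element per document. This is a sketch, not any particular subclass's actual implementation, and it assumes the sequence_lengths array exposed by IndexedDataset.

```python
from megatron.core.datasets.indexed_dataset import IndexedDataset
from megatron.core.datasets.megatron_dataset import MegatronDataset


class MyIndexedDataset(MegatronDataset):  # hypothetical subclass
    @staticmethod
    def numel_low_level_dataset(low_level_dataset: IndexedDataset) -> int:
        # One element per document; the builder combines this count with the
        # configured split to produce the per-split document indices.
        return low_level_dataset.sequence_lengths.shape[0]
```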

abstractmethod static build_low_level_dataset(
dataset_path: str,
config: megatron.core.datasets.blended_megatron_dataset_config.BlendedMegatronDatasetConfig,
) → core.datasets.megatron_dataset.LowLevelDataset#

Build the low level dataset via a function to be called from within BlendedMegatronDatasetBuilder.build_generic_dataset

It may be that the low level dataset spans any subset of train/valid/test splits, which is why we define a static “build” function separately from the constructor in the mid level dataset class

Parameters:
  • dataset_path (str) – The real path on disk to the dataset, for bookkeeping

  • config (BlendedMegatronDatasetConfig) – The config

Returns:

The low level dataset

Return type:

LowLevelDataset
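
A sketch of a typical implementation that opens an IndexedDataset from a .bin/.idx path prefix. The subclass name is hypothetical, and real implementations may pass additional options from the config (for example whether to memory-map the data).

```python
from megatron.core.datasets.blended_megatron_dataset_config import (
    BlendedMegatronDatasetConfig,
)
from megatron.core.datasets.indexed_dataset import IndexedDataset
from megatron.core.datasets.megatron_dataset import MegatronDataset


class MyIndexedDataset(MegatronDataset):  # hypothetical subclass
    @staticmethod
    def build_low_level_dataset(
        dataset_path: str, config: BlendedMegatronDatasetConfig
    ) -> IndexedDataset:
        # dataset_path is a path prefix to a .bin/.idx pair; the builder
        # typically calls this once per prefix and reuses the result across splits.
        return IndexedDataset(dataset_path)
```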

static _key_config_attributes() → List[str]#

Return all config attributes which contribute to uniquely identifying the dataset.

These attributes will be used to build a uniquely identifying string and MD5 hash which will be used to cache/load dataset resources from run to run.

Returns:

The key config attributes

Return type:

List[str]
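
Subclasses that want extra config fields to participate in the cache key can extend the base list. The subclass and the my_extra_option field below are hypothetical.

```python
from typing import List

from megatron.core.datasets.megatron_dataset import MegatronDataset


class MyDataset(MegatronDataset):  # hypothetical subclass
    @staticmethod
    def _key_config_attributes() -> List[str]:
        # Extend the base attributes with any config fields that affect what
        # this subclass builds, so cached resources are regenerated when those
        # fields change. "my_extra_option" is a hypothetical config field.
        return MegatronDataset._key_config_attributes() + ["my_extra_option"]
```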

abstractmethod __len__() → int#

Return the length of the dataset

Returns:

See the concrete subclass implementation

Return type:

int
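
A sketch of one possible implementation: a subclass that builds a sample index (for example to repeat documents until num_samples is reached) reports the number of samples rather than the number of exposed documents. self.sample_index is a hypothetical attribute such a subclass would build itself.

```python
import numpy

from megatron.core.datasets.megatron_dataset import MegatronDataset


class MyDataset(MegatronDataset):  # hypothetical subclass
    def __len__(self) -> int:
        # self.sample_index is assumed to be a numpy array built by this
        # subclass, with one row per sample drawn from the document indices.
        return self.sample_index.shape[0]
```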

abstractmethod __getitem__(
idx: int,
) → Dict[str, Union[torch.Tensor, numpy.ndarray]]#

Return a sample from the dataset

Parameters:

idx (int) – The index into the dataset

Returns:

See the concrete subclass implementation

Return type:

Dict[str, Union[torch.Tensor, numpy.ndarray]]
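
A sketch of a next-token-prediction style implementation. The key names and the shift-by-one labeling are illustrative rather than part of the base class contract, and the attribute access assumes the base __init__ stores the constructor arguments as self.dataset and self.indices.

```python
from typing import Dict, Union

import numpy
import torch

from megatron.core.datasets.megatron_dataset import MegatronDataset


class MyDataset(MegatronDataset):  # hypothetical subclass
    def __getitem__(self, idx: int) -> Dict[str, Union[torch.Tensor, numpy.ndarray]]:
        # Fetch one document's token ids from the low level dataset; inputs are
        # the sequence minus its last token, labels are the sequence shifted by one.
        document = numpy.asarray(self.dataset[self.indices[idx]], dtype=numpy.int64)
        return {
            "tokens": torch.from_numpy(document[:-1]),
            "labels": torch.from_numpy(document[1:]),
        }
```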