core.datasets.retro.query.multi_split_gpt_dataset#

A MultiSplitGPTDataset can handle multiple intersecting split strings and can return all of the document IDs of a sample.
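To make "intersecting split strings" concrete, the sketch below treats each split string as a sequence of cumulative fractional ranges and keeps, per split, only the range covered by both strings. The helper names `parse_split` and `intersect_splits` and the example strings are hypothetical and chosen for illustration; this is a reading of the idea, not the library's implementation.

```python
from itertools import accumulate
from typing import List, Optional, Tuple

Range = Tuple[float, float]


def parse_split(split: str) -> List[Range]:
    """Turn a split string such as "98,2,0" into cumulative (start, end) fractions."""
    weights = [float(w) for w in split.split(",")]
    total = sum(weights)
    bounds = [0.0] + list(accumulate(w / total for w in weights))
    return list(zip(bounds[:-1], bounds[1:]))


def intersect_splits(split: str, split_preprocessing: str) -> List[Optional[Range]]:
    """Per split, keep the range covered by both split strings (None if empty)."""
    result: List[Optional[Range]] = []
    for (a0, a1), (b0, b1) in zip(parse_split(split), parse_split(split_preprocessing)):
        lo, hi = max(a0, b0), min(a1, b1)
        result.append((lo, hi) if lo < hi else None)
    return result


# Hypothetical example: the pretraining split and the preprocessing split differ.
ranges = intersect_splits("98,2,0", "99,1,0")
for name, rng in zip(("train", "valid", "test"), ranges):
    print(name, rng)
```

Under these assumptions, the train range is cut back to the part shared by both strings, and an empty split stays empty.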

Module Contents#

Classes#

MultiSplitGPTDatasetConfig

Configuration object for Megatron Core blended and Retro datasets.

MultiSplitGPTDataset

Retro’s customized GPT dataset.

Data#

API#

core.datasets.retro.query.multi_split_gpt_dataset.logger#

‘getLogger(…)’

class core.datasets.retro.query.multi_split_gpt_dataset.MultiSplitGPTDatasetConfig#

Bases: megatron.core.datasets.gpt_dataset.GPTDatasetConfig

Configuration object for Megatron Core blended and Retro datasets.

Parameters:
  • return_document_ids (bool) – Whether to return the document ids when querying the dataset. Turn this option on during preprocessing.

  • split_preprocessing (str) – The Retro preprocessing split string. It follows the same pattern convention as ‘split’. Not to be used with ‘blend_per_split’.

return_document_ids: bool#

None

split_preprocessing: str#

None

__post_init__() → None#

Validate config attributes.
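As a rough illustration of what validating these attributes could look like, the sketch below uses a small stand-in dataclass. `ToyMultiSplitConfig` is hypothetical and is not the library class, and the specific checks are assumptions drawn from the parameter descriptions above (both Retro-specific fields set, and `split` used rather than 'blend_per_split').

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyMultiSplitConfig:
    """Hypothetical stand-in for MultiSplitGPTDatasetConfig (not the library class)."""

    split: Optional[str] = None
    return_document_ids: Optional[bool] = None
    split_preprocessing: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumed check: 'split' must be given, since 'blend_per_split'
        # is documented as incompatible with 'split_preprocessing'.
        if self.split is None:
            raise ValueError("'split' is required; 'blend_per_split' is not supported")
        # Assumed checks: both Retro-specific fields must be set explicitly.
        if self.return_document_ids is None:
            raise ValueError("'return_document_ids' must be set")
        if self.split_preprocessing is None:
            raise ValueError("'split_preprocessing' must be set")


# Valid: all three fields are provided.
config = ToyMultiSplitConfig(
    split="98,2,0", return_document_ids=True, split_preprocessing="99,1,0"
)
```

Raising at construction time means a misconfigured run fails before any dataset is built.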

class core.datasets.retro.query.multi_split_gpt_dataset.MultiSplitGPTDataset(
indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
dataset_path: str,
indexed_indices: numpy.ndarray,
num_samples: int,
index_split: megatron.core.datasets.utils.Split,
config: core.datasets.retro.query.multi_split_gpt_dataset.MultiSplitGPTDatasetConfig,
)#

Bases: megatron.core.datasets.gpt_dataset.GPTDataset

Retro’s customized GPT dataset.

Parameters:
  • indexed_dataset (IndexedDataset) – The IndexedDataset around which to build the MegatronDataset.

  • dataset_path (str) – The real path on disk to the dataset, for bookkeeping.

  • indexed_indices (numpy.ndarray) – The set of document indices to expose.

  • num_samples (int) – The number of samples to draw from the indexed dataset.

  • index_split (Split) – The Split to which indexed_indices belongs.

  • config (MultiSplitGPTDatasetConfig) – The Retro-specific container for all config sourced parameters.

Initialization

__getitem__(idx: int) → Dict[str, numpy.ndarray]#

Get dataset sample.

Parameters:

idx (int) – The index into the dataset.

Returns:

The text ids and (optionally) the document ids wrapped in a dictionary.

Return type:

Dict[str, numpy.ndarray]
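The shape of the returned dictionary can be sketched as follows. The helper `build_sample` is hypothetical, the key names `"text"` and `"document_ids"` are assumptions rather than names confirmed on this page, and plain lists stand in for the numpy arrays the real dataset returns.

```python
from typing import Dict, List


def build_sample(
    text_ids: List[int], document_ids: List[int], return_document_ids: bool
) -> Dict[str, List[int]]:
    """Wrap the text ids, and optionally the document ids, in a dictionary."""
    sample = {"text": text_ids}  # assumed key name
    if return_document_ids:
        sample["document_ids"] = document_ids  # assumed key name
    return sample


# With the option on (as during preprocessing), document ids are included.
with_ids = build_sample([101, 7, 42], [3, 4], return_document_ids=True)
# With the option off, only the text ids are returned.
without_ids = build_sample([101, 7, 42], [3, 4], return_document_ids=False)
```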

static _key_config_attributes() → List[str]#

Add custom attributes for building unique dataset hash.

The split used for preprocessing constrains the samples available for pretraining.

Returns:

The key config attributes.

Return type:

List[str]
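Why the preprocessing split belongs in the hash can be shown with a small sketch: two configurations that differ only in `split_preprocessing` should not share a dataset identity. The base attribute list, the helper `dataset_hash`, and the use of SHA-256 over JSON are all assumptions made for illustration, not the library's hashing scheme.

```python
import hashlib
import json
from typing import Any, Dict, List

# Hypothetical base attributes; the parent class's actual list is not shown here.
BASE_KEY_ATTRIBUTES = ["random_seed", "sequence_length", "split"]


def key_config_attributes() -> List[str]:
    """Extend the base list with the Retro-specific preprocessing split."""
    return BASE_KEY_ATTRIBUTES + ["split_preprocessing"]


def dataset_hash(config: Dict[str, Any]) -> str:
    """Hash only the key attributes, so unrelated settings do not change the result."""
    payload = {name: config.get(name) for name in key_config_attributes()}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


base = {"random_seed": 1234, "sequence_length": 2048, "split": "98,2,0"}
hash_a = dataset_hash({**base, "split_preprocessing": "98,2,0"})
hash_b = dataset_hash({**base, "split_preprocessing": "99,1,0"})
```

In this sketch, changing only the preprocessing split yields a different hash, so cached indices built for one split are not reused for the other.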