`core.datasets.retro.query.multi_split_gpt_dataset`

A MultiSplitGPTDataset can handle multiple intersecting split strings and can return all of the document IDs of a sample.
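To make "intersecting split strings" concrete, here is a minimal sketch. It assumes a split string is a comma-separated list of weights (for example `99,1,0`) that is normalized into contiguous fractional ranges over the documents, and that two split strings are combined by intersecting those ranges position by position. The helper names `split_to_ranges` and `intersect_ranges` are hypothetical and are not part of Megatron Core.

```python
from typing import List, Optional, Tuple

Range = Tuple[float, float]


def split_to_ranges(split: str) -> List[Range]:
    """Turn a weight string like "98,2,0" into cumulative fractional ranges."""
    weights = [float(w) for w in split.split(",")]
    total = sum(weights)
    ranges, start = [], 0.0
    for w in weights:
        end = start + w / total
        ranges.append((start, end))
        start = end
    return ranges


def intersect_ranges(a: List[Range], b: List[Range]) -> List[Optional[Range]]:
    """Intersect two range lists position by position; None marks an empty overlap."""
    out: List[Optional[Range]] = []
    for (a0, a1), (b0, b1) in zip(a, b):
        lo, hi = max(a0, b0), min(a1, b1)
        out.append((lo, hi) if lo < hi else None)
    return out


# Pretraining split vs. the split that was used during preprocessing.
pretraining = split_to_ranges("99,1,0")
preprocessing = split_to_ranges("98,2,0")
overlap = intersect_ranges(pretraining, preprocessing)
# overlap is roughly [(0.0, 0.98), (0.99, 1.0), None]
```

Under these assumptions, only documents that fall in both the pretraining range and the preprocessing range of a given split remain available to that split.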

Module Contents

Classes

`MultiSplitGPTDatasetConfig`	Configuration object for Megatron Core blended and Retro datasets.
`MultiSplitGPTDataset`	Retro’s customized GPT dataset.

Data

logger

API

core.datasets.retro.query.multi_split_gpt_dataset.logger#: ‘getLogger(…)’

class core.datasets.retro.query.multi_split_gpt_dataset.MultiSplitGPTDatasetConfig#

Bases: megatron.core.datasets.gpt_dataset.GPTDatasetConfig

Configuration object for Megatron Core blended and Retro datasets.

Parameters:

return_document_ids (bool) – Whether to return the document ids when querying the dataset. Turn this option on during preprocessing.
split_preprocessing (str) – The Retro preprocessing split string. It follows the same pattern convention as ‘split’. Not to be used with ‘blend_per_split’.

return_document_ids: bool#: None

split_preprocessing: str#: None

__post_init__() → None#: Validate config attributes.
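The exact checks performed by `__post_init__` are not listed on this page. As a rough illustration, the sketch below uses a hypothetical stand-in dataclass (`SketchRetroConfig`, not the real `MultiSplitGPTDatasetConfig`) and assumes that validation means requiring both Retro-specific fields to be set explicitly and requiring `split` rather than `blend_per_split`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchRetroConfig:
    """Hypothetical stand-in; not the real MultiSplitGPTDatasetConfig."""

    split: Optional[str] = None
    return_document_ids: Optional[bool] = None
    split_preprocessing: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumed checks: 'split' must be given because 'blend_per_split'
        # is not supported, and both Retro-specific fields must be set.
        assert self.split is not None, "'split' is required"
        assert self.return_document_ids is not None, "set return_document_ids"
        assert self.split_preprocessing is not None, "set split_preprocessing"


# A fully specified config passes validation.
config = SketchRetroConfig(
    split="99,1,0",
    return_document_ids=True,
    split_preprocessing="98,2,0",
)
```

Leaving either Retro-specific field at its `None` default would raise an `AssertionError` in this sketch, which mirrors the `None` defaults shown for the two attributes above.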

class core.datasets.retro.query.multi_split_gpt_dataset.MultiSplitGPTDataset(indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset, dataset_path: str, indexed_indices: numpy.ndarray, num_samples: int, index_split: megatron.core.datasets.utils.Split, config: core.datasets.retro.query.multi_split_gpt_dataset.MultiSplitGPTDatasetConfig)#

Bases: megatron.core.datasets.gpt_dataset.GPTDataset

Retro’s customized GPT dataset.

Parameters:

indexed_dataset (IndexedDataset) – The IndexedDataset around which to build the MegatronDataset.
dataset_path (str) – The real path on disk to the dataset, for bookkeeping.
indexed_indices (numpy.ndarray) – The set of document indices to expose.
num_samples (int) – The number of samples to draw from the indexed dataset.
index_split (Split) – The indexed_indices Split.
config (MultiSplitGPTDatasetConfig) – The Retro-specific container for all config sourced parameters.

Initialization

__getitem__(idx: int) → Dict[str, numpy.ndarray]#

Get dataset sample.

Parameters: idx (int) – The index into the dataset.
Returns: The text ids and (optionally) the document ids wrapped in a dictionary.
Return type: Dict[str, numpy.ndarray]
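The shape of the returned dictionary can be sketched as follows. This is an illustration only: the key names `"text"` and `"document_ids"` and the helper `build_sample` are assumptions not confirmed by this page, and plain Python lists stand in for the `numpy.ndarray` values the real dataset returns.

```python
from typing import Dict, Sequence


def build_sample(
    text: Sequence[int],
    document_ids: Sequence[int],
    return_document_ids: bool,
) -> Dict[str, Sequence[int]]:
    """Wrap a sample in a dictionary, adding document ids only on request."""
    sample: Dict[str, Sequence[int]] = {"text": text}
    if return_document_ids:
        # Assumed key name; included only when the config flag is on.
        sample["document_ids"] = document_ids
    return sample


with_ids = build_sample([11, 12, 13], [4, 4, 5], return_document_ids=True)
without_ids = build_sample([11, 12, 13], [4, 4, 5], return_document_ids=False)

# Callers that must work in both modes can treat the ids as optional.
maybe_ids = without_ids.get("document_ids")
```

Because the document ids are optional, downstream code should not index the dictionary for them unless `return_document_ids` is known to be enabled.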

static _key_config_attributes() → List[str]#

Add custom attributes for building the unique dataset hash.

The split used during preprocessing constrains the samples available for pretraining, so it must be part of the hash.

Returns: The key config attributes.
Return type: List[str]
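The sketch below illustrates why the preprocessing split belongs among the key attributes. It assumes the dataset hash is computed from a serialized description of the key config attributes; the attribute list, the `dataset_hash` helper, and the use of SHA-256 over JSON are all hypothetical choices for illustration, not the library's actual scheme.

```python
import hashlib
import json
from typing import Dict, List


def key_config_attributes() -> List[str]:
    # Hypothetical attribute list; the base class's real list is not shown here.
    return ["random_seed", "sequence_length", "split", "split_preprocessing"]


def dataset_hash(config: Dict[str, object]) -> str:
    """Hash only the key attributes so unrelated settings do not change the hash."""
    description = {name: config[name] for name in key_config_attributes()}
    payload = json.dumps(description, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = {"random_seed": 1234, "sequence_length": 2048, "split": "99,1,0"}
hash_a = dataset_hash({**base, "split_preprocessing": "98,2,0"})
hash_b = dataset_hash({**base, "split_preprocessing": "95,5,0"})
# The two hashes differ, so indices built for one preprocessing split
# are not mistaken for indices built for the other.
```

In this sketch, changing only `split_preprocessing` yields a different hash, while settings outside the key list leave it unchanged.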
