core.datasets.retro.query.multi_split_gpt_dataset#
A MultiSplitGPTDataset can handle multiple intersecting split strings and can return all of the document IDs of a sample.
Module Contents#
Classes#
MultiSplitGPTDatasetConfig | Configuration object for Megatron Core blended and Retro datasets. |
MultiSplitGPTDataset | Retro’s customized GPT dataset. |
Data#
API#
- core.datasets.retro.query.multi_split_gpt_dataset.logger#
‘getLogger(…)’
- class core.datasets.retro.query.multi_split_gpt_dataset.MultiSplitGPTDatasetConfig#
Bases:
megatron.core.datasets.gpt_dataset.GPTDatasetConfig
Configuration object for Megatron Core blended and Retro datasets.
- Parameters:
return_document_ids (bool) – Whether to return the document ids when querying the dataset. Turn this option on during preprocessing.
split_preprocessing (str) – The Retro preprocessing split string. It follows the same pattern convention as ‘split’. Not to be used with ‘blend_per_split’.
- return_document_ids: bool#
None
- split_preprocessing: str#
None
- __post_init__() None#
Validate config attributes.
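The interaction between ‘split’ and ‘split_preprocessing’ can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: `SketchConfig`, `parse_split`, `to_ranges`, and `intersect` are hypothetical names, and the exact checks performed by `__post_init__` are an assumption based on the parameter descriptions above. The sketch normalizes both split strings and keeps, for each split, only the range that both strings agree on.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Range = Tuple[float, float]


def parse_split(split: str) -> List[float]:
    """Turn a comma-separated weight string such as "98,2,0" into fractions summing to 1."""
    weights = [float(w) for w in split.split(",")]
    total = sum(weights)
    return [w / total for w in weights]


def to_ranges(fractions: List[float]) -> List[Range]:
    """Convert fractions into consecutive (start, end) ranges over [0, 1]."""
    ranges, start = [], 0.0
    for fraction in fractions:
        ranges.append((start, start + fraction))
        start += fraction
    return ranges


def intersect(a: Range, b: Range) -> Optional[Range]:
    """Return the overlap of two ranges, or None when they do not overlap."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


@dataclass
class SketchConfig:
    """Hypothetical stand-in for the config; not the Megatron Core class."""

    split: Optional[str] = None
    return_document_ids: Optional[bool] = None
    split_preprocessing: Optional[str] = None
    split_ranges: List[Optional[Range]] = field(init=False, default_factory=list)

    def __post_init__(self) -> None:
        # Assumed validation: all three attributes must be set by the user.
        if self.split is None:
            raise ValueError("'split' is required; 'blend_per_split' is not supported here")
        if self.return_document_ids is None:
            raise ValueError("'return_document_ids' must be set explicitly")
        if self.split_preprocessing is None:
            raise ValueError("'split_preprocessing' must be set explicitly")
        # Keep only the part of each split that both split strings cover.
        self.split_ranges = [
            intersect(a, b)
            for a, b in zip(
                to_ranges(parse_split(self.split)),
                to_ranges(parse_split(self.split_preprocessing)),
            )
        ]


config = SketchConfig(split="98,2,0", return_document_ids=True, split_preprocessing="90,10,0")
print(config.split_ranges)
```

In this sketch, a training split of 98% intersected with a preprocessing split of 90% leaves only the first 90% of the documents available for training, and an empty split yields `None`.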
- class core.datasets.retro.query.multi_split_gpt_dataset.MultiSplitGPTDataset(
- indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
- dataset_path: str,
- indexed_indices: numpy.ndarray,
- num_samples: int,
- index_split: megatron.core.datasets.utils.Split,
- config: core.datasets.retro.query.multi_split_gpt_dataset.MultiSplitGPTDatasetConfig,
)
Bases:
megatron.core.datasets.gpt_dataset.GPTDataset
Retro’s customized GPT dataset.
- Parameters:
indexed_dataset (IndexedDataset) – The IndexedDataset around which to build the MegatronDataset.
dataset_path (str) – The real path on disk to the dataset, for bookkeeping.
indexed_indices (numpy.ndarray) – The set of document indices to expose.
num_samples (int) – The number of samples to draw from the indexed dataset.
index_split (Split) – The indexed_indices Split.
config (MultiSplitGPTDatasetConfig) – The Retro-specific container for all config sourced parameters.
Initialization
- __getitem__(idx: int) Dict[str, numpy.ndarray]#
Get dataset sample.
- Parameters:
idx (int) – The index into the dataset.
- Returns:
The text ids and (optionally) the document ids wrapped in a dictionary.
- Return type:
Dict[str, numpy.ndarray]
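The optional document ids in the returned dictionary can be pictured with the following sketch. `SketchDataset` is a hypothetical stand-in, the dictionary keys `"text"` and `"document_ids"` are assumptions rather than confirmed key names, and plain lists stand in for numpy arrays to keep the example dependency-free.

```python
from typing import Dict, List, Sequence, Tuple

# Each stored sample pairs token ids with the ids of the documents they came from.
Sample = Tuple[List[int], List[int]]


class SketchDataset:
    """Hypothetical stand-in that mimics the shape of the returned sample."""

    def __init__(self, samples: Sequence[Sample], return_document_ids: bool) -> None:
        self.samples = list(samples)
        self.return_document_ids = return_document_ids

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int) -> Dict[str, List[int]]:
        text, document_ids = self.samples[idx]
        sample = {"text": text}  # assumed key name
        if self.return_document_ids:
            # Only included when the config flag is turned on.
            sample["document_ids"] = document_ids  # assumed key name
        return sample


samples = [([11, 12, 13, 14], [0, 1]), ([21, 22, 23, 24], [1])]
with_ids = SketchDataset(samples, return_document_ids=True)
without_ids = SketchDataset(samples, return_document_ids=False)
print(with_ids[0])
print(without_ids[0])
```

A sample that spans a document boundary carries more than one document id, which is why the ids are returned as a sequence rather than a single value.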
- static _key_config_attributes() List[str]#
Add custom attributes for building the unique dataset hash.
The split used for preprocessing constrains the samples available for pretraining, so it must be part of the hash.
- Returns:
The key config attributes.
- Return type:
List[str]
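Why the preprocessing split belongs among the key attributes can be shown with a small sketch. The base attribute list, the helper names, and the hashing scheme (SHA-256 over a JSON description) are all assumptions chosen for illustration; the library's actual attribute list and hash function may differ.

```python
import hashlib
import json
from typing import Any, Dict, List

# Hypothetical base attributes; the real parent class may use a different list.
BASE_KEY_ATTRIBUTES = ["random_seed", "sequence_length", "split"]


def key_config_attributes() -> List[str]:
    """Extend the base attributes with the Retro-specific preprocessing split."""
    return BASE_KEY_ATTRIBUTES + ["split_preprocessing"]


def unique_hash(config: Dict[str, Any], keys: List[str]) -> str:
    """Hash only the selected attributes, in a stable key order."""
    description = json.dumps({key: config[key] for key in keys}, sort_keys=True)
    return hashlib.sha256(description.encode("utf-8")).hexdigest()


config_a = {"random_seed": 1234, "sequence_length": 2048,
            "split": "98,2,0", "split_preprocessing": "98,2,0"}
config_b = dict(config_a, split_preprocessing="90,10,0")

# Without the extra attribute the two configs collide; with it they do not.
same_without = unique_hash(config_a, BASE_KEY_ATTRIBUTES) == unique_hash(config_b, BASE_KEY_ATTRIBUTES)
same_with = unique_hash(config_a, key_config_attributes()) == unique_hash(config_b, key_config_attributes())
print(same_without, same_with)
```

Two configs that differ only in their preprocessing split expose different samples, so a hash that ignored this attribute could cause cached indices built for one config to be reused for the other.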