nemo_automodel.components.datasets.llm.megatron.gpt_dataset#
Module Contents#
Classes#
- BlendedMegatronDatasetConfig: Configuration object for Megatron Core datasets
- GPTDatasetConfig: Configuration object for Megatron Core GPT datasets
- GPTDataset: The base GPT dataset
Functions#
- parse_and_normalize_split: Parse the dataset split ratios from a string
- convert_split_vector_to_split_matrix: Build the split matrix from one or optionally two contributing split vectors.
- _build_document_index: Build an array with length = num epochs * num documents
- _build_shuffle_index: Build the range [0, size) and shuffle
- _get_ltor_masks_and_position_ids: Build masks and position ids for a left-to-right model.
- normalize: Do non-exponentiated normalization
Data#
API#
- nemo_automodel.components.datasets.llm.megatron.gpt_dataset.logger#
‘getLogger(…)’
- nemo_automodel.components.datasets.llm.megatron.gpt_dataset._PAD_TOKEN_ID#
None
- class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.Split(*args, **kwds)#
Bases:
enum.Enum
- train#
0
- valid#
1
- test#
2
- class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.BlendedMegatronDatasetConfig#
Configuration object for Megatron Core datasets
- random_seed: int#
None
The seed for all RNG during dataset creation.
- sequence_length: int#
None
The sequence length.
- blend: Optional[Tuple[List[str], Optional[List[float]]]]#
None
The blend, consisting of a list of dataset prefixes and optionally a list of dataset weights. For example, [["dataset-path1", "dataset-path2"], [0.3, 0.7]]. When the weights are None, they are inferred from the lengths of the contributing datasets. Not to be used with ‘blend_per_split’. Defaults to None.
- blend_per_split: Optional[List[Optional[Tuple[List[str], Optional[List[float]]]]]]#
None
A set of blends, as defined above, one for each split distribution. Not to be used with ‘blend’. Defaults to None.
- split: Optional[str]#
None
The split string, a comma separated weighting for the dataset splits when drawing samples from a single distribution. Not to be used with ‘blend_per_split’. Defaults to None.
- split_matrix: Optional[List[Tuple[float, float]]]#
‘field(…)’
The split matrix consisting of non-overlapping book-ends of each split in order. For more information, refer to ‘convert_split_vector_to_split_matrix’. Created automatically from ‘split’. Not to be passed in to the constructor.
- num_dataset_builder_threads: int#
1
The number of threads to use for dataset building.
- path_to_cache: Optional[str]#
None
Where all reusable dataset indices are to be cached.
- mmap_bin_files: bool#
True
Whether to mmap the .bin files or use file pointers.
- mock: bool#
‘field(…)’
Whether to bypass real data loading and validation in favor of mock data generation. Created automatically from ‘blend’ and ‘blend_per_split’. Not to be passed in to the constructor.
- tokenizer: Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase]#
None
The PreTrainedTokenizerBase instance. Required for datasets that do online tokenization.
- mid_level_dataset_surplus: float#
0.005
The sample surplus to build for the mid-level dataset(s). Defaults arbitrarily to 0.005. This value is irrelevant for single-source data blends. This value may need to be increased if the top-level dataset oversamples the mid-level dataset(s). This value may be set to 0.0 in the future if the top-level dataset is constrained to not oversample the mid-level dataset(s).
- __post_init__() None #
Run assertions and set derived fields after initialization
- class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig#
Bases:
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.BlendedMegatronDatasetConfig
Configuration object for Megatron Core GPT datasets
- reset_position_ids: Optional[bool]#
None
Option to reset the position IDs in the dataset at an interval
- reset_attention_mask: Optional[bool]#
None
Option to reset the attention mask from the dataset
- eod_mask_loss: Optional[bool]#
None
Option to enable the EOD mask loss
- create_attention_mask: bool#
True
Option to enable the attention masks generation. Can be disabled if attention kernel generates masks by itself.
- drop_last_partial_validation_sequence: bool#
True
Option to drop the last partial validation sequence
- add_extra_token_to_sequence: bool#
True
Option to draw sequences with one extra token to ensure the sample input tokens and sample output tokens are both of the desired sequence length
- __post_init__() None #
Run assertions and set derived fields after initialization
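A minimal construction sketch, for orientation only: the paths and values below are placeholders, and the tokenizer is assumed to be any transformers PreTrainedTokenizerBase instance (required when datasets tokenize online).

```python
# Hypothetical sketch; paths, sizes, and weights are placeholders.
from transformers import AutoTokenizer

from nemo_automodel.components.datasets.llm.megatron.gpt_dataset import GPTDatasetConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any PreTrainedTokenizerBase

config = GPTDatasetConfig(
    random_seed=1234,
    sequence_length=2048,
    # Two dataset prefixes blended 30/70; mutually exclusive with blend_per_split.
    blend=(["/data/corpus-a", "/data/corpus-b"], [0.3, 0.7]),
    split="99,1,0",  # 99% train, 1% validation, 0% test
    reset_position_ids=False,
    reset_attention_mask=False,
    eod_mask_loss=False,
    tokenizer=tokenizer,
)
```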
- nemo_automodel.components.datasets.llm.megatron.gpt_dataset.parse_and_normalize_split(split: str) List[float] #
Parse the dataset split ratios from a string
- Parameters:
split (str) – The train valid test split string e.g. "99,1,0"
- Returns:
The train valid test split ratios e.g. [0.99, 0.01, 0.0]
- Return type:
List[float]
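An illustrative call, following the docstring's example:

```python
from nemo_automodel.components.datasets.llm.megatron.gpt_dataset import (
    parse_and_normalize_split,
)

ratios = parse_and_normalize_split("99,1,0")
print(ratios)  # expected per the docstring: [0.99, 0.01, 0.0]
```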
- nemo_automodel.components.datasets.llm.megatron.gpt_dataset.convert_split_vector_to_split_matrix(
- vector_a: List[float],
- vector_b: Optional[List[float]] = None,
- ) List[Tuple[float, float]] #
Build the split matrix from one or optionally two contributing split vectors.
Ex. a standard conversion:
[0.99, 0.01, 0.0] -> [(0, 0.99), (0.99, 1.0), None]
Ex. a conversion for Retro when Retro pretraining uses a [0.99, 0.01, 0.0] split and Retro preprocessing used a [0.98, 0.02, 0.0] split:
[0.99, 0.01, 0.0], [0.98, 0.02, 0.0] -> [(0, 0.98), (0.99, 1.0), None]
- Parameters:
vector_a (List[float]) – The primary split vector
vector_b (Optional[List[float]]) – An optional secondary split vector which constrains the primary split vector. Defaults to None.
- Returns:
The split matrix consisting of book-ends of each split in order
- Return type:
List[Tuple[float, float]]
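The two conversions from the examples above, as runnable calls:

```python
from nemo_automodel.components.datasets.llm.megatron.gpt_dataset import (
    convert_split_vector_to_split_matrix,
)

# Standard conversion.
convert_split_vector_to_split_matrix([0.99, 0.01, 0.0])
# -> [(0, 0.99), (0.99, 1.0), None]

# Conversion constrained by a secondary vector (the Retro example).
convert_split_vector_to_split_matrix([0.99, 0.01, 0.0], [0.98, 0.02, 0.0])
# -> [(0, 0.98), (0.99, 1.0), None]
```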
- class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset(
- indexed_dataset: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset,
- dataset_path: Optional[str],
- indexed_indices: numpy.ndarray,
- num_samples: Optional[int],
- index_split: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.Split,
- config: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig,
- )
Bases:
torch.utils.data.Dataset
The base GPT dataset
- Parameters:
indexed_dataset (IndexedDataset) – The IndexedDataset around which to build the GPTDataset
dataset_path (Optional[str]) – The real path on disk to the dataset, for bookkeeping
indexed_indices (numpy.ndarray) – The set of document indices to expose
num_samples (Optional[int]) – The number of samples to draw from the indexed dataset. When None, build as many samples as correspond to one epoch.
index_split (Split) – The indexed_indices Split
config (GPTDatasetConfig) – The config
Initialization
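A hypothetical end-to-end construction sketch, assuming config is a GPTDatasetConfig as sketched earlier; the dataset path and sample count are placeholders.

```python
# Hypothetical sketch; "/data/corpus-a" and the sample count are placeholders.
import numpy

from nemo_automodel.components.datasets.llm.megatron.gpt_dataset import GPTDataset, Split

low_level = GPTDataset.build_low_level_dataset("/data/corpus-a", config)
num_elements = GPTDataset.numel_low_level_dataset(low_level)

dataset = GPTDataset(
    indexed_dataset=low_level,
    dataset_path="/data/corpus-a",
    indexed_indices=numpy.arange(num_elements, dtype=numpy.int32),
    num_samples=10_000,  # None builds exactly one epoch of samples
    index_split=Split.train,
    config=config,
)
```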
- static _key_config_attributes() List[str] #
Return all config attributes which contribute to uniquely identifying the dataset.
These attributes will be used to build a uniquely identifying string and MD5 hash which will be used to cache/load dataset resources from run to run.
- Returns:
The key config attributes
- Return type:
List[str]
- static numel_low_level_dataset(
- low_level_dataset: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset,
- ) int #
Abstract method implementation
For GPT, the underlying IndexedDataset should be split by sequence, as opposed to, say, BERT, which should be split by document
- Parameters:
low_level_dataset (IndexedDataset) – The underlying IndexedDataset
- Returns:
The number of unique elements in the underlying IndexedDataset
- Return type:
int
- static build_low_level_dataset(
- dataset_path: str,
- config: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig,
- ) nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset #
Abstract method implementation
- Parameters:
dataset_path (str) – The real path prefix to the IndexedDataset .bin and .idx files
config (GPTDatasetConfig) – The config
- Returns:
The underlying IndexedDataset
- Return type:
nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset
- __len__() int #
Abstract method implementation
- Returns:
The effective length of the dataset, capped by num_samples when provided
- Return type:
int
- __getitem__(idx: Optional[int]) dict[str, torch.Tensor] #
Abstract method implementation
- Parameters:
idx (Optional[int]) – The index into the dataset
- Returns:
The sample information wrapped in a dictionary
- Return type:
dict[str, torch.Tensor]
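Because GPTDataset subclasses torch.utils.data.Dataset and each sample is a dict of equally shaped tensors, it composes with a standard DataLoader. A usage sketch, reusing the hypothetical dataset from the construction example above:

```python
from torch.utils.data import DataLoader

# The default collate function stacks dict-of-tensor samples along dim 0.
loader = DataLoader(dataset, batch_size=8, shuffle=False)
batch = next(iter(loader))  # dict[str, torch.Tensor]
```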
- _query_document_sample_shuffle_indices(
- idx: int,
- ) Tuple[numpy.ndarray, numpy.ndarray] #
Get the text (token ids) and document ids for a given index
- Parameters:
idx (int) – The index into the dataset
- Returns:
The text ids and document ids
- Return type:
Tuple[numpy.ndarray, numpy.ndarray]
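Conceptually, the lookup chains the three indices built by _build_document_sample_shuffle_indices (documented next). Below is a simplified sketch of the single-document case, assuming an IndexedDataset.get(document_id, offset, length) accessor; helper and variable names are illustrative, not the actual implementation.

```python
import numpy

# Simplified sketch: samples spanning multiple documents are ignored here.
def query_sketch(idx, shuffle_index, sample_index, document_index, indexed_dataset):
    # 1. Map the external index through the shuffle index.
    sample = shuffle_index[idx]
    # 2. The sample index brackets each sample with (document position, offset) pairs.
    doc_beg, offset_beg = sample_index[sample]
    doc_end, offset_end = sample_index[sample + 1]
    assert doc_beg == doc_end, "sketch handles single-document samples only"
    # 3. The document index maps the position back to a real document id.
    document_id = document_index[doc_beg]
    tokens = indexed_dataset.get(
        document_id, offset=offset_beg, length=offset_end - offset_beg + 1
    )
    return tokens, numpy.array([document_id])
```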
- _build_document_sample_shuffle_indices() Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray] #
Build the document index, the sample index, and the shuffle index
- The document index: 1-D, an ordered array of document ids
- The sample index: 2-D, the document indices and offsets which mark the start of every sample
- The shuffle index: 1-D, a random permutation of the index range of the sample index
- Returns:
The document index, the sample index, and the shuffle index
- Return type:
Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]
- _get_num_tokens_per_epoch() int #
Calculate the number of tokens in a single epoch
- Returns:
The number of tokens in a single epoch
- Return type:
int
- _get_num_epochs(num_tokens_per_epoch: int) int #
Calculate the number of epochs
- Parameters:
num_tokens_per_epoch (int) – The number of tokens in a single epoch
- Returns:
The number of epochs
- Return type:
int
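The relationship between the two methods above can be illustrated with a sketch; this is an illustrative reimplementation, not the actual code, and it ignores the add_extra_token_to_sequence option.

```python
# Sketch: accumulate whole epochs until enough tokens exist to draw
# num_samples sequences of sequence_length tokens each.
def estimate_num_epochs(num_tokens_per_epoch: int, sequence_length: int, num_samples: int) -> int:
    num_epochs = 0
    total_tokens = 0
    while (total_tokens - 1) // sequence_length < num_samples:
        num_epochs += 1
        total_tokens += num_tokens_per_epoch
    return num_epochs
```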
- nemo_automodel.components.datasets.llm.megatron.gpt_dataset._build_document_index(
- documents: numpy.ndarray,
- num_epochs: int,
- numpy_random_state: numpy.random.RandomState,
- separate_final_epoch: bool,
- ) numpy.ndarray #
Build an array with length = num epochs * num documents
- Parameters:
documents (numpy.ndarray) – the subset of exposed document indices
num_epochs (int) – The number of epochs
numpy_random_state (numpy.random.RandomState) – The NumPy random state
separate_final_epoch (bool) – Whether to exclude the last epoch from the global shuffle
- Returns:
The document index
- Return type:
numpy.ndarray
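A behavioral sketch of the description above (illustrative, not the actual implementation):

```python
import numpy

def build_document_index_sketch(documents, num_epochs, numpy_random_state, separate_final_epoch):
    if not separate_final_epoch or num_epochs == 1:
        # Repeat the documents once per epoch and shuffle globally.
        index = numpy.tile(documents, num_epochs)
        numpy_random_state.shuffle(index)
        return index
    # Shuffle the first (num_epochs - 1) epochs and the final epoch independently.
    head = numpy.tile(documents, num_epochs - 1)
    tail = documents.copy()
    numpy_random_state.shuffle(head)
    numpy_random_state.shuffle(tail)
    return numpy.concatenate([head, tail])
```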
- nemo_automodel.components.datasets.llm.megatron.gpt_dataset._build_shuffle_index(
- num_samples: int,
- total_size: int,
- numpy_random_state: numpy.random.RandomState,
- ) numpy.ndarray #
Build the range [0, size) and shuffle
- Parameters:
num_samples (int) – The size of the first shuffle range [0, num_samples)
total_size (int) – The size of the entire index. If larger than ‘num_samples’, it defines the second shuffle range [num_samples, total_size)
numpy_random_state (numpy.random.RandomState) – The NumPy random state
- Returns:
The shuffle index
- Return type:
numpy.ndarray
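A behavioral sketch of the two-range shuffle described above (illustrative, not the actual implementation):

```python
import numpy

def build_shuffle_index_sketch(num_samples, total_size, numpy_random_state):
    # First range: [0, num_samples), shuffled.
    first = numpy.arange(num_samples, dtype=numpy.int64)
    numpy_random_state.shuffle(first)
    if total_size == num_samples:
        return first
    # Second range: [num_samples, total_size), shuffled independently.
    second = numpy.arange(num_samples, total_size, dtype=numpy.int64)
    numpy_random_state.shuffle(second)
    return numpy.concatenate((first, second))
```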
- nemo_automodel.components.datasets.llm.megatron.gpt_dataset._get_ltor_masks_and_position_ids(
- data: torch.Tensor,
- eod_token: int,
- reset_position_ids: bool,
- reset_attention_mask: bool,
- eod_mask_loss: bool,
- create_attention_mask: bool,
- ) Tuple[torch.Tensor, torch.Tensor, torch.Tensor] #
Build masks and position id for left to right model.
- Parameters:
data (torch.Tensor) – The data tensor that holds the tokens from the dataset
eod_token (int) – ID of the token that is considered the EOD
reset_position_ids (bool) – Switch to reset the document position IDs
reset_attention_mask (bool) – Switch to reset the attention mask
eod_mask_loss (bool) – Switch to enable the EOD mask loss
create_attention_mask (bool) – Switch to enable the attention masks generation. Can be disabled if attention kernel generates masks by itself.
- Returns:
torch.Tensor: The attention mask used for attention
torch.Tensor: The mask used for the loss value during training
torch.Tensor: The position IDs of the tokens
- Return type:
Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
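A minimal sketch of the causal-masking logic for a single sequence, ignoring the reset_position_ids and reset_attention_mask options; names are illustrative, not the actual implementation.

```python
import torch

def ltor_sketch(data: torch.Tensor, eod_token: int):
    seq_length = data.numel()
    # Lower-triangular (causal) mask; True marks positions to mask out.
    attention_mask = torch.tril(torch.ones((1, seq_length, seq_length))) < 0.5
    # Train on every token except EOD tokens (the eod_mask_loss behavior).
    loss_mask = torch.ones(seq_length, dtype=torch.float)
    loss_mask[data == eod_token] = 0.0
    position_ids = torch.arange(seq_length, dtype=torch.long)
    return attention_mask, loss_mask, position_ids
```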
- nemo_automodel.components.datasets.llm.megatron.gpt_dataset.normalize(weights: list[float]) list[float] #
Do non-exponentiated normalization, i.e. divide each weight by the sum of the weights rather than applying a softmax
- Parameters:
weights (List[float]) – The weights
- Returns:
The normalized weights
- Return type:
List[float]
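Reading "non-exponentiated" as plain divide-by-sum, an illustrative call:

```python
from nemo_automodel.components.datasets.llm.megatron.gpt_dataset import normalize

normalize([0.3, 0.7, 1.0])
# -> [0.15, 0.35, 0.5]: each weight divided by the sum, 2.0
```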