nemo_automodel.components.datasets.llm.megatron.gpt_dataset
Module Contents
Classes
Functions
Data
API
Configuration object for Megatron Core datasets
The blend, consisting of a list of dataset prefixes and optionally a list of dataset weights. For example, [["dataset-path1", "dataset-path2"], [0.3, 0.7]]. When the weights are None, they are inferred from the lengths of the contributing datasets. Not to be used with ‘blend_per_split’. Defaults to None.
A set of blends, as defined above, one for each split distribution. Not to be used with ‘blend’. Defaults to None.
The sample surplus to build for the mid-level dataset(s). Defaults arbitrarily to 0.005. This value is irrelevant for single-source data blends. It may need to be increased if the top-level dataset oversamples the mid-level dataset(s). It may be set to 0.0 in the future if the top-level dataset is constrained to not oversample the mid-level dataset(s).
Whether to mmap the .bin files or use file pointers.
Whether to bypass real data loading and validation in favor of mock data generation. Created automatically from ‘blend’ and ‘blend_per_split’. Not to be passed in to the constructor.
The number of threads to use for dataset building.
When set, the .idx files are downloaded to path_to_idx_cache and .bin files are streamed from S3/MSC via chunked GETs. mmap_bin_files is automatically overridden to False.
Where all reusable dataset indices are to be cached.
The seed for all RNG during dataset creation.
The sequence length.
The split string, a comma separated weighting for the dataset splits when drawing samples from a single distribution. Not to be used with ‘blend_per_split’. Defaults to None.
The split matrix consisting of non-overlapping book-ends of each split in order. For more information, refer to ‘convert_split_vector_to_split_matrix’. Created automatically from ‘split’. Not to be passed in to the constructor.
The PreTrainedTokenizerBase instance. Required for datasets that do online tokenization.
Do asserts and set fields post init
Bases: Dataset
The base GPT dataset
Parameters:
The IndexedDataset around which to build the GPTDataset
The real path on disk to the dataset, for bookkeeping
The set of document indices to expose
The number of samples to draw from the indexed dataset. When None, build as many samples as correspond to one epoch.
The indexed_indices Split
The config
Abstract method implementation
Parameters:
The index into the dataset
Returns: dict[str, torch.Tensor]
dict[str, torch.Tensor]: The sample information wrapped in a dictionary
Abstract method implementation
Returns: int
The effective length of the dataset, capped by num_samples when provided
Build the document index, the sample index, and the shuffle index
Returns: numpy.ndarray
Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]: The document index, the sample index, and the shuffle index
Calculate the number of epochs
Parameters:
The number of tokens in a single epoch
Returns: int
The number of epochs
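As a rough illustration of the epoch calculation described above, the sketch below counts how many passes over the data are needed to cover a requested number of samples. The name `estimate_num_epochs` and its parameters are hypothetical, and the assumption that each sample consumes `sequence_length` tokens (plus one optional trailing token for the whole request) is for illustration only, not the module's confirmed logic.

```python
from typing import Optional


def estimate_num_epochs(
    num_tokens_per_epoch: int,
    num_samples: Optional[int],
    sequence_length: int,
    add_extra_token: int = 1,
) -> int:
    """Hypothetical helper: smallest number of epochs covering the request."""
    # With no requested sample count, a single epoch is assumed.
    if num_samples is None:
        return 1
    # Tokens needed for all samples, plus the optional extra token.
    num_tokens_requested = num_samples * sequence_length + add_extra_token
    num_epochs = 1
    num_tokens = num_tokens_per_epoch
    # Keep adding epochs until enough tokens are available.
    while num_tokens < num_tokens_requested:
        num_epochs += 1
        num_tokens += num_tokens_per_epoch
    return num_epochs
```

For example, with 1000 tokens per epoch, a sequence length of 128, and 20 requested samples, 2561 tokens are needed, so this sketch reports three epochs.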
Calculate the number of tokens in a single epoch
Returns: int
The number of tokens in a single epoch
Return all config attributes which contribute to uniquely identifying the dataset.
These attributes will be used to build a uniquely identifying string and MD5 hash which will be used to cache/load dataset resources from run to run.
Returns: List[str]
List[str]: The key config attributes
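The caching idea above (a uniquely identifying string plus its MD5 hash) can be sketched with the standard library. The attribute names and the helper `describe_and_hash` are hypothetical, and serializing with `json.dumps` is an assumption made for illustration; the module may build its description differently.

```python
import hashlib
import json


def describe_and_hash(attributes: dict) -> tuple:
    """Hypothetical helper: build a stable description and its MD5 digest."""
    # sort_keys makes the description independent of insertion order.
    description = json.dumps(attributes, indent=4, sort_keys=True)
    digest = hashlib.md5(description.encode("utf-8")).hexdigest()
    return description, digest


# Illustrative attribute names only; the real key attributes may differ.
key_attributes = {"sequence_length": 2048, "random_seed": 1234, "split": "99,1,0"}
description, digest = describe_and_hash(key_attributes)
```

Two runs with identical key attributes produce the same digest, so cached index files named after the digest can be found again; changing any key attribute yields a different digest.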
Get the text (token ids) and document ids for a given index
Parameters:
The index into the dataset
Returns: Tuple[numpy.ndarray, numpy.ndarray]
Tuple[numpy.ndarray, numpy.ndarray]: The text ids and document ids
Abstract method implementation
Parameters:
The real path prefix to the IndexedDataset .bin and .idx files
The config
Returns: IndexedDataset
The underlying IndexedDataset
Abstract method implementation
For GPT, the underlying IndexedDataset should be split by sequence, as opposed to, say, BERT, which should be split by document
Parameters:
The underlying IndexedDataset
Returns: int
The number of unique elements in the underlying IndexedDataset
Bases: BlendedMegatronDatasetConfig
Configuration object for Megatron Core GPT datasets
Option to draw sequences with one extra token to ensure the sample input tokens and sample output tokens are both of the desired sequence length
Option to enable the attention masks generation. Can be disabled if attention kernel generates masks by itself.
Option to drop the last partial validation sequence
Option to enable the EOD mask loss
Option to reset the attention mask from the dataset
Option to reset the position IDs in the dataset at an interval
Do asserts and set fields post init
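To illustrate the extra-token option described above: drawing one token more than the sequence length lets both the inputs and the shifted targets have the full sequence length. The helper name `split_inputs_and_labels` is hypothetical and the token values are made up.

```python
def split_inputs_and_labels(tokens):
    """Hypothetical helper: shift a drawn sequence into inputs and targets."""
    # Inputs drop the last token; labels drop the first (next-token targets).
    return tokens[:-1], tokens[1:]


sequence_length = 4
# One extra token is drawn, so the sample holds sequence_length + 1 tokens.
drawn = [11, 12, 13, 14, 15]
inputs, labels = split_inputs_and_labels(drawn)
```

Without the extra token, the same shift would leave only `sequence_length - 1` input and target positions.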
Bases: enum.Enum
Dataset split identifiers used by Megatron GPT datasets.
Build an array with length = num epochs * num documents
Parameters:
the subset of exposed document indices
The number of epochs
The NumPy random state
Whether to exclude the last epoch from the global shuffle
Returns: numpy.ndarray
numpy.ndarray: The document index
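A minimal sketch of how such a document index could be built, assuming NumPy is available. The function name `build_document_index` is hypothetical, and the handling of the final epoch (shuffled on its own and appended last) is one plausible reading of the description above, not a confirmed implementation.

```python
import numpy


def build_document_index(documents, num_epochs, rng, separate_final_epoch):
    """Hypothetical sketch: repeat the documents once per epoch and shuffle."""
    if not separate_final_epoch or num_epochs == 1:
        # One copy of the exposed document ids per epoch, shuffled globally.
        index = numpy.tile(numpy.asarray(documents, dtype=numpy.int32), num_epochs)
        rng.shuffle(index)
        return index
    # Shuffle the first (num_epochs - 1) epochs together, then the last alone.
    first = build_document_index(documents, num_epochs - 1, rng, False)
    last = build_document_index(documents, 1, rng, False)
    return numpy.concatenate((first, last))


rng = numpy.random.RandomState(0)
document_index = build_document_index([3, 5, 7], 4, rng, True)
```

The result has length `num_epochs * len(documents)`; with `separate_final_epoch=True`, the last `len(documents)` entries contain each document exactly once.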
Build the range [0, size) and shuffle
Parameters:
The size of the first shuffle range [0, num_samples)
The size of the entire index. If larger than ‘num_samples’, it defines the second shuffle range [num_samples, total_size)
The NumPy random state
Returns: numpy.ndarray
numpy.ndarray: The shuffle index
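The two-range shuffle described above can be sketched as follows, assuming NumPy is available. The function name `build_shuffle_index` and the fixed `int64` dtype are assumptions for illustration.

```python
import numpy


def build_shuffle_index(num_samples, total_size, rng):
    """Hypothetical sketch: shuffle [0, num_samples) and the remainder separately."""
    first = numpy.arange(0, num_samples, dtype=numpy.int64)
    rng.shuffle(first)
    if num_samples == total_size:
        return first
    # The second range [num_samples, total_size) is shuffled on its own,
    # so its entries never mix with the first range.
    last = numpy.arange(num_samples, total_size, dtype=numpy.int64)
    rng.shuffle(last)
    return numpy.concatenate((first, last))


shuffle_index = build_shuffle_index(5, 8, numpy.random.RandomState(1234))
```

Because the ranges are shuffled independently, the first `num_samples` positions are always a permutation of `[0, num_samples)`.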
Build masks and position id for left to right model.
Parameters:
The data tensor that holds the tokens from the dataset
ID of the token that is considered the EOD
Switch to reset the document position IDs
Switch to reset the attention mask
Switch to enable the EOD mask loss
Switch to enable the attention masks generation. Can be disabled if attention kernel generates masks by itself.
Returns:
torch.Tensor: Attention mask needed to be used for Attention
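The effect of the EOD-related switches above can be illustrated on a small token list. This is a simplified sketch in plain Python lists rather than torch tensors; it omits the attention mask entirely, the function name is hypothetical, and the assumption that positions restart on the token after each EOD is for illustration only.

```python
def build_loss_mask_and_position_ids(tokens, eod_token, reset_position_ids, eod_mask_loss):
    """Hypothetical sketch: per-token loss mask and position ids."""
    loss_mask = [1.0] * len(tokens)
    position_ids = []
    next_position = 0
    for i, token in enumerate(tokens):
        position_ids.append(next_position)
        next_position += 1
        if token == eod_token:
            # Optionally exclude EOD tokens from the loss.
            if eod_mask_loss:
                loss_mask[i] = 0.0
            # Optionally restart positions for the next document.
            if reset_position_ids:
                next_position = 0
    return loss_mask, position_ids


tokens = [5, 6, 0, 7, 8, 9, 0, 4]  # 0 plays the role of the EOD token here
loss_mask, position_ids = build_loss_mask_and_position_ids(
    tokens, eod_token=0, reset_position_ids=True, eod_mask_loss=True
)
```

With both switches off, the loss mask is all ones and the position ids are simply `0..len(tokens) - 1`.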
Build the split matrix from one or optionally two contributing split vectors.
Ex. a standard conversion:
[0.99, 0.01, 0.0] -> [(0, 0.99), (0.99, 1.0), None]
Ex. a conversion for Retro when Retro pretraining uses a [0.99, 0.01, 0.0] split and Retro preprocessing used a [0.98, 0.02, 0.0] split:
[0.99, 0.01, 0.0], [0.98, 0.02, 0.0] -> [(0, 0.98), (0.99, 1.0), None]
Parameters:
The primary split vector
An optional secondary split vector which constrains the primary split vector. Defaults to None.
Returns: List[Optional[Tuple[float, float]]]
List[Optional[Tuple[float, float]]]: The split matrix consisting of the book-ends of each split in order, or None for an empty split
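The two conversions shown above can be reproduced with a short sketch. The function name `split_vector_to_matrix` is hypothetical; the approach (cumulative sums to obtain book-ends, then the per-split overlap of the two vectors) is an assumption chosen because it matches both examples.

```python
from itertools import accumulate


def split_vector_to_matrix(vector_a, vector_b=None):
    """Hypothetical sketch: book-ends per split, constrained by vector_b."""
    if vector_b is None:
        vector_b = vector_a
    # Cumulative boundaries, e.g. [0.99, 0.01, 0.0] -> [0, 0.99, 1.0, 1.0].
    bounds_a = list(accumulate([0, *vector_a]))
    bounds_b = list(accumulate([0, *vector_b]))
    matrix = []
    for (a_lo, a_hi), (b_lo, b_hi) in zip(
        zip(bounds_a[:-1], bounds_a[1:]), zip(bounds_b[:-1], bounds_b[1:])
    ):
        lo, hi = max(a_lo, b_lo), min(a_hi, b_hi)
        # An empty overlap means the split gets no data.
        matrix.append((lo, hi) if hi > lo else None)
    return matrix


standard = split_vector_to_matrix([0.99, 0.01, 0.0])
retro = split_vector_to_matrix([0.99, 0.01, 0.0], [0.98, 0.02, 0.0])
```

In the second call the training split is narrowed to the part both vectors agree on, while the validation split keeps the primary vector's range.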
Do non-exponentiated normalization
Parameters:
The weights
Returns: list[float]
List[float]: The normalized weights
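"Non-exponentiated" here is read as plain division by the sum, as opposed to a softmax. A minimal sketch under that assumption, with a hypothetical function name:

```python
def normalize_weights(weights):
    """Hypothetical sketch: scale weights so they sum to 1 (no exponentiation)."""
    total = sum(weights)
    return [w / total for w in weights]


normalized = normalize_weights([3, 7])
```

Dataset weights of 3 and 7 therefore become sampling proportions of roughly 0.3 and 0.7.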
Parse the dataset split ratios from a string
Parameters:
The train valid test split string e.g. “99,1,0”
Returns: List[float]
List[float]: The train valid test split ratios e.g. [0.99, 0.01, 0.0]
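A small sketch of the parsing step, reproducing the "99,1,0" example. The function name `parse_split` is hypothetical, and padding missing entries with zeros is an assumption made for illustration.

```python
def parse_split(split, num_splits=3):
    """Hypothetical sketch: comma-separated weights -> normalized ratios."""
    values = [float(part) for part in split.split(",") if part.strip()]
    # Assumption: splits that are not listed receive a weight of zero.
    values += [0.0] * (num_splits - len(values))
    assert len(values) == num_splits, "too many split weights"
    assert all(v >= 0.0 for v in values), "split weights must be non-negative"
    total = sum(values)
    return [v / total for v in values]


ratios = parse_split("99,1,0")
```

The resulting ratios sum to 1 and can then be turned into per-split book-ends.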