nemo_automodel.components.datasets.llm.megatron.gpt_dataset#

Module Contents#

Classes#

Split

BlendedMegatronDatasetConfig

Configuration object for Megatron Core datasets

GPTDatasetConfig

Configuration object for Megatron Core GPT datasets

GPTDataset

The base GPT dataset

Functions#

parse_and_normalize_split

Parse the dataset split ratios from a string

convert_split_vector_to_split_matrix

Build the split matrix from one or optionally two contributing split vectors.

_build_document_index

Build an array with length = num epochs * num documents

_build_shuffle_index

Build the range [0, size) and shuffle

_get_ltor_masks_and_position_ids

Build masks and position IDs for a left-to-right model.

normalize

Do non-exponentiated normalization

Data#

API#

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.logger#

‘getLogger(…)’

nemo_automodel.components.datasets.llm.megatron.gpt_dataset._PAD_TOKEN_ID#

None

class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.Split(*args, **kwds)#

Bases: enum.Enum

train#

0

valid#

1

test#

2

class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.BlendedMegatronDatasetConfig#

Configuration object for Megatron Core datasets

random_seed: int#

None

The seed for all RNG during dataset creation.

sequence_length: int#

None

The sequence length.

blend: Optional[Tuple[List[str], Optional[List[float]]]]#

None

The blend, consisting of a list of dataset prefixes and optionally a list of dataset weights. For example, [["dataset-path1", "dataset-path2"], [0.3, 0.7]]. When the weights are None, they are inferred from the lengths of the contributing datasets. Not to be used with 'blend_per_split'. Defaults to None.

blend_per_split: Optional[List[Optional[Tuple[List[str], Optional[List[float]]]]]]#

None

A set of blends, as defined above, one for each split distribution. Not to be used with 'blend'. Defaults to None.

split: Optional[str]#

None

The split string, a comma separated weighting for the dataset splits when drawing samples from a single distribution. Not to be used with ‘blend_per_split’. Defaults to None.

split_matrix: Optional[List[Tuple[float, float]]]#

‘field(…)’

The split matrix consisting of non-overlapping book-ends of each split in order. For more information, refer to ‘convert_split_vector_to_split_matrix’. Created automatically from ‘split’. Not to be passed in to the constructor.

num_dataset_builder_threads: int#

1

The number of threads to use for dataset building.

path_to_cache: Optional[str]#

None

Where all reusable dataset indices are to be cached.

mmap_bin_files: bool#

True

Whether to mmap the .bin files or use file pointers.

mock: bool#

‘field(…)’

Whether to bypass real data loading and validation in favor of mock data generation. Created automatically from ‘blend’ and ‘blend_per_split’. Not to be passed in to the constructor.

tokenizer: Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase]#

None

The PreTrainedTokenizerBase instance. Required for datasets that do online tokenization.

mid_level_dataset_surplus: float#

0.005

The sample surplus to build for the mid-level dataset(s). Defaults arbitrarily to 0.005. This value is irrelevant for single-source data blends. This value may need to be increased if the top-level dataset oversamples the mid-level dataset(s). This value may be set to 0.0 in the future if the top-level dataset is constrained to not oversample the mid-level dataset(s).

__post_init__() None#

Run assertions and set derived fields after initialization.
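
A construction sketch for the two mutually exclusive ways to describe data sources, 'blend' plus 'split' for a single distribution versus 'blend_per_split'. Field names follow the documentation above; that the class accepts them as dataclass keyword arguments is an assumption, and the paths are placeholders.

```python
# Illustrative sketch only: field names match the documentation above; dataclass
# keyword construction is an assumption, and the dataset paths are placeholders.
from nemo_automodel.components.datasets.llm.megatron.gpt_dataset import (
    BlendedMegatronDatasetConfig,
)

# Single distribution: one blend plus a train/valid/test split string.
config_a = BlendedMegatronDatasetConfig(
    random_seed=1234,
    sequence_length=2048,
    blend=(["dataset-path1", "dataset-path2"], [0.3, 0.7]),
    split="99,1,0",
)

# Per-split blends: one (prefixes, weights) entry per split; 'split' stays unset.
config_b = BlendedMegatronDatasetConfig(
    random_seed=1234,
    sequence_length=2048,
    blend_per_split=[
        (["train-path"], None),  # train; weights inferred from dataset lengths
        (["valid-path"], None),  # valid
        None,                    # no test data
    ],
)
```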

class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig#

Bases: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.BlendedMegatronDatasetConfig

Configuration object for Megatron Core GPT datasets

reset_position_ids: Optional[bool]#

None

Option to reset the position IDs in the dataset at an interval

reset_attention_mask: Optional[bool]#

None

Option to reset the attention mask from the dataset

eod_mask_loss: Optional[bool]#

None

Option to enable the EOD mask loss

create_attention_mask: bool#

True

Option to enable attention mask generation. Can be disabled if the attention kernel generates the mask by itself.

drop_last_partial_validation_sequence: bool#

True

Option to drop the last partial validation sequence

add_extra_token_to_sequence: bool#

True

Option to draw sequences with one extra token to ensure the sample input tokens and sample output tokens are both of the desired sequence length

__post_init__() None#

Run assertions and set derived fields after initialization.

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.parse_and_normalize_split(split: str) List[float]#

Parse the dataset split ratios from a string

Parameters:

split (str) – The train valid test split string, e.g. "99,1,0"

Returns:

The train valid test split ratios, e.g. [0.99, 0.01, 0.0]

Return type:

List[float]
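
A minimal usage sketch mirroring the docstring example:

```python
from nemo_automodel.components.datasets.llm.megatron.gpt_dataset import (
    parse_and_normalize_split,
)

# The weights are normalized to sum to 1.0.
ratios = parse_and_normalize_split("99,1,0")
print(ratios)  # [0.99, 0.01, 0.0]
```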

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.convert_split_vector_to_split_matrix(
vector_a: List[float],
vector_b: Optional[List[float]] = None,
) List[Optional[Tuple[float, float]]]#

Build the split matrix from one or optionally two contributing split vectors.

Ex. a standard conversion:

[0.99, 0.01, 0.0] -> [(0, 0.99), (0.99, 1.0), None]

Ex. a conversion for Retro when Retro pretraining uses a [0.99, 0.01, 0.0] split and Retro preprocessing used a [0.98, 0.02, 0.0] split:

[0.99, 0.01, 0.0], [0.98, 0.02, 0.0] -> [(0, 0.98), (0.99, 1.0), None]

Parameters:
  • vector_a (List[float]) – The primary split vector

  • vector_b (Optional[List[float]]) – An optional secondary split vector which constrains the primary split vector. Defaults to None.

Returns:

The split matrix consisting of book-ends of each split in order

Return type:

List[Optional[Tuple[float, float]]]
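
The two conversions above, reproduced as a runnable sketch:

```python
from nemo_automodel.components.datasets.llm.megatron.gpt_dataset import (
    convert_split_vector_to_split_matrix,
)

# Standard conversion: cumulative book-ends; zero-weight splits become None.
print(convert_split_vector_to_split_matrix([0.99, 0.01, 0.0]))
# [(0, 0.99), (0.99, 1.0), None]

# Constrained conversion: vector_b bounds each split of vector_a.
print(convert_split_vector_to_split_matrix([0.99, 0.01, 0.0], [0.98, 0.02, 0.0]))
# [(0, 0.98), (0.99, 1.0), None]
```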

class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset(
indexed_dataset: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset,
dataset_path: Optional[str],
indexed_indices: numpy.ndarray,
num_samples: Optional[int],
index_split: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.Split,
config: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig,
)#

Bases: torch.utils.data.Dataset

The base GPT dataset

Parameters:
  • indexed_dataset (IndexedDataset) – The IndexedDataset around which to build the GPTDataset

  • dataset_path (Optional[str]) – The real path on disk to the dataset, for bookkeeping

  • indexed_indices (numpy.ndarray) – The set of document indices to expose

  • num_samples (Optional[int]) – The number of samples to draw from the indexed dataset. When None, build as many samples as correspond to one epoch.

  • index_split (Split) – The indexed_indices Split

  • config (GPTDatasetConfig) – The config

Initialization
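
A hedged end-to-end construction sketch. In practice these pieces usually come from a dataset builder; the path prefix and tokenizer below are placeholders, and passing the GPTDatasetConfig fields shown earlier as keyword arguments is an assumption.

```python
import numpy
from transformers import AutoTokenizer

from nemo_automodel.components.datasets.llm.megatron.gpt_dataset import (
    GPTDataset,
    GPTDatasetConfig,
    Split,
)

# Hypothetical path prefix for existing .bin/.idx files.
PREFIX = "/data/my-corpus_text_document"

config = GPTDatasetConfig(
    random_seed=1234,
    sequence_length=2048,
    blend=([PREFIX], None),
    split="99,1,0",
    reset_position_ids=False,
    reset_attention_mask=False,
    eod_mask_loss=False,
    tokenizer=AutoTokenizer.from_pretrained("gpt2"),  # placeholder tokenizer
)

indexed = GPTDataset.build_low_level_dataset(PREFIX, config)
# For GPT the low-level dataset is split by sequence; expose every element.
documents = numpy.arange(GPTDataset.numel_low_level_dataset(indexed))

train_ds = GPTDataset(
    indexed_dataset=indexed,
    dataset_path=PREFIX,
    indexed_indices=documents,
    num_samples=10_000,  # None would build exactly one epoch of samples
    index_split=Split.train,
    config=config,
)
```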

static _key_config_attributes() List[str]#

Return all config attributes which contribute to uniquely identifying the dataset.

These attributes will be used to build a uniquely identifying string and MD5 hash which will be used to cache/load dataset resources from run to run.

Returns:

The key config attributes

Return type:

List[str]

static numel_low_level_dataset(
low_level_dataset: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset,
) int#

Abstract method implementation

For GPT, the underlying IndexedDataset should be split by sequence, as opposed to, say, BERT, which should be split by document

Parameters:

low_level_dataset (IndexedDataset) – The underlying IndexedDataset

Returns:

The number of unique elements in the underlying IndexedDataset

Return type:

int

static build_low_level_dataset(
dataset_path: str,
config: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig,
) nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset#

Abstract method implementation

Parameters:
  • dataset_path (str) – The real path prefix to the IndexedDataset .bin and .idx files

  • config (GPTDatasetConfig) – The config

Returns:

The underlying IndexedDataset

Return type:

IndexedDataset

__len__() int#

Abstract method implementation

Returns:

The effective length of the dataset, capped by num_samples when provided

Return type:

int

__getitem__(idx: Optional[int]) dict[str, torch.Tensor]#

Abstract method implementation

Parameters:

idx (Optional[int]) – The index into the dataset

Returns:

The sample information wrapped in a dictionary

Return type:

dict[str, torch.Tensor]
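
Continuing the construction sketch above, each sample is a dictionary of tensors. The exact key set is not documented on this page; Megatron-style GPT datasets typically expose 'tokens', 'labels', 'loss_mask', 'position_ids', and, when create_attention_mask is set, 'attention_mask', so treat the names below as assumptions.

```python
# Key names are assumed, not confirmed by this page; inspect what is present.
sample = train_ds[0]
for name, tensor in sample.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```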

_query_document_sample_shuffle_indices(
idx: int,
) Tuple[numpy.ndarray, numpy.ndarray]#

Get the text (token ids) and document ids for a given index

Parameters:

idx (int) – The index into the dataset

Returns:

The text ids and document ids

Return type:

Tuple[numpy.ndarray, numpy.ndarray]
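
Conceptually, this lookup composes the three indices described next: the shuffle index randomizes sample order, the sample index maps each sample to (document position, token offset) book-ends, and the document index resolves positions to concrete document ids. A simplified sketch of that composition, using a hypothetical helper rather than the module's exact code:

```python
# Hypothetical helper illustrating how the three indices cooperate.
def locate_sample(idx, shuffle_index, sample_index, document_index):
    shuffled = shuffle_index[idx]
    # Consecutive sample-index rows bracket the sample's tokens.
    doc_pos_start, offset_start = sample_index[shuffled]
    doc_pos_end, offset_end = sample_index[shuffled + 1]
    # Every document contributing to this sample, in order.
    doc_ids = document_index[doc_pos_start : doc_pos_end + 1]
    return doc_ids, offset_start, offset_end
```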

_build_document_sample_shuffle_indices() Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]#

Build the document index, the sample index, and the shuffle index

The document index: a 1-D ordered array of document ids

The sample index: a 2-D array of the document indices and offsets which mark the start of every sample

The shuffle index: a 1-D random permutation of the index range of the sample index

Returns:

The document index, the sample index, and the shuffle index

Return type:

Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

_get_num_tokens_per_epoch() int#

Calculate the number of tokens in a single epoch

Returns:

The number of tokens in a single epoch

Return type:

int

_get_num_epochs(num_tokens_per_epoch: int) int#

Calculate the number of epochs

Parameters:

num_tokens_per_epoch (int) – The number of tokens in a single epoch

Returns:

The number of epochs

Return type:

int

nemo_automodel.components.datasets.llm.megatron.gpt_dataset._build_document_index(
documents: numpy.ndarray,
num_epochs: int,
numpy_random_state: numpy.random.RandomState,
separate_final_epoch: bool,
) numpy.ndarray#

Build an array with length = num epochs * num documents

Parameters:
  • documents (numpy.ndarray) – The subset of exposed document indices

  • num_epochs (int) – The number of epochs

  • numpy_random_state (numpy.random.RandomState) – The NumPy random state

  • separate_final_epoch (bool) – Whether to exclude the last epoch from the global shuffle

Returns:

The document index

Return type:

numpy.ndarray
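
A behavioral sketch of the documented contract, not the module's exact implementation: tile the exposed document ids across epochs and shuffle, optionally keeping the final epoch out of the global shuffle.

```python
import numpy

# Sketch under stated assumptions; mirrors the documented behavior above.
def build_document_index_sketch(documents, num_epochs, rng, separate_final_epoch):
    if not separate_final_epoch or num_epochs == 1:
        index = numpy.tile(documents, num_epochs)
        rng.shuffle(index)
        return index
    # Shuffle the first num_epochs - 1 epochs and the final epoch independently.
    head = build_document_index_sketch(documents, num_epochs - 1, rng, False)
    tail = build_document_index_sketch(documents, 1, rng, False)
    return numpy.concatenate((head, tail))

rng = numpy.random.RandomState(1234)
print(build_document_index_sketch(numpy.array([10, 11, 12]), 2, rng, True))
# e.g. [11 10 12 12 10 11] -- each epoch contains every exposed document once
```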

nemo_automodel.components.datasets.llm.megatron.gpt_dataset._build_shuffle_index(
num_samples: int,
total_size: int,
numpy_random_state: numpy.random.RandomState,
) numpy.ndarray#

Build the range [0, size) and shuffle

Parameters:
  • num_samples (int) – The size of the first shuffle range [0, num_samples)

  • total_size (int) – The size of the entire index. If larger than ‘num_samples’, it defines the second shuffle range [num_samples, total_size)

  • numpy_random_state (numpy.random.RandomState) – The NumPy random state

Returns:

The shuffle index

Return type:

numpy.ndarray
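
A behavioral sketch of the two-range shuffle described above: [0, num_samples) and, when total_size is larger, [num_samples, total_size) are shuffled independently, so indices from the second range never migrate into the first.

```python
import numpy

# Sketch of the documented two-range shuffle, not the module's exact code.
def build_shuffle_index_sketch(num_samples, total_size, rng):
    head = numpy.arange(num_samples, dtype=numpy.int64)
    rng.shuffle(head)
    if num_samples == total_size:
        return head
    tail = numpy.arange(num_samples, total_size, dtype=numpy.int64)
    rng.shuffle(tail)
    return numpy.concatenate((head, tail))

rng = numpy.random.RandomState(1234)
print(build_shuffle_index_sketch(5, 8, rng))
# e.g. [2 1 4 0 3 6 5 7] -- the last three entries stay within [5, 8)
```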

nemo_automodel.components.datasets.llm.megatron.gpt_dataset._get_ltor_masks_and_position_ids(
data: torch.Tensor,
eod_token: int,
reset_position_ids: bool,
reset_attention_mask: bool,
eod_mask_loss: bool,
create_attention_mask: bool,
)#

Build masks and position IDs for a left-to-right model.

Parameters:
  • data (torch.Tensor) – The data tensor that holds the tokens from the dataset

  • eod_token (int) – The ID of the token that is considered the EOD

  • reset_position_ids (bool) – Switch to reset the document position IDs

  • reset_attention_mask (bool) – Switch to reset the attention mask

  • eod_mask_loss (bool) – Switch to enable the EOD mask loss

  • create_attention_mask (bool) – Switch to enable attention mask generation. Can be disabled if the attention kernel generates the mask by itself.

Returns:

torch.Tensor: The attention mask to be used in attention

torch.Tensor: The mask used to weight the loss during training

torch.Tensor: The position IDs of the tokens

Return type:

Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
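
A simplified sketch of the default path (no per-document resets), under the assumption that the attention mask follows the Megatron convention in which True marks positions to mask out:

```python
import torch

# Sketch only: the real function additionally handles reset_position_ids and
# reset_attention_mask per document.
def ltor_masks_and_position_ids_sketch(data, eod_token, eod_mask_loss, create_attention_mask):
    seq_length = data.numel()
    attention_mask = None
    if create_attention_mask:
        ones = torch.tril(torch.ones(seq_length, seq_length))
        # Assumed convention: True marks masked-out positions, i.e. everything
        # above the diagonal for a left-to-right model.
        attention_mask = (ones < 0.5).unsqueeze(0)
    loss_mask = torch.ones(seq_length, dtype=torch.float)
    if eod_mask_loss:
        loss_mask[data == eod_token] = 0.0  # do not train on EOD positions
    position_ids = torch.arange(seq_length, dtype=torch.long)
    return attention_mask, loss_mask, position_ids

tokens = torch.tensor([5, 9, 2, 7, 2])  # 2 is the assumed EOD id here
mask, loss_mask, position_ids = ltor_masks_and_position_ids_sketch(tokens, 2, True, True)
```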

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.normalize(weights: list[float]) list[float]#

Do non-exponentiated normalization

Parameters:

weights (List[float]) – The weights

Returns:

The normalized weights

Return type:

List[float]
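
A minimal sketch of the behavior: divide each weight by the sum, with no exponentiation (contrast a softmax, which would exponentiate first):

```python
# Sketch of the documented behavior, not the module's exact code.
def normalize_sketch(weights):
    total = sum(weights)
    return [w / total for w in weights]

print(normalize_sketch([0.3, 0.7, 1.0]))  # [0.15, 0.35, 0.5]
```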