nemo_automodel.components.datasets.llm.megatron.gpt_dataset

Module Contents

Classes

Name	Description
`BlendedMegatronDatasetConfig`	Configuration object for Megatron Core datasets
`GPTDataset`	The base GPT dataset
`GPTDatasetConfig`	Configuration object for Megatron Core GPT datasets
`Split`	Dataset split identifiers used by Megatron GPT datasets.

Functions

Name	Description
`_build_document_index`	Build an array with length = num epochs * num documents
`_build_shuffle_index`	Build the range [0, size) and shuffle
`_get_ltor_masks_and_position_ids`	Build masks and position id for left to right model.
`convert_split_vector_to_split_matrix`	Build the split matrix from one or optionally two contributing split vectors.
`normalize`	Do non-exponentiated normalization
`parse_and_normalize_split`	Parse the dataset split ratios from a string

Data

_PAD_TOKEN_ID

logger

API

class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.BlendedMegatronDatasetConfig(
    random_seed: int,
    sequence_length: int,
    blend: typing.Optional[typing.Tuple[typing.List[str], typing.Optional[typing.List[float]]]] = None,
    blend_per_split: typing.Optional[typing.List[typing.Optional[typing.Tuple[typing.List[str], typing.Optional[typing.List[float]]]]]] = None,
    split: typing.Optional[str] = None,
    num_dataset_builder_threads: int = 1,
    path_to_cache: typing.Optional[str] = None,
    mmap_bin_files: bool = True,
    tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None,
    mid_level_dataset_surplus: float = 0.005,
    object_storage_config: typing.Optional[nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig] = None
)

Dataclass

Configuration object for Megatron Core datasets

blend

Optional[Tuple[List[str], Optional[List[float]]]] = None

The blend, consisting of a list of dataset prefixes and optionally a list of dataset weights. For example, [[“dataset-path1”, “dataset-path2”], [0.3, 0.7]]. When the weights are None, they are inferred from the lengths of the contributing datasets. Not to be used with ‘blend_per_split’. Defaults to None.

blend_per_split

Optional[List[Optional[Tuple[List[str], Optional[List[float]]]]]] = None

A set of blends, as defined above, one for each split distribution. Not to be used with ‘blend’. Defaults to None.

mid_level_dataset_surplus

float = 0.005

The sample surplus to build for the mid-level dataset(s). Defaults arbitrarily to 0.005. This value is irrelevant for single-source data blends. It may need to be increased if the top-level dataset oversamples the mid-level dataset(s). It may be set to 0.0 in the future if the top-level dataset is constrained not to oversample the mid-level dataset(s).

mmap_bin_files

bool = True

Whether to mmap the .bin files or use file pointers.

mock

bool = field(init=False, default=False)

Whether to bypass real data loading and validation in favor of mock data generation. Created automatically from ‘blend’ and ‘blend_per_split’. Not to be passed in to the constructor.

num_dataset_builder_threads

int = 1

The number of threads to use for dataset building.

object_storage_config

Optional[ObjectStorageConfig] = None

When set, the .idx files are downloaded to path_to_idx_cache and .bin files are streamed from S3/MSC via chunked GETs. mmap_bin_files is automatically overridden to False.

path_to_cache

Optional[str] = None

Where all reusable dataset indices are to be cached.

random_seed

int

The seed for all RNG during dataset creation.

sequence_length

int

The sequence length.

split

Optional[str] = None

The split string, a comma separated weighting for the dataset splits when drawing samples from a single distribution. Not to be used with ‘blend_per_split’. Defaults to None.

split_matrix

Optional[List[Tuple[float, float]]] = field(init=False, default=None)

The split matrix consisting of non-overlapping book-ends of each split in order. For more information, refer to ‘convert_split_vector_to_split_matrix’. Created automatically from ‘split’. Not to be passed in to the constructor.

tokenizer

Optional[PreTrainedTokenizerBase] = None

The PreTrainedTokenizerBase instance. Required for datasets that do online tokenization.

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.BlendedMegatronDatasetConfig.__post_init__() -> None

Do asserts and set fields post init

class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset(
    indexed_dataset: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset,
    dataset_path: typing.Optional[str],
    indexed_indices: numpy.ndarray,
    num_samples: typing.Optional[int],
    index_split: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.Split,
    config: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig
)

Bases: Dataset

The base GPT dataset

Parameters:

indexed_dataset

IndexedDataset

The IndexedDataset around which to build the GPTDataset

dataset_path

Optional[str]

The real path on disk to the dataset, for bookkeeping

indexed_indices

numpy.ndarray

The set of document indices to expose

num_samples

Optional[int]

The number of samples to draw from the indexed dataset. When None, build as many samples as correspond to one epoch.

index_split

Split

The indexed_indices Split

config

GPTDatasetConfig

The config

_eos_token_id

_pad_eos_overlap

_pad_token_id

= self.config.tokenizer.pad_token_id

masks_and_position_ids_are_cacheable

unique_description

unique_description_hash

unique_identifiers

= OrderedDict()

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset.__getitem__(
    idx: typing.Optional[int]
) -> dict[str, torch.Tensor]

Abstract method implementation

Parameters:

idx

Optional[int]

The index into the dataset

Returns: dict[str, torch.Tensor]

dict[str, torch.Tensor]: The sample information wrapped in a dictionary

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset.__len__() -> int

Abstract method implementation

Returns: int

The effective length of the dataset, capped by num_samples when provided

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset._build_document_sample_shuffle_indices() -> typing.Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

Build the document index, the sample index, and the shuffle index

Returns: Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]: The document index, the sample index, and the shuffle index

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset._get_num_epochs(
    num_tokens_per_epoch: int
) -> int

Calculate the number of epochs

Parameters:

num_tokens_per_epoch

int

The number of tokens in a single epoch

Returns: int

The number of epochs
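
The epoch calculation can be illustrated with a small standalone sketch. It assumes the count is the smallest number of passes over the data that covers `num_samples * sequence_length` tokens, plus one token when an extra token is drawn; `num_epochs_sketch` and its parameters are hypothetical names, and the actual method may differ in detail.

```python
from typing import Optional


def num_epochs_sketch(
    num_tokens_per_epoch: int,
    num_samples: Optional[int],
    sequence_length: int,
    add_extra_token: bool = True,
) -> int:
    """Hypothetical helper: smallest epoch count whose tokens cover the request."""
    if num_tokens_per_epoch <= 0:
        raise ValueError("num_tokens_per_epoch must be positive")
    if num_samples is None:
        # No explicit sample budget: assume a single pass over the data.
        return 1
    # Assumption: the request is num_samples sequences, plus one trailing
    # token when inputs and labels are shifted views of the same span.
    tokens_requested = num_samples * sequence_length + int(add_extra_token)
    num_epochs = 1
    num_tokens = num_tokens_per_epoch
    while num_tokens < tokens_requested:
        num_epochs += 1
        num_tokens += num_tokens_per_epoch
    return num_epochs


print(num_epochs_sketch(1000, 30, 100))    # 3001 tokens requested -> 4
print(num_epochs_sketch(1000, None, 100))  # no sample budget -> 1
```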

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset._get_num_tokens_per_epoch() -> int

Calculate the number of tokens in a single epoch

Returns: int

The number of tokens in a single epoch

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset._key_config_attributes() -> typing.List[str]

staticmethod

Return all config attributes which contribute to uniquely identifying the dataset.

These attributes will be used to build a uniquely identifying string and MD5 hash which will be used to cache/load dataset resources from run to run.

Returns: List[str]

List[str]: The key config attributes
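
As a rough illustration of how key attributes could become a cache key, the sketch below serializes a mapping of identifiers to a string and takes its MD5 digest. The identifier keys and the helper name `describe_and_hash` are hypothetical; the serialization format used by GPTDataset itself is not shown on this page.

```python
import hashlib
import json
from collections import OrderedDict
from typing import Mapping, Tuple


def describe_and_hash(identifiers: Mapping[str, object]) -> Tuple[str, str]:
    """Hypothetical helper: build a description string and its MD5 hex digest."""
    # A stable serialization: the same identifiers always give the same string.
    description = json.dumps(identifiers, indent=4, default=str)
    # MD5 is used only as a fingerprint here, not for security.
    digest = hashlib.md5(description.encode("utf-8"), usedforsecurity=False).hexdigest()
    return description, digest


# Illustrative identifiers only; the real attribute list comes from
# _key_config_attributes().
ids = OrderedDict(random_seed=1234, sequence_length=2048, split="99,1,0")
description, digest = describe_and_hash(ids)
print(digest)
```

Two runs with identical identifiers produce the same digest, so cached index files named after the digest can be found again; changing any identifier changes the digest.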

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset._query_document_sample_shuffle_indices(
    idx: int
) -> typing.Tuple[numpy.ndarray, numpy.ndarray]

Get the text (token ids) and document ids for a given index

Parameters:

idx

int

The index into the dataset

Returns: Tuple[numpy.ndarray, numpy.ndarray]

Tuple[numpy.ndarray, numpy.ndarray]: The text ids and document ids

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset.build_low_level_dataset(
    dataset_path: str,
    config: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig
) -> nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset

staticmethod

Abstract method implementation

Parameters:

dataset_path

str

The real path prefix to the IndexedDataset .bin and .idx files

config

GPTDatasetConfig

The config

Returns: IndexedDataset

The underlying IndexedDataset

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset.numel_low_level_dataset(
    low_level_dataset: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset
) -> int

staticmethod

Abstract method implementation

For GPT, the underlying IndexedDataset should be split by sequence, as opposed to, say, BERT, which should be split by document

Parameters:

low_level_dataset

IndexedDataset

The underlying IndexedDataset

Returns: int

The number of unique elements in the underlying IndexedDataset

class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig(
    random_seed: int,
    sequence_length: int,
    blend: typing.Optional[typing.Tuple[typing.List[str], typing.Optional[typing.List[float]]]] = None,
    blend_per_split: typing.Optional[typing.List[typing.Optional[typing.Tuple[typing.List[str], typing.Optional[typing.List[float]]]]]] = None,
    split: typing.Optional[str] = None,
    num_dataset_builder_threads: int = 1,
    path_to_cache: typing.Optional[str] = None,
    mmap_bin_files: bool = True,
    tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None,
    mid_level_dataset_surplus: float = 0.005,
    object_storage_config: typing.Optional[nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig] = None,
    reset_position_ids: typing.Optional[bool] = None,
    reset_attention_mask: typing.Optional[bool] = None,
    eod_mask_loss: typing.Optional[bool] = None,
    create_attention_mask: bool = True,
    drop_last_partial_validation_sequence: bool = True,
    add_extra_token_to_sequence: bool = True
)

Dataclass

Bases: BlendedMegatronDatasetConfig

Configuration object for Megatron Core GPT datasets

add_extra_token_to_sequence

bool = True

Option to draw sequences with one extra token to ensure the sample input tokens and sample output tokens are both of the desired sequence length
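
The reason for the extra token can be seen in a small sketch: a span of `sequence_length + 1` tokens yields inputs and next-token labels that are both `sequence_length` long. The helper name `split_inputs_and_labels` is hypothetical and the shift shown is the usual next-token convention, assumed here rather than taken from this page.

```python
from typing import List, Tuple


def split_inputs_and_labels(sample: List[int]) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: shift one span into (inputs, next-token labels)."""
    # Inputs drop the last token; labels drop the first.
    return sample[:-1], sample[1:]


sequence_length = 4
sample = [11, 12, 13, 14, 15]  # sequence_length + 1 tokens drawn
tokens, labels = split_inputs_and_labels(sample)
print(tokens)  # [11, 12, 13, 14]
print(labels)  # [12, 13, 14, 15]
```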

create_attention_mask

bool = True

Option to enable attention mask generation. Can be disabled if the attention kernel generates masks itself.

drop_last_partial_validation_sequence

bool = True

Option to drop the last partial validation sequence

eod_mask_loss

Optional[bool] = None

Option to enable the EOD mask loss

reset_attention_mask

Optional[bool] = None

Option to reset the attention mask from the dataset

reset_position_ids

Optional[bool] = None

Option to reset the position IDs in the dataset at an interval

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig.__post_init__() -> None

Do asserts and set fields post init

class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.Split

Bases: enum.Enum

Dataset split identifiers used by Megatron GPT datasets.

test

= 2

train

= 0

valid

= 1

nemo_automodel.components.datasets.llm.megatron.gpt_dataset._build_document_index(
    documents: numpy.ndarray,
    num_epochs: int,
    numpy_random_state: numpy.random.RandomState,
    separate_final_epoch: bool
) -> numpy.ndarray

Build an array with length = num epochs * num documents

Parameters:

documents

numpy.ndarray

the subset of exposed document indices

num_epochs

int

The number of epochs

numpy_random_state

numpy.random.RandomState

The NumPy random state

separate_final_epoch

bool

Whether to exclude the last epoch from the global shuffle

Returns: numpy.ndarray

numpy.ndarray: The document index
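
A dependency-free sketch of the idea follows. It uses Python's `random.Random` and lists in place of `numpy.random.RandomState` and arrays, and it assumes that `separate_final_epoch` means the last epoch is shuffled on its own and appended after the globally shuffled earlier epochs. `build_document_index_sketch` is a hypothetical name.

```python
import random
from typing import List, Sequence


def build_document_index_sketch(
    documents: Sequence[int],
    num_epochs: int,
    rng: random.Random,
    separate_final_epoch: bool,
) -> List[int]:
    """Hypothetical helper: repeat the documents once per epoch and shuffle."""
    if not separate_final_epoch or num_epochs == 1:
        # One global shuffle across all epochs.
        index = list(documents) * num_epochs
        rng.shuffle(index)
        return index
    # Shuffle the first num_epochs - 1 epochs together, then the final
    # epoch separately, so the final epoch stays a full pass at the end.
    first = build_document_index_sketch(documents, num_epochs - 1, rng, False)
    last = build_document_index_sketch(documents, 1, rng, False)
    return first + last


doc_index = build_document_index_sketch([3, 5, 8], 3, random.Random(0), True)
print(len(doc_index))  # 3 epochs * 3 documents = 9
```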

nemo_automodel.components.datasets.llm.megatron.gpt_dataset._build_shuffle_index(
    num_samples: int,
    total_size: int,
    numpy_random_state: numpy.random.RandomState
) -> numpy.ndarray

Build the range [0, size) and shuffle

Parameters:

num_samples

int

The size of the first shuffle range [0, num_samples)

total_size

int

The size of the entire index. If larger than ‘num_samples’, it defines the second shuffle range [num_samples, total_size)

numpy_random_state

numpy.random.RandomState

The NumPy random state

Returns: numpy.ndarray

numpy.ndarray: The shuffle index
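
The two shuffle ranges described above can be sketched as follows, again with `random.Random` and lists standing in for `numpy.random.RandomState` and arrays. `build_shuffle_index_sketch` is a hypothetical name, and keeping the two ranges separate (never mixing their entries) is the assumption being illustrated.

```python
import random
from typing import List


def build_shuffle_index_sketch(
    num_samples: int, total_size: int, rng: random.Random
) -> List[int]:
    """Hypothetical helper: shuffle [0, num_samples) and [num_samples, total_size) separately."""
    first = list(range(num_samples))
    rng.shuffle(first)
    if num_samples == total_size:
        return first
    # The second range is shuffled independently and appended, so entries
    # from the two ranges never interleave.
    last = list(range(num_samples, total_size))
    rng.shuffle(last)
    return first + last


shuffle_index = build_shuffle_index_sketch(6, 10, random.Random(0))
print(sorted(shuffle_index[:6]))  # [0, 1, 2, 3, 4, 5]
print(sorted(shuffle_index[6:]))  # [6, 7, 8, 9]
```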

nemo_automodel.components.datasets.llm.megatron.gpt_dataset._get_ltor_masks_and_position_ids(
    data: torch.Tensor,
    eod_token: int,
    reset_position_ids: bool,
    reset_attention_mask: bool,
    eod_mask_loss: bool,
    create_attention_mask: bool
)

Build masks and position id for left to right model.

Parameters:

data

torch.Tensor

The data tensor that holds the tokens from the dataset

eod_token

int

ID of the token that is considered the EOD

reset_position_ids

bool

Switch to reset the document position IDs

reset_attention_mask

bool

Switch to reset the attention mask

eod_mask_loss

bool

Switch to enable the EOD mask loss

create_attention_mask

bool

Switch to enable attention mask generation. Can be disabled if the attention kernel generates masks itself.

Returns:

torch.Tensor: The attention mask to be used for attention
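
To make the `reset_position_ids` and `eod_mask_loss` switches concrete, here is a simplified list-based sketch rather than the tensor implementation. It assumes that position IDs restart at 0 after each EOD token and that the loss mask is 0.0 at EOD positions; the attention mask is omitted. `ltor_position_ids_and_loss_mask` is a hypothetical name.

```python
from typing import List, Tuple


def ltor_position_ids_and_loss_mask(
    tokens: List[int],
    eod_token: int,
    reset_position_ids: bool,
    eod_mask_loss: bool,
) -> Tuple[List[int], List[float]]:
    """Hypothetical helper: per-token position IDs and loss mask for one sequence."""
    # Assumption: EOD tokens contribute no loss when eod_mask_loss is set.
    loss_mask = [0.0 if (eod_mask_loss and t == eod_token) else 1.0 for t in tokens]
    position_ids = []
    next_position = 0
    for t in tokens:
        position_ids.append(next_position)
        next_position += 1
        # Assumption: the token after an EOD starts a new document at position 0.
        if reset_position_ids and t == eod_token:
            next_position = 0
    return position_ids, loss_mask


tokens = [5, 6, 0, 7, 8, 9, 0, 3]  # 0 plays the role of the EOD token
position_ids, loss_mask = ltor_position_ids_and_loss_mask(tokens, 0, True, True)
print(position_ids)  # [0, 1, 2, 0, 1, 2, 3, 0]
print(loss_mask)     # [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]
```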

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.convert_split_vector_to_split_matrix(
    vector_a: typing.List[float],
    vector_b: typing.Optional[typing.List[float]] = None
) -> typing.List[typing.Optional[typing.Tuple[float, float]]]

Build the split matrix from one or optionally two contributing split vectors.

Ex. a standard conversion:

[0.99, 0.01, 0.0] -> [(0, 0.99), (0.99, 1.0), None]

Ex. a conversion for Retro when Retro pretraining uses a [0.99, 0.01, 0.0] split and Retro preprocessing used a [0.98, 0.02, 0.0] split:

[0.99, 0.01, 0.0], [0.98, 0.02, 0.0] -> [(0, 0.98), (0.99, 1.0), None]

Parameters:

vector_a

List[float]

The primary split vector

vector_b

Optional[List[float]] = None

An optional secondary split vector which constrains the primary split vector. Defaults to None.

Returns: List[Optional[Tuple[float, float]]]

List[Optional[Tuple[float, float]]]: The split matrix consisting of the book-ends of each split in order, with None for an empty split
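
The two examples above can be reproduced with a short standalone sketch. It assumes each split's book-ends are cumulative sums of the vector and that, when a second vector is given, each split keeps only the overlap of the two ranges (or None when there is no overlap). `split_matrix_sketch` is a hypothetical name, not the library function.

```python
from itertools import accumulate
from typing import List, Optional, Tuple


def split_matrix_sketch(
    vector_a: List[float], vector_b: Optional[List[float]] = None
) -> List[Optional[Tuple[float, float]]]:
    """Hypothetical helper: per-split (start, end) book-ends, or None if empty."""
    if vector_b is None:
        vector_b = vector_a
    # [0.99, 0.01, 0.0] -> edges [0.0, 0.99, 1.0, 1.0]
    edges_a = [0.0, *accumulate(vector_a)]
    edges_b = [0.0, *accumulate(vector_b)]
    matrix: List[Optional[Tuple[float, float]]] = []
    for lo_a, hi_a, lo_b, hi_b in zip(edges_a[:-1], edges_a[1:], edges_b[:-1], edges_b[1:]):
        # Keep only the region covered by both vectors for this split.
        lo, hi = max(lo_a, lo_b), min(hi_a, hi_b)
        matrix.append((lo, hi) if lo < hi else None)
    return matrix


print(split_matrix_sketch([0.99, 0.01, 0.0]))
print(split_matrix_sketch([0.99, 0.01, 0.0], [0.98, 0.02, 0.0]))
```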

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.normalize(
    weights: list[float]
) -> list[float]

Do non-exponentiated normalization

Parameters:

weights

List[float]

The weights

Returns: list[float]

List[float]: The normalized weights
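
"Non-exponentiated" is read here as plain division by the sum, as opposed to a softmax. A minimal sketch under that assumption, with the hypothetical name `normalize_sketch`:

```python
from typing import List


def normalize_sketch(weights: List[float]) -> List[float]:
    """Hypothetical helper: scale the weights so they sum to 1 (no softmax)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return [w / total for w in weights]


print(normalize_sketch([1.0, 3.0]))  # [0.25, 0.75]
```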

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.parse_and_normalize_split(
    split: str
) -> typing.List[float]

Parse the dataset split ratios from a string

Parameters:

split

str

The train valid test split string e.g. “99,1,0”

Returns: List[float]

List[float]: The train valid test split ratios e.g. [0.99, 0.01, 0.0]
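
A standalone sketch of the parsing step is shown below. It assumes the string holds up to three non-negative, comma-separated numbers, that missing trailing entries are treated as 0, and that the result is scaled to sum to 1. `parse_split_sketch` and `num_splits` are hypothetical names.

```python
import re
from typing import List


def parse_split_sketch(split: str, num_splits: int = 3) -> List[float]:
    """Hypothetical helper: "99,1,0" -> ratios that sum to 1."""
    values = [float(v) for v in re.findall(r"[.0-9]+", split)]
    if len(values) > num_splits:
        raise ValueError(f"expected at most {num_splits} values, got {len(values)}")
    # Assumption: missing trailing splits get weight 0.
    values += [0.0] * (num_splits - len(values))
    total = sum(values)
    if total == 0:
        raise ValueError("split weights must not all be zero")
    return [v / total for v in values]


print(parse_split_sketch("99,1,0"))
print(parse_split_sketch("8,2"))
```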

nemo_automodel.components.datasets.llm.megatron.gpt_dataset._PAD_TOKEN_ID = -100

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.logger = logging.getLogger(__name__)