nemo_automodel.components.datasets.llm.megatron.gpt_dataset

Module Contents

Classes

| Name | Description |
| --- | --- |
| BlendedMegatronDatasetConfig | Configuration object for Megatron Core datasets |
| GPTDataset | The base GPT dataset |
| GPTDatasetConfig | Configuration object for Megatron Core GPT datasets |
| Split | Dataset split identifiers used by Megatron GPT datasets. |

Functions

| Name | Description |
| --- | --- |
| _build_document_index | Build an array with length = num epochs * num documents |
| _build_shuffle_index | Build the range [0, size) and shuffle |
| _get_ltor_masks_and_position_ids | Build masks and position id for left to right model. |
| convert_split_vector_to_split_matrix | Build the split matrix from one or optionally two contributing split vectors. |
| normalize | Do non-exponentiated normalization |
| parse_and_normalize_split | Parse the dataset split ratios from a string |

Data

_PAD_TOKEN_ID

logger

API

class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.BlendedMegatronDatasetConfig(
random_seed: int,
sequence_length: int,
blend: typing.Optional[typing.Tuple[typing.List[str], typing.Optional[typing.List[float]]]] = None,
blend_per_split: typing.Optional[typing.List[typing.Optional[typing.Tuple[typing.List[str], typing.Optional[typing.List[float]]]]]] = None,
split: typing.Optional[str] = None,
num_dataset_builder_threads: int = 1,
path_to_cache: typing.Optional[str] = None,
mmap_bin_files: bool = True,
tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None,
mid_level_dataset_surplus: float = 0.005,
object_storage_config: typing.Optional[nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig] = None
)
Dataclass

Configuration object for Megatron Core datasets

blend
Optional[Tuple[List[str], Optional[List[float]]]] = None

The blend, consisting of a list of dataset prefixes and optionally a list of dataset weights. For example, [[“dataset-path1”, “dataset-path2”], [0.3, 0.7]]. When the weights are None, they are inferred from the lengths of the contributing datasets. Not to be used with ‘blend_per_split’. Defaults to None.

blend_per_split
Optional[List[Optional[Tuple[List[str], Optional[List[float]]]]]] = None

A set of blends, as defined above, one for each split distribution. Not to be used with ‘blend’. Defaults to None.

mid_level_dataset_surplus
float = 0.005

The sample surplus to build for the mid-level dataset(s). Defaults arbitrarily to 0.005. This value is irrelevant for single-source data blends. It may need to be increased if the top-level dataset oversamples the mid-level dataset(s). It may be set to 0.0 in the future if the top-level dataset is constrained to not oversample the mid-level dataset(s).

mmap_bin_files
bool = True

Whether to mmap the .bin files or use file pointers.

mock
bool = field(init=False, default=False)

Whether to bypass real data loading and validation in favor of mock data generation. Created automatically from ‘blend’ and ‘blend_per_split’. Not to be passed in to the constructor.

num_dataset_builder_threads
int = 1

The number of threads to use for dataset building.

object_storage_config
Optional[ObjectStorageConfig] = None

When set, the .idx files are downloaded to path_to_idx_cache and .bin files are streamed from S3/MSC via chunked GETs. mmap_bin_files is automatically overridden to False.

path_to_cache
Optional[str] = None

Where all reusable dataset indices are to be cached.

random_seed
int

The seed for all RNG during dataset creation.

sequence_length
int

The sequence length.

split
Optional[str] = None

The split string, a comma-separated weighting for the dataset splits when drawing samples from a single distribution. Not to be used with ‘blend_per_split’. Defaults to None.

split_matrix
Optional[List[Tuple[float, float]]] = field(init=False, default=None)

The split matrix consisting of non-overlapping book-ends of each split in order. For more information, refer to ‘convert_split_vector_to_split_matrix’. Created automatically from ‘split’. Not to be passed in to the constructor.

tokenizer
Optional[PreTrainedTokenizerBase] = None

The PreTrainedTokenizerBase instance. Required for datasets that do online tokenization.

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.BlendedMegatronDatasetConfig.__post_init__() -> None

Do asserts and set fields post init
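The constraints stated in the field descriptions above can be sketched as follows. This is only an illustration, not the library's `__post_init__`: the `SketchConfig` class is hypothetical, and the rule that `mock` becomes true when neither `blend` nor `blend_per_split` is supplied is an assumption read from the `mock` description.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Blend = Tuple[List[str], Optional[List[float]]]


@dataclass
class SketchConfig:
    """Hypothetical stand-in that mirrors only the documented constraints."""

    blend: Optional[Blend] = None
    blend_per_split: Optional[List[Optional[Blend]]] = None
    split: Optional[str] = None
    mock: bool = field(init=False, default=False)

    def __post_init__(self) -> None:
        if self.blend_per_split is not None:
            # 'blend' and 'split' are documented as not to be used
            # together with 'blend_per_split'.
            if self.blend is not None:
                raise ValueError("blend cannot be combined with blend_per_split")
            if self.split is not None:
                raise ValueError("split cannot be combined with blend_per_split")
        # Assumption: with no data source configured, mock data is generated.
        self.mock = self.blend is None and self.blend_per_split is None


cfg = SketchConfig(
    blend=(["dataset-path1", "dataset-path2"], [0.3, 0.7]),
    split="99,1,0",
)
```

Because `mock` is declared with `init=False`, it cannot be passed to the constructor, which matches the note in its description.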

class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset(
indexed_dataset: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset,
dataset_path: typing.Optional[str],
indexed_indices: numpy.ndarray,
num_samples: typing.Optional[int],
index_split: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.Split,
config: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig
)

Bases: Dataset

The base GPT dataset

Parameters:

indexed_dataset
IndexedDataset

The IndexedDataset around which to build the GPTDataset

dataset_path
Optional[str]

The real path on disk to the dataset, for bookkeeping

indexed_indices
numpy.ndarray

The set of document indices to expose

num_samples
Optional[int]

The number of samples to draw from the indexed dataset. When None, build as many samples as correspond to one epoch.

index_split
Split

The indexed_indices Split

config
GPTDatasetConfig

The config

_eos_token_id
_pad_eos_overlap
_pad_token_id
= self.config.tokenizer.pad_token_id
masks_and_position_ids_are_cacheable
unique_description
unique_description_hash
unique_identifiers
= OrderedDict()
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset.__getitem__(
idx: typing.Optional[int]
) -> dict[str, torch.Tensor]

Abstract method implementation

Parameters:

idx
Optional[int]

The index into the dataset

Returns: dict[str, torch.Tensor]

dict[str, torch.Tensor]: The sample information wrapped in a dictionary

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset.__len__() -> int

Abstract method implementation

Returns: int

The effective length of the dataset, capped by num_samples when provided

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset._build_document_sample_shuffle_indices() -> typing.Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

Build the document index, the sample index, and the shuffle index

Returns: Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]: The document index, the sample index, and the shuffle index

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset._get_num_epochs(
num_tokens_per_epoch: int
) -> int

Calculate the number of epochs

Parameters:

num_tokens_per_epoch
int

The number of tokens in a single epoch

Returns: int

The number of epochs
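One plausible way to compute this value is to keep adding epochs until their tokens cover the requested samples. The sketch below is a free function with hypothetical parameter names, not the method itself; in particular, counting one extra token when `add_extra_token_to_sequence` is enabled is an assumption.

```python
from typing import Optional


def sketch_num_epochs(
    num_tokens_per_epoch: int,
    num_samples: Optional[int],
    sequence_length: int,
    add_extra_token: bool = True,
) -> int:
    """Smallest number of epochs whose tokens cover the requested samples."""
    if num_tokens_per_epoch <= 0:
        raise ValueError("num_tokens_per_epoch must be positive")
    if num_samples is None:
        # No sample count requested: a single epoch.
        return 1
    tokens_requested = num_samples * sequence_length + int(add_extra_token)
    num_epochs = 1
    num_tokens = num_tokens_per_epoch
    while num_tokens < tokens_requested:
        num_epochs += 1
        num_tokens += num_tokens_per_epoch
    return num_epochs


# 10 samples of 256 tokens (+1) need 2561 tokens, i.e. three 1000-token epochs.
epochs = sketch_num_epochs(1000, num_samples=10, sequence_length=256)
```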

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset._get_num_tokens_per_epoch() -> int

Calculate the number of tokens in a single epoch

Returns: int

The number of tokens in a single epoch

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset._key_config_attributes() -> typing.List[str]
staticmethod

Return all config attributes which contribute to uniquely identifying the dataset.

These attributes will be used to build a uniquely identifying string and MD5 hash which will be used to cache/load dataset resources from run to run.

Returns: List[str]

List[str]: The key config attributes
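As a rough illustration of the caching idea, the sketch below serializes an ordered mapping of identifying attributes to a string and hashes it with MD5. The function name and the example keys are hypothetical, and JSON is assumed as the serialization format.

```python
import hashlib
import json
from collections import OrderedDict
from typing import Tuple


def sketch_unique_description(identifiers: OrderedDict) -> Tuple[str, str]:
    """Return (description string, MD5 hex digest) for a set of identifiers."""
    description = json.dumps(identifiers, indent=4)
    digest = hashlib.md5(description.encode("utf-8")).hexdigest()
    return description, digest


# Hypothetical identifiers; the real set comes from _key_config_attributes().
identifiers = OrderedDict(
    [("dataset_path", "dataset-path1"), ("random_seed", 1234), ("sequence_length", 2048)]
)
description, digest = sketch_unique_description(identifiers)
```

Two runs with identical identifiers produce the same digest, so cached index files named after it can be found again; changing any attribute changes the digest and triggers a rebuild.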

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset._query_document_sample_shuffle_indices(
idx: int
) -> typing.Tuple[numpy.ndarray, numpy.ndarray]

Get the text (token ids) and document ids for a given index

Parameters:

idx
int

The index into the dataset

Returns: Tuple[numpy.ndarray, numpy.ndarray]

Tuple[numpy.ndarray, numpy.ndarray]: The text ids and document ids
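To show why a single sample can carry several document ids, the simplified sketch below treats the documents, taken in document-index order, as one token stream and cuts fixed-length windows from it. It is a conceptual model with hypothetical names; it omits the sample and shuffle indices and does not reproduce the actual lookup.

```python
from typing import Dict, List, Tuple


def sketch_query_sample(
    documents: Dict[int, List[int]],
    document_index: List[int],
    sequence_length: int,
    idx: int,
    extra_token: int = 1,
) -> Tuple[List[int], List[int]]:
    """Return the token ids and contributing document ids for sample idx."""
    start = idx * sequence_length
    stop = start + sequence_length + extra_token
    tokens: List[int] = []
    doc_ids: List[int] = []
    position = 0  # offset of the current document within the stream
    for doc_id in document_index:
        doc = documents[doc_id]
        lo = max(start - position, 0)
        hi = min(stop - position, len(doc))
        if lo < hi:
            # This document overlaps the requested window.
            tokens.extend(doc[lo:hi])
            doc_ids.append(doc_id)
        position += len(doc)
        if position >= stop:
            break
    return tokens, doc_ids


documents = {0: [1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9]}
tokens, doc_ids = sketch_query_sample(documents, [1, 0, 2], sequence_length=3, idx=1)
```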

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset.build_low_level_dataset(
dataset_path: str,
config: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig
) -> nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset
staticmethod

Abstract method implementation

Parameters:

dataset_path
str

The real path prefix to the IndexedDataset .bin and .idx files

config
GPTDatasetConfig

The config

Returns: IndexedDataset

The underlying IndexedDataset

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset.numel_low_level_dataset(
low_level_dataset: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset
) -> int
staticmethod

Abstract method implementation

For GPT, the underlying IndexedDataset should be split by sequence, as opposed to, say, BERT, which should be split by document

Parameters:

low_level_dataset
IndexedDataset

The underlying IndexedDataset

Returns: int

The number of unique elements in the underlying IndexedDataset

class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig(
random_seed: int,
sequence_length: int,
blend: typing.Optional[typing.Tuple[typing.List[str], typing.Optional[typing.List[float]]]] = None,
blend_per_split: typing.Optional[typing.List[typing.Optional[typing.Tuple[typing.List[str], typing.Optional[typing.List[float]]]]]] = None,
split: typing.Optional[str] = None,
num_dataset_builder_threads: int = 1,
path_to_cache: typing.Optional[str] = None,
mmap_bin_files: bool = True,
tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None,
mid_level_dataset_surplus: float = 0.005,
object_storage_config: typing.Optional[nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig] = None,
reset_position_ids: typing.Optional[bool] = None,
reset_attention_mask: typing.Optional[bool] = None,
eod_mask_loss: typing.Optional[bool] = None,
create_attention_mask: bool = True,
drop_last_partial_validation_sequence: bool = True,
add_extra_token_to_sequence: bool = True
)
Dataclass

Bases: BlendedMegatronDatasetConfig

Configuration object for Megatron Core GPT datasets

add_extra_token_to_sequence
bool = True

Option to draw sequences with one extra token to ensure the sample input tokens and sample output tokens are both of the desired sequence length

create_attention_mask
bool = True

Option to enable attention mask generation. Can be disabled if the attention kernel generates masks itself.

drop_last_partial_validation_sequence
bool = True

Option to drop the last partial validation sequence

eod_mask_loss
Optional[bool] = None

Option to enable the EOD mask loss

reset_attention_mask
Optional[bool] = None

Option to reset the attention mask from the dataset

reset_position_ids
Optional[bool] = None

Option to reset the position IDs in the dataset at an interval

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig.__post_init__() -> None

Do asserts and set fields post init
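The purpose of `add_extra_token_to_sequence` can be illustrated with a next-token-prediction shift: drawing `sequence_length + 1` tokens lets both the inputs and the labels have exactly `sequence_length` entries. The helper name below is hypothetical and the snippet only demonstrates the arithmetic.

```python
from typing import List, Tuple


def split_inputs_and_labels(tokens: List[int]) -> Tuple[List[int], List[int]]:
    """Shift by one position: labels[i] is the token that follows inputs[i]."""
    return tokens[:-1], tokens[1:]


sequence_length = 4
drawn = [10, 11, 12, 13, 14]  # sequence_length + 1 tokens
inputs, labels = split_inputs_and_labels(drawn)
```

Without the extra token, the same shift would leave only `sequence_length - 1` input and label positions.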

class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.Split

Bases: enum.Enum

Dataset split identifiers used by Megatron GPT datasets.

test
= 2
train
= 0
valid
= 1
nemo_automodel.components.datasets.llm.megatron.gpt_dataset._build_document_index(
documents: numpy.ndarray,
num_epochs: int,
numpy_random_state: numpy.random.RandomState,
separate_final_epoch: bool
) -> numpy.ndarray

Build an array with length = num epochs * num documents

Parameters:

documents
numpy.ndarray

The subset of exposed document indices

num_epochs
int

The number of epochs

numpy_random_state
numpy.random.RandomState

The NumPy random state

separate_final_epoch
bool

Whether to exclude the last epoch from the global shuffle

Returns: numpy.ndarray

numpy.ndarray: The document index
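The effect of `separate_final_epoch` can be sketched in pure Python. The version below uses `random.Random` as a stand-in for `numpy.random.RandomState` and plain lists instead of arrays, so it illustrates the shuffling scheme under those assumptions rather than the actual implementation.

```python
import random
from typing import List, Sequence


def sketch_document_index(
    documents: Sequence[int],
    num_epochs: int,
    rng: random.Random,
    separate_final_epoch: bool,
) -> List[int]:
    """Repeat the documents num_epochs times and shuffle them."""
    if not separate_final_epoch or num_epochs == 1:
        # One global shuffle across all epochs.
        index = list(documents) * num_epochs
        rng.shuffle(index)
        return index
    # Shuffle the first num_epochs - 1 epochs together, the last one on its own.
    first = sketch_document_index(documents, num_epochs - 1, rng, False)
    last = sketch_document_index(documents, 1, rng, False)
    return first + last


doc_index = sketch_document_index([3, 5, 7], 3, random.Random(0), True)
```

With `separate_final_epoch=True`, the last `len(documents)` entries contain each document exactly once, so a partially consumed final epoch does not mix with earlier ones.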

nemo_automodel.components.datasets.llm.megatron.gpt_dataset._build_shuffle_index(
num_samples: int,
total_size: int,
numpy_random_state: numpy.random.RandomState
) -> numpy.ndarray

Build the range [0, size) and shuffle

Parameters:

num_samples
int

The size of the first shuffle range [0, num_samples)

total_size
int

The size of the entire index. If larger than ‘num_samples’, it defines the second shuffle range [num_samples, total_size)

numpy_random_state
numpy.random.RandomState

The NumPy random state

Returns: numpy.ndarray

numpy.ndarray: The shuffle index
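The two shuffle ranges described in the parameters can be sketched as below. As an assumption for portability, `random.Random` stands in for `numpy.random.RandomState` and lists stand in for arrays; the function name is hypothetical.

```python
import random
from typing import List


def sketch_shuffle_index(num_samples: int, total_size: int, rng: random.Random) -> List[int]:
    """Shuffle [0, num_samples) and [num_samples, total_size) independently."""
    first = list(range(num_samples))
    rng.shuffle(first)
    if num_samples == total_size:
        return first
    last = list(range(num_samples, total_size))
    rng.shuffle(last)
    return first + last


shuffle_index = sketch_shuffle_index(4, 6, random.Random(0))
```

Indices never cross the boundary at `num_samples`, so the first range is fully consumed before any index from the second range appears.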

nemo_automodel.components.datasets.llm.megatron.gpt_dataset._get_ltor_masks_and_position_ids(
data: torch.Tensor,
eod_token: int,
reset_position_ids: bool,
reset_attention_mask: bool,
eod_mask_loss: bool,
create_attention_mask: bool
)

Build masks and position id for left to right model.

Parameters:

data
torch.Tensor

The data tensor that holds the tokens from the dataset

eod_token
int

ID of the token that is considered the EOD

reset_position_ids
bool

Switch to reset the document position IDs

reset_attention_mask
bool

Switch to reset the attention mask

eod_mask_loss
bool

Switch to enable the EOD mask loss

create_attention_mask
bool

Switch to enable attention mask generation. Can be disabled if the attention kernel generates masks itself.

Returns:

torch.Tensor: Attention mask needed to be used for Attention
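The switches above can be illustrated on plain Python lists instead of tensors. In this sketch the return order, the loss mask and position ids as additional outputs, and the mask convention (`True` means the query position may attend to the key position) are all assumptions; the real function may differ, for example by inverting the mask.

```python
from typing import List, Tuple


def sketch_ltor_masks(
    tokens: List[int],
    eod_token: int,
    reset_position_ids: bool,
    reset_attention_mask: bool,
    eod_mask_loss: bool,
) -> Tuple[List[List[bool]], List[float], List[int]]:
    """Return (can_attend, loss_mask, position_ids) for one sequence."""
    n = len(tokens)
    # Loss mask: 1.0 everywhere, optionally 0.0 on EOD tokens.
    loss_mask = [0.0 if (eod_mask_loss and t == eod_token) else 1.0 for t in tokens]

    position_ids: List[int] = []
    doc_start: List[int] = []  # index at which each token's document begins
    start = 0
    for i, t in enumerate(tokens):
        doc_start.append(start)
        position_ids.append(i - start if reset_position_ids else i)
        if t == eod_token:
            start = i + 1  # the next token opens a new document

    # Causal mask, optionally blocked at document boundaries.
    can_attend = [
        [j <= i and (not reset_attention_mask or j >= doc_start[i]) for j in range(n)]
        for i in range(n)
    ]
    return can_attend, loss_mask, position_ids


can_attend, loss_mask, position_ids = sketch_ltor_masks(
    [5, 6, 0, 7, 8], eod_token=0,
    reset_position_ids=True, reset_attention_mask=True, eod_mask_loss=True,
)
```

In this example token `0` is the EOD, so the two tokens after it restart at position 0 and cannot attend to the first document.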

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.convert_split_vector_to_split_matrix(
vector_a: typing.List[float],
vector_b: typing.Optional[typing.List[float]] = None
) -> typing.List[typing.Optional[typing.Tuple[float, float]]]

Build the split matrix from one or optionally two contributing split vectors.

Ex. a standard conversion:

[0.99, 0.01, 0.0] -> [(0, 0.99), (0.99, 1.0), None]

Ex. a conversion for Retro when Retro pretraining uses a [0.99, 0.01, 0.0] split and Retro preprocessing used a [0.98, 0.02, 0.0] split:

[0.99, 0.01, 0.0], [0.98, 0.02, 0.0] -> [(0, 0.98), (0.99, 1.0), None]

Parameters:

vector_a
List[float]

The primary split vector

vector_b
Optional[List[float]] = None

An optional secondary split vector which constrains the primary split vector. Defaults to None.

Returns: List[Optional[Tuple[float, float]]]

List[Optional[Tuple[float, float]]]: The split matrix consisting of book-ends of each split in order
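Both examples above are consistent with taking, for each split, the overlap between the cumulative ranges of the two vectors. The sketch below implements that reading with a hypothetical function name; it is an interpretation of the documented examples, not the library code.

```python
from itertools import accumulate
from typing import List, Optional, Tuple


def sketch_split_matrix(
    vector_a: List[float],
    vector_b: Optional[List[float]] = None,
) -> List[Optional[Tuple[float, float]]]:
    """Per-split overlap of the cumulative ranges of vector_a and vector_b."""
    if vector_b is None:
        vector_b = vector_a
    # Cumulative boundaries, e.g. [0.99, 0.01, 0.0] -> [0.0, 0.99, 1.0, 1.0]
    edges_a = list(accumulate(vector_a, initial=0.0))
    edges_b = list(accumulate(vector_b, initial=0.0))
    matrix: List[Optional[Tuple[float, float]]] = []
    for (lo_a, hi_a), (lo_b, hi_b) in zip(
        zip(edges_a, edges_a[1:]), zip(edges_b, edges_b[1:])
    ):
        lo, hi = max(lo_a, lo_b), min(hi_a, hi_b)
        # An empty overlap means the split receives no data.
        matrix.append((lo, hi) if lo < hi else None)
    return matrix


standard = sketch_split_matrix([0.99, 0.01, 0.0])
retro = sketch_split_matrix([0.99, 0.01, 0.0], [0.98, 0.02, 0.0])
```

In the two-vector case the train range shrinks to the stricter 0.98 bound while the validation range still starts at 0.99, so no sample falls into both.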

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.normalize(
weights: list[float]
) -> list[float]

Do non-exponentiated normalization

Parameters:

weights
List[float]

The weights

Returns: list[float]

List[float]: The normalized weights
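"Non-exponentiated" most likely means dividing each weight by the sum of all weights, as opposed to a softmax. The sketch below assumes that reading and uses a hypothetical function name.

```python
from typing import List


def sketch_normalize(weights: List[float]) -> List[float]:
    """Scale the weights so that they sum to 1, without exponentiation."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return [w / total for w in weights]


normalized = sketch_normalize([1.0, 1.0, 2.0])
```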

nemo_automodel.components.datasets.llm.megatron.gpt_dataset.parse_and_normalize_split(
split: str
) -> typing.List[float]

Parse the dataset split ratios from a string

Parameters:

split
str

The train valid test split string e.g. “99,1,0”

Returns: List[float]

List[float]: The train valid test split ratios e.g. [0.99, 0.01, 0.0]
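The documented example can be reproduced with a short sketch: split the string on commas, convert to floats, and divide by the sum. The function name is hypothetical, and requiring exactly three non-negative values is an assumption; the real function may be more lenient about the input format.

```python
from typing import List


def sketch_parse_split(split: str) -> List[float]:
    """Parse a 'train,valid,test' weighting string into ratios that sum to 1."""
    values = [float(part) for part in split.split(",")]
    if len(values) != 3 or any(v < 0.0 for v in values):
        raise ValueError("expected three non-negative comma-separated weights")
    total = sum(values)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return [v / total for v in values]


ratios = sketch_parse_split("99,1,0")
```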

nemo_automodel.components.datasets.llm.megatron.gpt_dataset._PAD_TOKEN_ID = -100
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.logger = logging.getLogger(__name__)