nemo_automodel.components.datasets.llm.megatron.builder

Module Contents

Classes

Name | Description
BlendedDataset | Conjugating class for a set of MegatronDataset instances
BlendedMegatronDatasetBuilder | Builder class for the BlendedDataset and MegatronDataset classes

Functions

Name | Description
_get_size_per_split_per_dataset | Determine the contribution of the MegatronDataset splits to the BlendedDataset splits

Data

_VERBOSE

logger

API

class nemo_automodel.components.datasets.llm.megatron.builder.BlendedDataset(
datasets: typing.List[nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset],
weights: typing.List[typing.Union[int, float]],
size: typing.Optional[int],
config: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig
)

Bases: Dataset

Conjugating class for a set of MegatronDataset instances

Parameters:

datasets
List[MegatronDataset]

The MegatronDataset instances to blend

weights
List[Union[int, float]]

The weights that determine the dataset blend ratios

size
Optional[int]

The number of samples to draw from the blend. If None, for each dataset index idx draw exactly weights[idx] samples from datasets[idx].

config
BlendedMegatronDatasetConfig

The config

Raises:

  • RuntimeError: When the dataset has fewer or more samples than ‘size’ post-initialization

split = self.datasets[0].index_split
unique_description
unique_description_hash
nemo_automodel.components.datasets.llm.megatron.builder.BlendedDataset.__getitem__(
idx: int
) -> typing.Dict[str, typing.Union[int, numpy.ndarray]]
nemo_automodel.components.datasets.llm.megatron.builder.BlendedDataset.__len__() -> int
nemo_automodel.components.datasets.llm.megatron.builder.BlendedDataset._build_indices() -> typing.Tuple[numpy.ndarray, numpy.ndarray]

Build and optionally cache the dataset index and the dataset sample index

The dataset index is a 1-D mapping which determines the dataset to query. The dataset sample index is a 1-D mapping which determines the sample to request from the queried dataset.

Returns: Tuple[numpy.ndarray, numpy.ndarray]

Tuple[numpy.ndarray, numpy.ndarray]: The dataset index and the dataset sample index
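
To make the two index arrays concrete, here is a minimal NumPy sketch of one way such a pair of mappings could be built: at each position, pick the dataset that lags furthest behind its weighted share. This greedy scheme and the name `sketch_blending_indices` are assumptions for illustration only; the page does not state which algorithm `_build_indices` uses, nor how the result is cached.

```python
import numpy as np


def sketch_blending_indices(weights, size):
    """Hypothetical helper: return (dataset_index, dataset_sample_index)."""
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()  # normalize so the weights sum to 1

    dataset_index = np.zeros(size, dtype=np.int16)
    dataset_sample_index = np.zeros(size, dtype=np.int64)
    drawn = np.zeros(len(weights), dtype=np.int64)  # samples taken per dataset

    for i in range(size):
        # How far each dataset lags behind its ideal share after i + 1 draws.
        deficit = weights * (i + 1) - drawn
        choice = int(np.argmax(deficit))
        dataset_index[i] = choice                # which dataset to query
        dataset_sample_index[i] = drawn[choice]  # which sample within it
        drawn[choice] += 1

    return dataset_index, dataset_sample_index


ds_idx, sample_idx = sketch_blending_indices([1, 3], size=8)
# ds_idx     -> [1, 0, 1, 1, 1, 0, 1, 1]
# sample_idx -> [0, 0, 1, 2, 3, 1, 4, 5]
```

Under this sketch, looking up blended sample `i` amounts to reading `datasets[ds_idx[i]][sample_idx[i]]`.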

class nemo_automodel.components.datasets.llm.megatron.builder.BlendedMegatronDatasetBuilder(
sizes: list[int],
is_built_on_rank: typing.Callable,
config: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig,
enabled_splits: typing.Optional[list[str]] = None
)

Builder class for the BlendedDataset and MegatronDataset classes

Args:

sizes (List[Optional[int]]): The minimum total number of samples to draw, or None, per split

is_built_on_rank (Callable): A callable which returns True if the dataset should be built on the current rank and False otherwise. It should be Megatron Core parallelism aware i.e. global rank, local group rank, and virtual rank may inform its return value.

config (BlendedMegatronDatasetConfig): The config object which informs dataset creation

nemo_automodel.components.datasets.llm.megatron.builder.BlendedMegatronDatasetBuilder._build_blended_dataset_splits() -> typing.List[typing.Optional[nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset]]

Build all dataset splits according to the provided blend(s)

See the BlendedMegatronDatasetBuilder.build alias for more information.

Returns: List[Optional[GPTDataset]]

List[Optional[GPTDataset]]: A list containing a dataset instance (or None) per split

nemo_automodel.components.datasets.llm.megatron.builder.BlendedMegatronDatasetBuilder._build_megatron_dataset_splits(
dataset_path: typing.Optional[str],
split: typing.List[float],
sizes: typing.List[int],
synchronize_ranks: bool = True
) -> typing.List[typing.Optional[nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset]]

Build each MidLevelDataset split from a single LowLevelDataset

Parameters:

dataset_path
Optional[str]

The path on disk which defines the underlying LowLevelDataset, or None for mock dataset classes

split
List[Tuple[float, float]]

The dataset split matrix

sizes
List[int]

The number of total samples to draw from each split

synchronize_ranks
bool, defaults to True

Whether to synchronize ranks with a barrier so that rank 0 builds first and the other ranks build afterwards. Set to False when this ordering is enforced at a higher level.

Returns: List[Optional[GPTDataset]]

List[Optional[GPTDataset]]: The GPTDataset (or None) per split

nemo_automodel.components.datasets.llm.megatron.builder.BlendedMegatronDatasetBuilder._build_megatron_datasets_parallel(
prefixes: typing.List[str],
split: typing.List[float],
sizes_per_dataset: typing.List[typing.List[int]]
) -> typing.List[typing.List[typing.Optional[nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset]]]

Build the megatron datasets for a list of prefixes in parallel

Parameters:

prefixes
List[str]

The list of prefix strings

split
List[float]

The dataset split ratios (must sum to 1.00)

sizes_per_dataset
List[List[int]]

The number of samples to request

Returns: List[List[Optional[GPTDataset]]]

List[List[Optional[GPTDataset]]]: For each split, have a list of

nemo_automodel.components.datasets.llm.megatron.builder.BlendedMegatronDatasetBuilder._is_enabled_index(
idx: int
) -> bool

Return True if a given split index should be built.

If no enabled_splits were provided, all splits are enabled.

nemo_automodel.components.datasets.llm.megatron.builder.BlendedMegatronDatasetBuilder._masked_split_matrix(
split_matrix: typing.List[typing.Optional[tuple]]
) -> typing.List[typing.Optional[tuple]]

Mask splits that are not enabled by setting their bookends to None.

This preserves the original split ratios while skipping construction for disabled splits.
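
As an illustration of the masking described above, the following sketch replaces the bookends of disabled splits with None and leaves the rest untouched. The split names, their order (`train`, `valid`, `test`), and the function name are assumptions made for this example; the page does not specify how `enabled_splits` names map to split indices.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed split order for this sketch only.
SPLIT_NAMES = ("train", "valid", "test")

Bookends = Optional[Tuple[float, float]]


def sketch_masked_split_matrix(
    split_matrix: Sequence[Bookends],
    enabled_splits: Optional[Sequence[str]] = None,
) -> List[Bookends]:
    """Hypothetical helper: set the bookends of disabled splits to None."""
    if enabled_splits is None:
        # No filter supplied: every split stays enabled.
        return list(split_matrix)
    enabled = set(enabled_splits)
    return [
        bookends if name in enabled else None
        for name, bookends in zip(SPLIT_NAMES, split_matrix)
    ]


matrix = [(0.0, 0.9), (0.9, 0.95), (0.95, 1.0)]
masked = sketch_masked_split_matrix(matrix, enabled_splits=["train"])
# masked -> [(0.0, 0.9), None, None]
```

Because the surviving bookends are not rescaled, the enabled split still covers the same fraction of the underlying data as it would with every split enabled.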

nemo_automodel.components.datasets.llm.megatron.builder.BlendedMegatronDatasetBuilder.build() -> typing.List[typing.Optional[nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset]]

Build all dataset splits according to the provided blend(s)

This method is distributed-aware and must be called on all ranks.

The dataset splits returned can vary according to the config. Supply config.blend and config.split to build BlendedDataset and/or MegatronDataset splits from the same distribution. Supply config.blend_per_split to build BlendedDataset and/or MegatronDataset splits from separate distributions. In either case, for each split, handle the following cases:

(1) The split is None

  • do nothing

(2) The split has one contributing dataset, and…

(a) ‘size’ is not None

  • Build a mid-level dataset with low-level dataset sampling in proportion to the size

(b) ‘size’ is None

  • Build mid-level datasets with no excess low-level dataset sampling

(3) The split has multiple contributing datasets, and…

(a) ‘weights’ is not None and ‘size’ is not None

  • Build mid-level datasets with low-level dataset sampling in proportion to their weights and the size
  • Build a top-level dataset of length marginally greater than ‘size’ with mid-level dataset sampling in proportion to their weights and the size

(b) ‘weights’ is not None and ‘size’ is None

  • Error

(c) ‘weights’ is None and ‘size’ is not None

  • Build mid-level datasets with no excess low-level dataset sampling
  • Build a top-level dataset of length ‘size’ (capped at the sum of the mid-level dataset lengths) with mid-level dataset sampling in proportion to their lengths and the size

(d) ‘weights’ is None and ‘size’ is None

  • Build mid-level datasets with no excess low-level dataset sampling
  • Build a top-level dataset with no excess mid-level dataset sampling

Returns: List[Optional[GPTDataset]]

List[Optional[GPTDataset]]: A list containing a dataset instance (or None) per split
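
The case analysis above can be summarized as a small decision function. This is only a reading aid: the function name, its arguments, and the returned labels are hypothetical and are not part of the builder's API.

```python
def sketch_split_case(split_is_none, num_datasets, has_weights, has_size):
    """Hypothetical helper: return the label of the case that applies to a split."""
    if split_is_none:
        return "1"  # do nothing
    if num_datasets == 1:
        # One contributing dataset: only 'size' matters.
        return "2a" if has_size else "2b"
    # Multiple contributing datasets: both 'weights' and 'size' matter.
    if has_weights and has_size:
        return "3a"  # sample in proportion to weights and size
    if has_weights and not has_size:
        raise ValueError("case 3b: weights without a size is an error")
    if has_size:
        return "3c"  # top-level length 'size', capped at the sum of lengths
    return "3d"      # no excess sampling at either level


label = sketch_split_case(
    split_is_none=False, num_datasets=2, has_weights=True, has_size=True
)
# label -> "3a"
```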

nemo_automodel.components.datasets.llm.megatron.builder.BlendedMegatronDatasetBuilder.build_generic_dataset(
is_built_on_rank: typing.Callable,
synchronize_ranks: bool,
args: typing.Any = ()
) -> typing.Optional[typing.Union[nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset | nemo_automodel.components.datasets.llm.megatron.builder.BlendedDataset, typing.Iterable]]
staticmethod

Build the GPTDataset or BlendedDataset

Return None if and only if the underlying dataset class is not built on the current rank and torch.distributed is initialized.

Parameters:

cls
Union[Type[GPTDataset | BlendedDataset], Callable]

The GPTDataset or BlendedDataset class to be built. In special cases, e.g. when we are building the low level dataset for a RawMegatronDataset instance, we can accept a Callable which returns an Iterable.

synchronize_ranks
bool

Whether to synchronize ranks with a barrier so that rank 0 builds first and the other ranks build afterwards. Set to False when this ordering is enforced at a higher level.

args
Tuple[Any], defaults to ()

The positional arguments used to build the provided GPTDataset or BlendedDataset class

Returns: Optional[Union[GPTDataset | BlendedDataset, Iterable]]

Optional[Union[GPTDataset | BlendedDataset, Iterable]]: The GPTDataset or BlendedDataset instantiation, the Iterable instantiation, or None

Raises:

  • Exception: When the dataset constructor raises an OSError
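
The rank-0 / barrier / other-ranks ordering mentioned for `synchronize_ranks` can be sketched without a distributed runtime by passing the rank and the barrier in explicitly. Everything here is an assumption for illustration: the function name, the injected `rank` and `barrier` arguments, and the error message. In a real job the rank and the barrier would come from the distributed backend.

```python
from typing import Any, Callable, Optional, Tuple


def sketch_build_generic(
    factory: Callable[..., Any],
    rank: int,
    is_built_on_rank: Callable[[], bool],
    barrier: Callable[[], None],
    synchronize_ranks: bool = True,
    args: Tuple[Any, ...] = (),
) -> Optional[Any]:
    """Hypothetical helper: build on rank 0 first, then on the other ranks."""
    dataset = None

    # Rank 0 goes first, so any shared side effects happen once.
    if rank == 0 and is_built_on_rank():
        try:
            dataset = factory(*args)
        except OSError as err:
            # Mirrors the documented behavior: an OSError surfaces as Exception.
            raise Exception("Failed to build the dataset on rank 0") from err

    # Every rank waits here unless the caller synchronizes at a higher level.
    if synchronize_ranks:
        barrier()

    # Remaining ranks build after rank 0 has finished.
    if rank != 0 and is_built_on_rank():
        dataset = factory(*args)

    # None means this rank was not asked to build the dataset.
    return dataset
```

Injecting `barrier` keeps the sketch runnable in a single process, for example with `barrier=lambda: None`.
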
nemo_automodel.components.datasets.llm.megatron.builder._get_size_per_split_per_dataset(
normalized_weights: typing.List[float],
target_size_per_split: typing.List[int],
surplus: float = 0.0
) -> typing.List[typing.List[int]]

Determine the contribution of the MegatronDataset splits to the BlendedDataset splits

Parameters:

normalized_weights
List[float]

e.g. [0.3, 0.7]

target_size_per_split
List[int]

The number of samples to target for each BlendedDataset split

surplus
float, defaults to 0.0

The sample surplus to build per split per dataset

Returns: List[List[int]]

List[List[int]]: The number of samples to request per MegatronDataset per split
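
A worked sketch of the computation may help. It assumes that each request is the weighted target rounded up, then inflated by `surplus` interpreted as a fraction (for example, 0.5 meaning 50% extra) and rounded up again. That rounding rule and the name `sketch_size_per_split_per_dataset` are assumptions; the page only documents the inputs and the shape of the output.

```python
import math
from typing import List


def sketch_size_per_split_per_dataset(
    normalized_weights: List[float],
    target_size_per_split: List[int],
    surplus: float = 0.0,
) -> List[List[int]]:
    """Hypothetical helper: samples to request per dataset (outer) per split (inner)."""
    # The weights are expected to be normalized already.
    assert math.isclose(sum(normalized_weights), 1.0)
    return [
        [
            # Assumed rule: round up the weighted share, then add the surplus.
            int(math.ceil(math.ceil(target * weight) * (1 + surplus)))
            for target in target_size_per_split
        ]
        for weight in normalized_weights
    ]


sizes = sketch_size_per_split_per_dataset([0.25, 0.75], [1000, 10])
# sizes -> [[250, 3], [750, 8]]
```

Rounding up means the per-dataset requests can add up to slightly more than the target for a split, as with the second split here (3 + 8 = 11 for a target of 10).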

nemo_automodel.components.datasets.llm.megatron.builder._VERBOSE = False
nemo_automodel.components.datasets.llm.megatron.builder.logger = logging.getLogger(__name__)