core.datasets.blended_megatron_dataset_builder#

Module Contents#

Classes#

BlendedMegatronDatasetBuilder

Builder class for the BlendedDataset and MegatronDataset classes

Functions#

_get_size_per_split_per_dataset

Determine the contribution of the MegatronDataset splits to the BlendedDataset splits

Data#

API#

core.datasets.blended_megatron_dataset_builder.logger#

‘getLogger(…)’

core.datasets.blended_megatron_dataset_builder.MidLevelDataset#

None

core.datasets.blended_megatron_dataset_builder.TopLevelDataset#

None

core.datasets.blended_megatron_dataset_builder.DistributedDataset#

None

class core.datasets.blended_megatron_dataset_builder.BlendedMegatronDatasetBuilder(
cls: Type[core.datasets.blended_megatron_dataset_builder.MidLevelDataset],
sizes: List[int],
is_built_on_rank: Callable,
config: megatron.core.datasets.blended_megatron_dataset_config.BlendedMegatronDatasetConfig,
)#

Bases: object

Builder class for the BlendedDataset and MegatronDataset classes

Parameters:
  • cls (Type[MegatronDataset]) – The class to instantiate, must inherit from MegatronDataset

  • sizes (List[Optional[int]]) – The minimum total number of samples to draw, or None, per split

  • is_built_on_rank (Callable) – A callable which returns True if the dataset should be built on the current rank and False otherwise. It should be Megatron Core parallelism aware, i.e. global rank, local group rank, and virtual rank may inform its return value. It should return True for exactly one process on global rank 0.

  • config (BlendedMegatronDatasetConfig) – The config object which informs dataset creation
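
As an illustration of the is_built_on_rank contract, the sketch below decides from the RANK environment variable that launchers such as torchrun export. This is a deliberate simplification: it ignores tensor, pipeline, and virtual ranks, and the function name is hypothetical rather than part of this module.

```python
import os


def build_on_global_rank_zero_only() -> bool:
    """Hypothetical is_built_on_rank callable: True only on global rank 0.

    Assumes the launcher exports RANK (as torchrun does). Falls back to
    rank 0 for single-process runs where RANK is unset.
    """
    return int(os.environ.get("RANK", "0")) == 0
```

A parallelism-aware callable would typically also inspect the process's position within its model-parallel groups before returning True.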

Initialization

build() List[Optional[core.datasets.blended_megatron_dataset_builder.TopLevelDataset]]#

Build all dataset splits according to the provided blend(s)

This method is distributed-aware and must be called on all ranks.

The dataset splits returned can vary according to the config. Supply config.blend and config.split to build BlendedDataset and/or MegatronDataset splits from the same distribution. Supply config.blend_per_split to build BlendedDataset and/or MegatronDataset splits from separate distributions. In either case, for each split, handle the following cases:

(1) The split is None - do nothing

(2) The split has one contributing dataset, and…

(a) 'size' is not None
    - Build a mid-level dataset with low-level dataset sampling in proportion to the
    size

(b) 'size' is None
    - Build mid-level datasets with no excess low-level dataset sampling

(3) The split has multiple contributing datasets, and…

(a) 'weights' is not None and 'size' is not None
    - Build mid-level datasets with low-level dataset sampling in proportion to their
    weights and the size
    - Build a top-level dataset of length marginally greater than 'size' with mid-level
    dataset sampling in proportion to their weights and the size

(b) 'weights' is not None and 'size' is None
    - Error

(c) 'weights' is None and 'size' is not None
    - Build mid-level datasets with no excess low-level dataset sampling
    - Build a top-level dataset of length 'size' (capped at the sum of the mid-level
    dataset lengths) with mid-level dataset sampling in proportion to their lengths
    and the size

(d) 'weights' is None and 'size' is None
    - Build mid-level datasets with no excess low-level dataset sampling
    - Build a top-level dataset with no excess mid-level dataset sampling

Returns:

A list containing a dataset instance (or None) per split

Return type:

List[Optional[TopLevelDataset]]
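
The case analysis above can be read as a small decision table. The following sketch only labels which documented case applies to a single split; it builds nothing, and the function name and the use of num_datasets == 0 to stand in for "the split is None" are assumptions made for illustration.

```python
from typing import List, Optional


def describe_split_case(
    num_datasets: int,
    weights: Optional[List[float]],
    size: Optional[int],
) -> str:
    """Return the label of the documented case that applies to one split.

    num_datasets == 0 stands in for "the split is None" in this sketch.
    """
    if num_datasets == 0:
        return "1"  # nothing to build
    if num_datasets == 1:
        # A single contributing dataset: only 'size' matters
        return "2a" if size is not None else "2b"
    if weights is not None:
        if size is None:
            # Weighted blending needs a target size to scale against
            raise ValueError("case 3b: weights were given without a size")
        return "3a"
    return "3c" if size is not None else "3d"
```

For example, describe_split_case(2, [0.3, 0.7], 1000) falls under case 3a, while the same call with size=None raises, mirroring the error in case 3b.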

_build_blended_dataset_splits() List[Optional[core.datasets.blended_megatron_dataset_builder.TopLevelDataset]]#

Build all dataset splits according to the provided blend(s)

See the BlendedMegatronDatasetBuilder.build alias for more information.

Returns:

A list containing a dataset instance (or None) per split

Return type:

List[Optional[TopLevelDataset]]

_build_megatron_datasets_parallel(
prefixes: List[str],
split: List[float],
sizes_per_dataset: List[List[int]],
) List[List[Optional[megatron.core.datasets.megatron_dataset.MegatronDataset]]]#

Build the megatron datasets for a list of prefixes in parallel

Parameters:
  • prefixes (List[str]) – The list of prefix strings

  • split (List[float]) – The dataset split ratios (must sum to 1.00)

  • sizes_per_dataset (List[List[int]]) – The number of samples to request per MegatronDataset per split

Returns:

For each split, a list of MegatronDataset instances (or None), one per prefix

Return type:

List[List[Optional[MegatronDataset]]]
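
The shape of the return value (outer index is the split, inner index is the prefix) is easiest to see in code. The sketch below is one plausible way to get there, assuming a thread pool and a hypothetical build_splits callable that returns one dataset (or None) per split for a single prefix. It is not taken from the library's implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def build_per_prefix_then_group_by_split(
    prefixes: List[str],
    sizes_per_dataset: List[List[int]],
    build_splits: Callable[[str, List[int]], List[Optional[T]]],
    max_workers: int = 4,
) -> List[List[Optional[T]]]:
    """Build every prefix concurrently, then regroup the results by split.

    build_splits is a hypothetical stand-in for the per-prefix builder.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so results line up with prefixes
        per_prefix = list(pool.map(build_splits, prefixes, sizes_per_dataset))
    # Transpose [prefix][split] into [split][prefix]
    return [list(group) for group in zip(*per_prefix)]
```

Here sizes_per_dataset[i] holds the per-split sample requests for prefixes[i], matching the "per MegatronDataset per split" layout described above.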

_build_megatron_dataset_splits(
dataset_path: Optional[str],
split: List[float],
sizes: List[int],
synchronize_ranks: bool = True,
) List[Optional[core.datasets.blended_megatron_dataset_builder.MidLevelDataset]]#

Build each MidLevelDataset split from a single LowLevelDataset

Parameters:
  • dataset_path (Optional[str]) – The path on disk which defines the underlying LowLevelDataset, or None for mock dataset classes

  • split (List[Tuple[float, float]]) – The dataset split matrix

  • sizes (List[int]) – The number of total samples to draw from each split

  • synchronize_ranks (bool) – Whether to call a barrier to enforce the rank-0 / barrier / other-ranks build order. Set to False when this order is enforced at a higher level.

Returns:

The MidLevelDataset (or None) per split

Return type:

List[Optional[MidLevelDataset]]
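
The signature types split as List[float], while the parameter description calls it a split matrix of (start, end) pairs. One reading is that ratios are expanded into contiguous ranges over the low-level dataset. The sketch below shows that conversion under this assumption; the helper name is hypothetical, and treating a zero ratio as None (split not built) is also an assumption.

```python
from typing import List, Optional, Tuple


def split_ratios_to_matrix(
    ratios: List[float],
) -> List[Optional[Tuple[float, float]]]:
    """Turn split ratios into (start, end) fractions of the underlying data.

    A zero ratio maps to None, meaning the split is not built (an
    assumption of this sketch).
    """
    matrix: List[Optional[Tuple[float, float]]] = []
    start = 0.0
    for ratio in ratios:
        end = start + ratio
        matrix.append((start, end) if ratio > 0 else None)
        start = end  # the next split begins where this one ends
    return matrix
```

With ratios [0.5, 0.25, 0.25] this yields [(0.0, 0.5), (0.5, 0.75), (0.75, 1.0)], so each split covers a disjoint slice of the same low-level dataset.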

static build_generic_dataset(
cls: Union[Type[core.datasets.blended_megatron_dataset_builder.DistributedDataset], Callable],
is_built_on_rank: Callable,
synchronize_ranks: bool,
*args: Any,
) Optional[Union[core.datasets.blended_megatron_dataset_builder.DistributedDataset, Iterable]]#

Build the DistributedDataset

Return None if and only if the underlying dataset class is not built on the current rank and torch.distributed is initialized.

Parameters:
  • cls (Union[Type[DistributedDataset], Callable]) – The DistributedDataset class to be built. In special cases, e.g. when we are building the low level dataset for a RawMegatronDataset instance, we can accept a Callable which returns an Iterable.

  • is_built_on_rank (Callable) – A callable which returns True if the dataset should be built on the current rank and False otherwise.

  • synchronize_ranks (bool) – Whether to call a barrier to enforce the rank-0 / barrier / other-ranks build order. Set to False when this order is enforced at a higher level.

  • args (Tuple[Any]) – The positional arguments used to build the provided DistributedDataset class

Raises:

Exception – When the dataset constructor raises an OSError

Returns:

The DistributedDataset instantiation, the Iterable instantiation, or None

Return type:

Optional[Union[DistributedDataset, Iterable]]
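
The rank-0 / barrier / other-ranks order described above can be sketched without a live process group. In the example below, rank and barrier are injected parameters (assumptions made so the code runs standalone); in a real job they would come from torch.distributed. The function name is hypothetical and this is an outline of the pattern, not the library's code.

```python
from typing import Any, Callable, Optional


def build_rank_zero_first(
    factory: Callable[..., Any],
    is_built_on_rank: Callable[[], bool],
    synchronize_ranks: bool,
    rank: int,
    barrier: Callable[[], None],
    *args: Any,
) -> Optional[Any]:
    """Build on rank 0, wait at a barrier, then build on the other ranks.

    rank and barrier are injected so the sketch runs without a process
    group; they are not parameters of the documented method.
    """
    dataset = None
    if rank == 0 and is_built_on_rank():
        try:
            dataset = factory(*args)
        except OSError as err:
            # Re-raise I/O failures with context, keeping the original cause
            raise Exception(f"Failed to build dataset on rank 0: {err}") from err
    if synchronize_ranks:
        barrier()  # other ranks wait here until rank 0 has finished
    if rank != 0 and is_built_on_rank():
        dataset = factory(*args)
    return dataset
```

Building on rank 0 first lets that process write any on-disk artifacts once, so the remaining ranks can read them after the barrier instead of racing to create them.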

core.datasets.blended_megatron_dataset_builder._get_size_per_split_per_dataset(
normalized_weights: List[float],
target_size_per_split: List[int],
surplus: float = 0.0,
) List[List[int]]#

Determine the contribution of the MegatronDataset splits to the BlendedDataset splits

Parameters:
  • normalized_weights (List[float]) – e.g. [0.3, 0.7]

  • target_size_per_split (List[int]) – The number of samples to target for each BlendedDataset split

  • surplus (float) – The sample surplus to build per split per dataset

Returns:

The number of samples to request per MegatronDataset per split

Return type:

List[List[int]]
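
To make the parameters concrete, here is a minimal sketch of one way such sizes could be computed: scale each split target by the dataset weight, round up, then pad by the surplus fraction. The exact rounding used by the library is not stated in this page, so the formula and the function name are assumptions.

```python
import math
from typing import List


def sketch_size_per_split_per_dataset(
    normalized_weights: List[float],
    target_size_per_split: List[int],
    surplus: float = 0.0,
) -> List[List[int]]:
    """Request ceil(weight * target) samples, padded by the surplus fraction.

    The result is indexed [dataset][split]. The rounding scheme is an
    assumption of this sketch.
    """
    assert math.isclose(sum(normalized_weights), 1.0), "weights must sum to 1"
    return [
        [
            int(math.ceil(math.ceil(target * weight) * (1 + surplus)))
            for target in target_size_per_split
        ]
        for weight in normalized_weights
    ]
```

With weights [0.25, 0.75] and targets [1000, 100, 10], the sketch returns [[250, 25, 3], [750, 75, 8]]. Rounding up means the per-dataset requests can slightly exceed the target in total, which keeps the blended split from running short.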