nemo_automodel.components.datasets.llm.megatron.sampler#

Module Contents#

Classes#

BaseMegatronSampler

Base class for Megatron batch samplers.

MegatronPretrainingSampler

Deterministic sequential sampler with per-rank slicing.

MegatronPretrainingRandomSampler

Randomized sampler with per-epoch shuffling and per-rank slicing.

Functions#

create_megatron_sampler

Factory for Megatron samplers.

API#

class nemo_automodel.components.datasets.llm.megatron.sampler.BaseMegatronSampler(
total_samples: int,
micro_batch_size: int,
data_parallel_rank: int,
data_parallel_size: int,
drop_last: bool = True,
global_batch_size: Optional[int] = None,
pad_samples_to_global_batch_size: Optional[bool] = False,
)#

Base class for Megatron batch samplers.

Provides common validation and shared behavior for Megatron samplers. Implementations must yield lists of dataset indices that correspond to one micro-batch for a single data-parallel rank.

Parameters:
  • total_samples – Total available samples in the dataset.

  • micro_batch_size – Number of samples per micro-batch on each data-parallel rank.

  • data_parallel_rank – Rank id in the data-parallel group that this sampler will serve.

  • data_parallel_size – World size of the data-parallel group.

  • drop_last – If True, drop incomplete batches. If False, implementations may yield a final partial micro-batch (subject to their constraints).

  • global_batch_size – Effective global batch size across all data-parallel ranks; when provided, length is computed in global-batch units and converted to micro-batches.

  • pad_samples_to_global_batch_size – If True and supported by the sampler, the last incomplete global batch will be padded to global_batch_size when drop_last is False.

Initialization

__len__()#

Return the number of micro-batches this sampler will yield.

If global_batch_size is provided, the length is computed in terms of global batches and converted to micro-batches to align with training loops that iterate by micro-batch.
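
A minimal sketch of how that conversion could work, assuming drop_last=True and names mirroring the constructor arguments above (this is an illustrative helper, not the actual implementation):

```python
# Hypothetical sketch of the micro-batch count when global_batch_size is set.
# Assumes drop_last=True; names mirror the constructor arguments above.
def num_micro_batches(total_samples: int,
                      micro_batch_size: int,
                      data_parallel_size: int,
                      global_batch_size: int) -> int:
    # Number of full global batches available in the dataset.
    num_global_batches = total_samples // global_batch_size
    # Micro-batches each rank contributes per global batch.
    micro_batches_per_global = global_batch_size // (micro_batch_size * data_parallel_size)
    return num_global_batches * micro_batches_per_global


# Example: 10_000 samples, micro-batch 4, 8 DP ranks, global batch 256
# -> 39 full global batches * 8 micro-batches each = 312 micro-batches per rank.
print(num_micro_batches(10_000, 4, 8, 256))  # 312
```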

abstract __iter__()#

class nemo_automodel.components.datasets.llm.megatron.sampler.MegatronPretrainingSampler(
total_samples: int,
micro_batch_size: int,
data_parallel_rank: int,
data_parallel_size: int,
drop_last: bool = True,
global_batch_size: Optional[int] = None,
pad_samples_to_global_batch_size: Optional[bool] = False,
)#

Bases: nemo_automodel.components.datasets.llm.megatron.sampler.BaseMegatronSampler

Deterministic sequential sampler with per-rank slicing.

Iterates deterministically over sample indices, splits each global batch across data-parallel ranks, and yields per-rank micro-batches. When drop_last is False and pad_samples_to_global_batch_size is True, the final global batch is padded to the full global_batch_size so that all ranks emit complete micro-batches.

Raises:

RuntimeError – If there are no samples left to consume.

Initialization

get_start_end_idx()#

Return slice boundaries for this rank within a global batch.

Returns:

Tuple of (start_idx, end_idx) used to extract this rank’s micro-batch from a concatenated global batch buffer.
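
A sketch of the conventional computation, assuming ranks are laid out contiguously in micro_batch_size-sized chunks within the global batch (illustrative only):

```python
# Sketch: per-rank slice within a concatenated global-batch buffer,
# assuming contiguous micro_batch_size-sized chunks per rank.
def get_start_end_idx(data_parallel_rank: int, micro_batch_size: int) -> tuple[int, int]:
    start_idx = data_parallel_rank * micro_batch_size
    end_idx = start_idx + micro_batch_size
    return start_idx, end_idx

# With micro_batch_size=4: rank 0 -> (0, 4), rank 1 -> (4, 8), rank 2 -> (8, 12).
```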

__iter__()#

Yield lists of indices forming per-rank micro-batches.

Iterates up to total_samples. Optionally pads the last global batch when drop_last is False and pad_samples_to_global_batch_size is True.
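
Because the iterator yields lists of indices, the sampler can be passed to torch.utils.data.DataLoader as a batch_sampler. A hedged usage sketch, where the dataset is a placeholder and each data-parallel process constructs its own sampler with its own rank:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from nemo_automodel.components.datasets.llm.megatron.sampler import MegatronPretrainingSampler

# Placeholder dataset; in practice this is a tokenized pretraining dataset.
dataset = TensorDataset(torch.arange(1024).unsqueeze(1))

sampler = MegatronPretrainingSampler(
    total_samples=len(dataset),
    micro_batch_size=4,
    data_parallel_rank=0,   # this process's rank in the data-parallel group
    data_parallel_size=2,
    global_batch_size=32,
)

# Each iteration yields one micro-batch of indices for this rank.
loader = DataLoader(dataset, batch_sampler=sampler)
for batch in loader:
    ...
```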

class nemo_automodel.components.datasets.llm.megatron.sampler.MegatronPretrainingRandomSampler(
total_samples: int,
micro_batch_size: int,
data_parallel_rank: int,
data_parallel_size: int,
drop_last: bool = True,
global_batch_size: Optional[int] = None,
pad_samples_to_global_batch_size: Optional[bool] = False,
seed: int = 0,
)#

Bases: nemo_automodel.components.datasets.llm.megatron.sampler.BaseMegatronSampler

Randomized sampler with per-epoch shuffling and per-rank slicing.

Uses a deterministic per-epoch seed (seed + epoch) to shuffle indices within each data-parallel shard (bucket). Notably, this sampler:

  • Does not support padding the last global batch.

  • Requires drop_last=True when the product micro_batch_size * data_parallel_size > 1.

Initialization

__len__()#

Return the number of micro-batches that will be produced.

Accounts for drop_last by excluding a trailing incomplete global batch. When global_batch_size is provided, converts global batches to micro-batches.

__iter__()#

Yield randomized micro-batches for this rank.

Each epoch shuffles indices within the per-rank bucket using torch.randperm seeded by seed + epoch. The sampler then emits contiguous micro-batches of size micro_batch_size for this rank.
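
A sketch of the shuffling scheme described above, assuming each rank owns a contiguous bucket of total_samples // data_parallel_size indices; the real implementation may differ in bucket layout and batching details:

```python
import torch

# Sketch of the per-epoch, per-rank shuffle: deterministic given (seed, epoch),
# so every restart and every rank reproduces the same permutation schedule.
def shuffled_bucket_indices(total_samples: int,
                            data_parallel_rank: int,
                            data_parallel_size: int,
                            seed: int,
                            epoch: int) -> list[int]:
    bucket_size = total_samples // data_parallel_size
    bucket_offset = data_parallel_rank * bucket_size

    g = torch.Generator()
    g.manual_seed(seed + epoch)  # per-epoch seed schedule
    perm = torch.randperm(bucket_size, generator=g)

    return [bucket_offset + int(i) for i in perm]
```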

nemo_automodel.components.datasets.llm.megatron.sampler.create_megatron_sampler(
dataset_len: int,
micro_batch_size: int,
global_batch_size: int,
dataloader_type: Literal['single', 'cyclic'] = 'single',
drop_last: bool = True,
pad_samples_to_global_batch_size: bool = False,
rank: int = 0,
world_size: int = 1,
) → nemo_automodel.components.datasets.llm.megatron.sampler.BaseMegatronSampler#

Factory for Megatron samplers.

Constructs and returns a Megatron-compatible sampler for a dataset of a given length and batch configuration. The returned sampler yields lists of indices per micro-batch for a single data-parallel rank.

Parameters:
  • dataset_len – Number of samples in the underlying dataset.

  • micro_batch_size – Number of samples per micro-batch on each data-parallel rank.

  • global_batch_size – Effective global batch size across all data-parallel ranks (micro_batch_size * world_size * grad_accum).

  • dataloader_type –

    Sampler type to construct. Supported values:

    • "single": Deterministic sequential sampling (MegatronPretrainingSampler).

    • "cyclic": Randomized per-epoch sampling (MegatronPretrainingRandomSampler). The value "batch" is not supported in this implementation.

  • drop_last – When True, drop a trailing incomplete batch.

  • pad_samples_to_global_batch_size – When True and supported by the sampler, pad the final global batch to global_batch_size if drop_last is False.

  • rank – Data-parallel rank id for this process.

  • world_size – Number of data-parallel ranks.

Returns:

Configured sampler instance for the requested type.

Return type:

BaseMegatronSampler

Raises:

Exception – If an unsupported dataloader_type is provided.
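
A hedged end-to-end usage sketch of the factory together with a DataLoader; the dataset here is a placeholder:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from nemo_automodel.components.datasets.llm.megatron.sampler import create_megatron_sampler

dataset = TensorDataset(torch.arange(2048).unsqueeze(1))  # placeholder dataset

batch_sampler = create_megatron_sampler(
    dataset_len=len(dataset),
    micro_batch_size=2,
    global_batch_size=16,
    dataloader_type="cyclic",  # randomized per-epoch sampling
    rank=0,                    # data-parallel rank of this process
    world_size=4,
)

# The sampler yields lists of indices, so it plugs in as batch_sampler.
loader = DataLoader(dataset, batch_sampler=batch_sampler)
```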