nemo_automodel.components.datasets.llm.megatron.sampler#
Module Contents#
Classes#

| Name | Description |
|---|---|
| BaseMegatronSampler | Base class for Megatron batch samplers. |
| MegatronPretrainingSampler | Deterministic sequential sampler with per-rank slicing. |
| MegatronPretrainingRandomSampler | Randomized sampler with per-epoch shuffling and per-rank slicing. |

Functions#

| Name | Description |
|---|---|
| create_megatron_sampler | Factory for Megatron samplers. |
API#
- class nemo_automodel.components.datasets.llm.megatron.sampler.BaseMegatronSampler(
- total_samples: int,
- micro_batch_size: int,
- data_parallel_rank: int,
- data_parallel_size: int,
- drop_last: bool = True,
- global_batch_size: Optional[int] = None,
- pad_samples_to_global_batch_size: Optional[bool] = False,
- )
Base class for Megatron batch samplers.
Provides common validation and shared behavior for Megatron samplers. Implementations must yield lists of dataset indices that correspond to one micro-batch for a single data-parallel rank.
- Parameters:
total_samples – Total available samples in the dataset.
micro_batch_size – Number of samples per micro-batch on each data-parallel rank.
data_parallel_rank – Rank id in the data-parallel group that this sampler will serve.
data_parallel_size – World size of the data-parallel group.
drop_last – If True, drop incomplete batches. If False, implementations may yield a final partial micro-batch (subject to their constraints).
global_batch_size – Effective global batch size across all data-parallel ranks; when provided, length is computed in global-batch units and converted to micro-batches.
pad_samples_to_global_batch_size – If True and supported by the sampler, the last incomplete global batch will be padded to `global_batch_size` when `drop_last` is False.
Initialization
- __len__()#
Return the number of micro-batches this sampler will yield.
If `global_batch_size` is provided, the length is computed in terms of global batches and converted to micro-batches to align with training loops that iterate by micro-batch.
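As a rough illustration of that conversion, here is a minimal sketch. It assumes the reported length is the per-rank micro-batch count and that each full global batch contributes `global_batch_size // (micro_batch_size * data_parallel_size)` micro-batches per rank; the exact rounding depends on `drop_last`.

```python
# Illustrative arithmetic only; the authoritative formula lives in the sampler itself.
total_samples = 1000
global_batch_size = 64       # assumed: micro_batch_size * data_parallel_size * grad_accum
micro_batch_size = 8
data_parallel_size = 4

num_global_batches = total_samples // global_batch_size          # 15 with drop_last=True
micro_batches_per_rank_per_global_batch = global_batch_size // (
    micro_batch_size * data_parallel_size
)                                                                # 2
print(num_global_batches * micro_batches_per_rank_per_global_batch)  # 30, i.e. len(sampler)
```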
- abstract __iter__()#
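A minimal sketch of the subclass contract: `__iter__` must yield lists of dataset indices, one list per micro-batch for this data-parallel rank. The toy class below is illustrative only and is not part of the module; it assumes the constructor arguments are stored as same-named attributes on the base class.

```python
from nemo_automodel.components.datasets.llm.megatron.sampler import BaseMegatronSampler


class ToyStridedSampler(BaseMegatronSampler):
    """Illustrative subclass; attribute names mirror the constructor args (assumed)."""

    def __iter__(self):
        batch = []
        # Walk this rank's strided shard of the dataset and emit fixed-size micro-batches.
        for idx in range(self.data_parallel_rank, self.total_samples, self.data_parallel_size):
            batch.append(idx)
            if len(batch) == self.micro_batch_size:
                yield batch
                batch = []
        # Emit a trailing partial micro-batch only when drop_last is disabled.
        if batch and not self.drop_last:
            yield batch
```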
- class nemo_automodel.components.datasets.llm.megatron.sampler.MegatronPretrainingSampler(
- total_samples: int,
- micro_batch_size: int,
- data_parallel_rank: int,
- data_parallel_size: int,
- drop_last: bool = True,
- global_batch_size: Optional[int] = None,
- pad_samples_to_global_batch_size: Optional[bool] = False,
- )
Bases:
nemo_automodel.components.datasets.llm.megatron.sampler.BaseMegatronSampler
Deterministic sequential sampler with per-rank slicing.
Iterates deterministically over sample indices, splits each global batch across data-parallel ranks, and yields per-rank micro-batches. When `drop_last` is False and `pad_samples_to_global_batch_size` is True, the final global batch is padded to a full size so that all ranks emit complete micro-batches.
- Raises:
RuntimeError – If there are no samples left to consume.
Initialization
- get_start_end_idx()#
Return slice boundaries for this rank within a global batch.
- Returns:
Tuple of `(start_idx, end_idx)` used to extract this rank's micro-batch from a concatenated global batch buffer.
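A small worked example of that slicing, assuming the global batch buffer is laid out rank-by-rank in contiguous `micro_batch_size`-wide slices:

```python
# Hypothetical values for illustration; the method derives these from the sampler's own state.
micro_batch_size = 4
data_parallel_rank = 1

start_idx = data_parallel_rank * micro_batch_size  # 4
end_idx = start_idx + micro_batch_size             # 8
# Rank 1 would read global_batch[4:8] as its micro-batch.
```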
- __iter__()#
Yield lists of indices forming per-rank micro-batches.
Iterates up to `total_samples`. Optionally pads the last global batch when `drop_last` is False and `pad_samples_to_global_batch_size` is True.
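A usage sketch pairing this sampler with a PyTorch `DataLoader`; the tiny `TensorDataset` below is a stand-in for a real dataset, and the batch sizes are arbitrary illustrative values.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from nemo_automodel.components.datasets.llm.megatron.sampler import MegatronPretrainingSampler

# Stand-in map-style dataset purely for demonstration.
train_dataset = TensorDataset(torch.arange(100).unsqueeze(1))

sampler = MegatronPretrainingSampler(
    total_samples=len(train_dataset),
    micro_batch_size=4,
    data_parallel_rank=0,
    data_parallel_size=2,
    drop_last=True,
    global_batch_size=16,
)

# batch_sampler consumes the lists of indices yielded above, so each DataLoader
# batch corresponds to one per-rank micro-batch.
loader = DataLoader(train_dataset, batch_sampler=sampler, num_workers=2)
for micro_batch in loader:
    ...  # forward/backward on this rank's micro-batch
```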
- class nemo_automodel.components.datasets.llm.megatron.sampler.MegatronPretrainingRandomSampler(
- total_samples: int,
- micro_batch_size: int,
- data_parallel_rank: int,
- data_parallel_size: int,
- drop_last: bool = True,
- global_batch_size: Optional[int] = None,
- pad_samples_to_global_batch_size: Optional[bool] = False,
- seed: int = 0,
- )
Bases:
nemo_automodel.components.datasets.llm.megatron.sampler.BaseMegatronSampler
Randomized sampler with per-epoch shuffling and per-rank slicing.
Uses a deterministic seed schedule `seed + epoch` to randomize indices within each data-parallel shard (bucket). Notably, this sampler:
- Does not support padding the last global batch.
- Requires `drop_last=True` when the product `micro_batch_size * data_parallel_size > 1`.
Initialization
- __len__()#
Return the number of micro-batches that will be produced.
Accounts for `drop_last` by excluding a trailing incomplete global batch. When `global_batch_size` is provided, converts global batches to micro-batches.
- __iter__()#
Yield randomized micro-batches for this rank.
Each epoch shuffles indices within the per-rank bucket using `torch.randperm` seeded by `seed + epoch`. The sampler then emits contiguous micro-batches of size `micro_batch_size` for this rank.
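A minimal sketch of the seeding scheme described above, assuming a fresh `torch.Generator` seeded with `seed + epoch` drives the permutation of this rank's bucket (the bucket size here is a hypothetical value):

```python
import torch

seed, epoch = 0, 3
bucket_size = 8  # number of samples owned by this data-parallel rank (hypothetical)

g = torch.Generator()
g.manual_seed(seed + epoch)
shuffled = torch.randperm(bucket_size, generator=g).tolist()
print(shuffled)  # identical every time epoch 3 is replayed, on every rank
```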
- nemo_automodel.components.datasets.llm.megatron.sampler.create_megatron_sampler(
- dataset_len: int,
- micro_batch_size: int,
- global_batch_size: int,
- dataloader_type: Literal['single', 'cyclic'] = 'single',
- drop_last: bool = True,
- pad_samples_to_global_batch_size: bool = False,
- rank: int = 0,
- world_size: int = 1,
- )
Factory for Megatron samplers.
Constructs and returns a Megatron-compatible sampler for a dataset of a given length and batch configuration. The returned sampler yields lists of indices per micro-batch for a single data-parallel rank.
- Parameters:
dataset_len – Number of samples in the underlying dataset.
micro_batch_size – Number of samples per micro-batch on each data-parallel rank.
global_batch_size – Effective global batch size across all data-parallel ranks (`micro_batch_size * world_size * grad_accum`).
dataloader_type – Sampler type to construct. Supported values:
- "single": Deterministic sequential sampling (`MegatronPretrainingSampler`).
- "cyclic": Randomized per-epoch sampling (`MegatronPretrainingRandomSampler`).
The value "batch" is not supported in this implementation.
drop_last – When True, drop a trailing incomplete batch.
pad_samples_to_global_batch_size – When True and supported by the sampler, pad the final global batch to `global_batch_size` if `drop_last` is False.
rank – Data-parallel rank id for this process.
world_size – Number of data-parallel ranks.
- Returns:
Configured sampler instance for the requested type.
- Raises:
Exception – If an unsupported `dataloader_type` is provided.
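A factory usage sketch; in real training the `rank` and `world_size` values would come from the distributed setup rather than literals, and the toy dataset below stands in for a real one.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from nemo_automodel.components.datasets.llm.megatron.sampler import create_megatron_sampler

train_dataset = TensorDataset(torch.arange(256).unsqueeze(1))  # stand-in dataset

batch_sampler = create_megatron_sampler(
    dataset_len=len(train_dataset),
    micro_batch_size=2,
    global_batch_size=32,
    dataloader_type="single",   # use "cyclic" for the randomized sampler
    drop_last=True,
    rank=0,
    world_size=4,
)

loader = DataLoader(train_dataset, batch_sampler=batch_sampler)
```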