nemo_automodel.components.datasets.llm.length_grouped_sampler
Length-grouped sampler for LLM training.
Groups samples by token count so that batches contain similar-length
sequences, minimizing padding waste. Adapted from the VLM
LengthGroupedSampler but simplified for text-only datasets.
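Usage::

    from torch.utils.data import DataLoader
    from nemo_automodel.components.datasets.llm.length_grouped_sampler import LengthGroupedSampler

    # ds, world_size, and rank are supplied by the training script.
    sampler = LengthGroupedSampler(
        dataset=ds,
        batch_size=4,
        seed=42,
        num_replicas=world_size,
        rank=rank,
    )
    # Pass the same dataset and batch size to the DataLoader.
    dataloader = DataLoader(ds, sampler=sampler, batch_size=4)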
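To make the "padding waste" motivation concrete, the following sketch counts pad tokens when every batch is padded to its longest sequence, first in arbitrary order and then sorted by length. The helper `padding_tokens` and the token counts are hypothetical and are not part of this module.

```python
def padding_tokens(lengths, batch_size):
    """Count pad tokens when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


# Hypothetical token counts for eight samples (illustration only).
lengths = [12, 480, 35, 510, 20, 495, 40, 500]

unsorted_waste = padding_tokens(lengths, batch_size=4)
sorted_waste = padding_tokens(sorted(lengths), batch_size=4)

print(unsorted_waste, sorted_waste)  # 1948 108
```

With these made-up lengths, grouping similar-length samples cuts the pad tokens from 1948 to 108.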
Module Contents
Classes
Data
API
Bases: Sampler[int]
Sampler that groups samples by sequence length for balanced batches.
Sorts samples by length, chunks into groups of batch_size, then
shuffles at the chunk level each epoch. This preserves intra-batch
length similarity (less padding) while adding per-epoch randomness.
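The sort, chunk, and chunk-level shuffle described above can be sketched as follows. This is a minimal illustration and not the library's implementation: the function name `length_grouped_order`, the ascending sort direction, and the use of Python's `random` module are all assumptions.

```python
import random


def length_grouped_order(lengths, batch_size, seed=0, epoch=0):
    """Return sample indices grouped into similar-length chunks."""
    # Sort indices by sequence length (ascending is an assumption).
    sorted_idx = sorted(range(len(lengths)), key=lambda i: lengths[i])
    # Cut the sorted indices into chunks of batch_size.
    chunks = [
        sorted_idx[i:i + batch_size]
        for i in range(0, len(sorted_idx), batch_size)
    ]
    # Shuffle whole chunks, so each chunk keeps its similar-length members.
    random.Random(seed + epoch).shuffle(chunks)
    return [idx for chunk in chunks for idx in chunk]


# Hypothetical token counts; index i has length lengths[i].
lengths = [50, 3, 47, 5, 20, 22]
order = length_grouped_order(lengths, batch_size=2, seed=42, epoch=0)
batches = [tuple(order[i:i + 2]) for i in range(0, len(order), 2)]
```

Each entry of `batches` is one of the sorted chunks `(1, 3)`, `(4, 5)`, or `(2, 0)`; only the order in which the chunks appear depends on the seed and epoch.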
For distributed training, each rank gets an interleaved shard of the
sorted indices. All ranks use the same seed + epoch so chunk K
on every rank corresponds to similar-length samples, keeping
cross-rank padding minimal.
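One way to picture the interleaved sharding is the sketch below, in which rank `r` takes every `num_replicas`-th index of the sorted list starting at `r`, and all ranks shuffle their chunks with the same seed. The function `rank_chunks` is hypothetical, and the sketch assumes the sample count divides evenly into `batch_size * num_replicas`.

```python
import random


def rank_chunks(lengths, batch_size, num_replicas, rank, seed=0, epoch=0):
    """Return this rank's chunks of similar-length sample indices."""
    sorted_idx = sorted(range(len(lengths)), key=lambda i: lengths[i])
    # Interleaved shard: neighbouring sorted indices land on different ranks.
    shard = sorted_idx[rank::num_replicas]
    chunks = [shard[i:i + batch_size] for i in range(0, len(shard), batch_size)]
    # Same seed + epoch on every rank gives the same chunk permutation.
    random.Random(seed + epoch).shuffle(chunks)
    return chunks


# Hypothetical data: 16 samples with lengths 10..25 (index i has length 10 + i).
lengths = list(range(10, 26))
rank0 = rank_chunks(lengths, batch_size=2, num_replicas=2, rank=0, seed=42)
rank1 = rank_chunks(lengths, batch_size=2, num_replicas=2, rank=1, seed=42)

# Length spread across both ranks at each step K.
spreads = [
    max(lengths[i] for i in c0 + c1) - min(lengths[i] for i in c0 + c1)
    for c0, c1 in zip(rank0, rank1)
]
```

Because both ranks apply the same permutation, chunk K on rank 0 and chunk K on rank 1 come from the same neighbourhood of the sorted list, so the spread in `spreads` stays small at every step.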
Parameters:
- dataset: The dataset to sample from. Samples must have an input_ids key (list or tensor) whose length is used for sorting.
- batch_size: Local batch size per rank.
- seed: Base random seed (must be the same on all ranks).
- num_replicas: Number of distributed ranks (default: world size).
- rank: This rank’s index (default: current rank).
- Drop the tail indices that don’t fill a full batch across all ranks.
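As a rough reading of the tail-dropping option, the sketch below computes how many indices survive when the total must split into full batches on every rank. The helper `usable_length` is hypothetical, and the exact rounding the sampler applies is an assumption.

```python
def usable_length(num_samples, batch_size, num_replicas):
    """Largest sample count that forms full batches on every rank."""
    global_batch = batch_size * num_replicas
    return num_samples - num_samples % global_batch


# 103 samples, 4 per rank, 8 ranks: a global batch holds 32 samples.
kept = usable_length(num_samples=103, batch_size=4, num_replicas=8)
print(kept)  # 96
```

Under this assumption, the 7 leftover indices would be dropped so that every rank runs the same number of steps.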
Compute token lengths for all samples.
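A minimal sketch of such a length computation, assuming each sample is a mapping with an `input_ids` entry as described under Parameters. The function name `compute_lengths` and the toy dataset are illustrative only.

```python
def compute_lengths(dataset):
    """Return the token count of each sample's input_ids."""
    # len() works for Python lists and for 1-D tensors alike.
    return [len(sample["input_ids"]) for sample in dataset]


# Hypothetical toy dataset.
toy = [
    {"input_ids": [101, 7592, 102]},
    {"input_ids": [101]},
    {"input_ids": [101, 102]},
]
lengths = compute_lengths(toy)
print(lengths)  # [3, 1, 2]
```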
Set the epoch for deterministic per-epoch shuffling.
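The effect of setting the epoch can be illustrated with a small stand-in class that reseeds its shuffle from `seed + epoch`. The class `EpochShuffledChunks` is hypothetical, and the method name `set_epoch` follows the convention of `torch.utils.data.DistributedSampler`; that this sampler uses the same name is an assumption.

```python
import random


class EpochShuffledChunks:
    """Hypothetical stand-in that shuffles chunks from seed + epoch."""

    def __init__(self, chunks, seed=0):
        self.chunks = list(chunks)
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # Store the epoch; the next iteration derives its shuffle from it.
        self.epoch = epoch

    def __iter__(self):
        order = list(self.chunks)
        random.Random(self.seed + self.epoch).shuffle(order)
        return iter(order)


chunks = [[0, 1], [2, 3], [4, 5], [6, 7]]
sampler = EpochShuffledChunks(chunks, seed=42)
sampler.set_epoch(3)
first_pass = list(sampler)
second_pass = list(sampler)  # same epoch, so the same order
```

In a training loop, the call would typically sit at the top of each epoch, before the dataloader is iterated, so that every rank derives the same chunk order for that epoch.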