# nemo_automodel.components.datasets.llm.length_grouped_sampler

Length-grouped sampler for LLM training.

Groups samples by token count so that batches contain similar-length
sequences, minimizing padding waste.  Adapted from the VLM
`LengthGroupedSampler` but simplified for text-only datasets.

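Usage:

```python
from torch.utils.data import DataLoader

from nemo_automodel.components.datasets.llm.length_grouped_sampler import LengthGroupedSampler

# `ds`, `world_size`, and `rank` are supplied by the caller.
sampler = LengthGroupedSampler(
    dataset=ds,
    batch_size=4,
    seed=42,
    num_replicas=world_size,
    rank=rank,
)
dataloader = DataLoader(ds, sampler=sampler, batch_size=4)
```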

## Module Contents

### Classes

| Name                                                                                                          | Description                                                          |
| ------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
| [`LengthGroupedSampler`](#nemo_automodel-components-datasets-llm-length_grouped_sampler-LengthGroupedSampler) | Sampler that groups samples by sequence length for balanced batches. |

### Data

[`logger`](#nemo_automodel-components-datasets-llm-length_grouped_sampler-logger)

### API

```python
class nemo_automodel.components.datasets.llm.length_grouped_sampler.LengthGroupedSampler(
    dataset: torch.utils.data.Dataset,
    batch_size: int = 1,
    seed: int = 42,
    num_replicas: int | None = None,
    rank: int | None = None,
    drop_last: bool = True
)
```

**Bases:** `Sampler[int]`

Sampler that groups samples by sequence length for balanced batches.

Sorts samples by length, chunks into groups of `batch_size`, then
shuffles at the chunk level each epoch.  This preserves intra-batch
length similarity (less padding) while adding per-epoch randomness.

For distributed training, each rank gets an interleaved shard of the
sorted indices.  All ranks use the same `seed + epoch` so chunk *K*
on every rank corresponds to similar-length samples, keeping
cross-rank padding minimal.
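The steps described above can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the function name `length_grouped_indices`, the exact order of the sharding and chunking steps, and the tail-dropping rule are all assumptions made for the example.

```python
import random


def length_grouped_indices(lengths, batch_size, seed, epoch, num_replicas, rank):
    """Illustrative sketch only; not the library's actual code."""
    # 1. Sort sample indices by token count, shortest first.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])

    # 2. Assumption: drop the tail so every rank gets the same number
    #    of full batches (analogous to drop_last=True).
    usable = len(order) - len(order) % (batch_size * num_replicas)
    order = order[:usable]

    # 3. Interleaved shard: this rank takes sorted positions
    #    rank, rank + num_replicas, rank + 2 * num_replicas, ...
    shard = order[rank::num_replicas]

    # 4. Chunk the shard into groups of batch_size.
    chunks = [shard[i:i + batch_size] for i in range(0, len(shard), batch_size)]

    # 5. Shuffle whole chunks. Every rank seeds with seed + epoch, so all
    #    ranks apply the same permutation to their chunk lists.
    rng = random.Random(seed + epoch)
    rng.shuffle(chunks)

    # Flatten back into a stream of sample indices.
    return [idx for chunk in chunks for idx in chunk]


# Toy token counts for eight samples, split across two ranks.
lengths = [5, 120, 7, 118, 64, 60, 6, 119]
rank0 = length_grouped_indices(lengths, batch_size=2, seed=42, epoch=0, num_replicas=2, rank=0)
rank1 = length_grouped_indices(lengths, batch_size=2, seed=42, epoch=0, num_replicas=2, rank=1)
```

Because both ranks shuffle chunk lists of the same size with the same seed, the batch at a given step on rank 0 and on rank 1 comes from the same region of the sorted order.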

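**Parameters:**

| Name           | Default | Description                                                                                                             |
| -------------- | ------- | ----------------------------------------------------------------------------------------------------------------------- |
| `dataset`      | required | The dataset to sample from. Samples must have an `input_ids` key (list or tensor) whose length is used for sorting.    |
| `batch_size`   | `1`     | Local batch size per rank.                                                                                              |
| `seed`         | `42`    | Base random seed (must be the same on all ranks).                                                                       |
| `num_replicas` | `None`  | Number of distributed ranks (default: world size).                                                                      |
| `rank`         | `None`  | This rank's index (default: current rank).                                                                              |
| `drop_last`    | `True`  | Drop the tail indices that don't fill a full batch across all ranks.                                                    |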

```python
nemo_automodel.components.datasets.llm.length_grouped_sampler.LengthGroupedSampler.__iter__() -> typing.Iterator[int]
```

```python
nemo_automodel.components.datasets.llm.length_grouped_sampler.LengthGroupedSampler.__len__() -> int
```

```python
nemo_automodel.components.datasets.llm.length_grouped_sampler.LengthGroupedSampler._compute_lengths(
    dataset: torch.utils.data.Dataset
) -> list[int]
```

Static method.

Compute token lengths for all samples.
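As a rough illustration of what computing lengths involves, the sketch below reads the `input_ids` entry of each sample and takes its length. The helper name `compute_lengths` is hypothetical, and the real method may differ (for example, in how it handles missing keys).

```python
def compute_lengths(dataset):
    """Hypothetical stand-in: token count of each sample's `input_ids`."""
    # len() works for Python lists and for 1-D tensors alike.
    return [len(dataset[i]["input_ids"]) for i in range(len(dataset))]


# A plain list of dicts is enough to stand in for a map-style dataset.
toy_dataset = [
    {"input_ids": [101, 7592, 102]},
    {"input_ids": [101]},
    {"input_ids": [101, 102]},
]
toy_lengths = compute_lengths(toy_dataset)
```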

```python
nemo_automodel.components.datasets.llm.length_grouped_sampler.LengthGroupedSampler.load_state_dict(
    state_dict: typing.Dict[str, typing.Any]
) -> None
```

```python
nemo_automodel.components.datasets.llm.length_grouped_sampler.LengthGroupedSampler.set_epoch(
    epoch: int
) -> None
```

Set the epoch for deterministic per-epoch shuffling.
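The effect of `set_epoch` can be illustrated with a small stand-in class. `ChunkShuffler` is hypothetical and exists only for this example; it assumes, as the class description states, that the shuffle is seeded with `seed + epoch`.

```python
import random


class ChunkShuffler:
    """Hypothetical stand-in for the sampler's per-epoch chunk shuffle."""

    def __init__(self, chunks, seed=42):
        self.chunks = chunks
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # Only records the epoch; the shuffle itself happens in __iter__.
        self.epoch = epoch

    def __iter__(self):
        # Same seed + epoch gives the same chunk order on every call.
        rng = random.Random(self.seed + self.epoch)
        chunks = list(self.chunks)
        rng.shuffle(chunks)
        for chunk in chunks:
            yield from chunk


shuffler = ChunkShuffler([[0, 1], [2, 3], [4, 5], [6, 7]])
orders = []
for epoch in range(3):
    # Call set_epoch before iterating so each epoch uses its own seed.
    shuffler.set_epoch(epoch)
    orders.append(list(shuffler))
```

Calling `set_epoch` with the same value on every rank before each epoch keeps the chunk order reproducible and identical across ranks.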

```python
nemo_automodel.components.datasets.llm.length_grouped_sampler.LengthGroupedSampler.state_dict() -> typing.Dict[str, typing.Any]
```

```python
nemo_automodel.components.datasets.llm.length_grouped_sampler.logger = logging.getLogger(__name__)
```