`nemo_automodel.components.datasets.llm.neat_packing`

Neat packing (greedy knapsack) for LLM datasets.

This module provides an alternative packing strategy: a greedy knapsack algorithm backed by a min-heap, which gives better bin utilization than the sequential first-fit approach in packed_sequence.py.

The packed output uses an indexed attention mask (1, 2, 3, … for successive sub-sequences, 0 for padding) and position IDs that restart at each sub-sequence, so it works with eager / SDPA attention backends and has no Transformer Engine dependency.
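To make the indexed mask concrete, here is a minimal sketch of how such a mask could be expanded into a boolean attention pattern in which each token attends only to earlier tokens of its own sub-sequence. The helper name `expand_indexed_mask` and the causal, all-masked-padding convention are assumptions for illustration; the source does not show how the library or the attention backend consumes the mask.

```python
def expand_indexed_mask(indexed_mask):
    """Expand an indexed mask into a [seq, seq] boolean attention pattern.

    Hypothetical helper (not part of the module): entry [q][k] is True when
    query position q may attend to key position k.
    """
    n = len(indexed_mask)
    return [
        [
            indexed_mask[q] != 0                    # padding attends to nothing
            and indexed_mask[q] == indexed_mask[k]  # same sub-sequence only
            and k <= q                              # causal: no looking ahead
            for k in range(n)
        ]
        for q in range(n)
    ]


# Two sub-sequences (lengths 2 and 1) followed by one padding slot.
pattern = expand_indexed_mask([1, 1, 2, 0])
for row in pattern:
    print("".join("x" if allowed else "." for allowed in row))
```

The result is block-diagonal: tokens from different sub-sequences in the same pack never see each other, which is the property the indexed mask is meant to encode.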

Module Contents

Functions

`greedy_knapsack`	Bin-pack sample indices using a greedy knapsack (min-heap) algorithm.
`_build_packed_sample`	Concatenate multiple samples into a single packed sample.
`neat_pack_dataset`	Pack a dataset using greedy knapsack for better utilization.

Data

`logger`
`CROSS_ENTROPY_IGNORE_IDX`

API

nemo_automodel.components.datasets.llm.neat_packing.logger: ‘getLogger(…)’

nemo_automodel.components.datasets.llm.neat_packing.CROSS_ENTROPY_IGNORE_IDX: None

nemo_automodel.components.datasets.llm.neat_packing.greedy_knapsack(lengths: list[int], max_length: int) → list[list[int]]

Bin-pack sample indices using a greedy knapsack (min-heap) algorithm.

Samples are sorted by length in descending order. Each sample is assigned to the bin with the smallest current total that can still accommodate it. If no bin fits, a new bin is created.

Parameters:

lengths – Length of each sample.
max_length – Maximum capacity of each bin.

Returns:

A list of bins, where each bin is a list of sample indices.
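The procedure described above can be sketched in a few lines with Python's `heapq`. This is an illustrative reimplementation under stated assumptions, not the module's actual code: the name `greedy_knapsack_sketch` is hypothetical, and the sketch assumes every length is already at most `max_length`.

```python
import heapq


def greedy_knapsack_sketch(lengths, max_length):
    """Illustrative greedy knapsack: returns bins of sample indices."""
    # Visit samples from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

    heap = []  # min-heap of (current_total, bin_id)
    bins = []  # bins[bin_id] is the list of sample indices in that bin

    for idx in order:
        length = lengths[idx]
        # The heap root is the emptiest bin. If the sample does not fit
        # there, it cannot fit in any fuller bin either.
        if heap and heap[0][0] + length <= max_length:
            total, bin_id = heapq.heappop(heap)
            bins[bin_id].append(idx)
            heapq.heappush(heap, (total + length, bin_id))
        else:
            bins.append([idx])
            heapq.heappush(heap, (length, len(bins) - 1))
    return bins


lengths = [5, 3, 4, 2, 6]
bins = greedy_knapsack_sketch(lengths, max_length=8)
print(bins)  # [[4], [0, 3], [2, 1]]
```

Checking only the heap root is sufficient because it is the bin with the smallest current total, so each assignment costs O(log n) rather than a scan over all bins.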

nemo_automodel.components.datasets.llm.neat_packing._build_packed_sample(samples: list[dict], pack_size: int, padding_idx: int) → dict

Concatenate multiple samples into a single packed sample.

Parameters:

samples – List of sample dicts, each with input_ids and labels (already autoregressive-shifted by the dataset).
pack_size – Target packed sequence length (pad to this).
padding_idx – Token ID used for padding input_ids.

Returns:

Dict with input_ids, labels, attention_mask, position_ids — all tensors of shape [pack_size].
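A minimal sketch of the concatenation step, using plain Python lists instead of tensors so it runs without extra dependencies. Several details are assumptions rather than documented behavior: the name `build_packed_sample_sketch`, the ignore index `-100` (the PyTorch cross-entropy default, since the value of `CROSS_ENTROPY_IGNORE_IDX` is not shown here), and padding `labels` with the ignore index and `position_ids` with 0.

```python
IGNORE_IDX = -100  # assumption: stand-in for CROSS_ENTROPY_IGNORE_IDX


def build_packed_sample_sketch(samples, pack_size, padding_idx):
    """Illustrative packing of several samples into one fixed-length sample."""
    input_ids, labels, attention_mask, position_ids = [], [], [], []

    # Sub-sequences are numbered 1, 2, 3, ...; 0 is reserved for padding.
    for seq_no, sample in enumerate(samples, start=1):
        n = len(sample["input_ids"])
        input_ids.extend(sample["input_ids"])
        labels.extend(sample["labels"])
        attention_mask.extend([seq_no] * n)
        position_ids.extend(range(n))  # positions restart at 0 per sample

    pad = pack_size - len(input_ids)
    if pad < 0:
        raise ValueError("samples do not fit in pack_size")

    return {
        "input_ids": input_ids + [padding_idx] * pad,
        "labels": labels + [IGNORE_IDX] * pad,  # assumed: padding is not scored
        "attention_mask": attention_mask + [0] * pad,
        "position_ids": position_ids + [0] * pad,  # assumed padding value
    }


samples = [
    {"input_ids": [11, 12, 13], "labels": [12, 13, IGNORE_IDX]},
    {"input_ids": [21, 22], "labels": [22, IGNORE_IDX]},
]
packed = build_packed_sample_sketch(samples, pack_size=8, padding_idx=0)
print(packed["attention_mask"])  # [1, 1, 1, 2, 2, 0, 0, 0]
print(packed["position_ids"])    # [0, 1, 2, 0, 1, 0, 0, 0]
```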

nemo_automodel.components.datasets.llm.neat_packing.neat_pack_dataset(dataset, split: str, pack_size: int, max_packs: int | None = None, padding_idx: int = 0, drop_long_samples: bool = False) → datasets.Dataset

Pack a dataset using greedy knapsack for better utilization.

Parameters:

dataset – HuggingFace dataset or dataset dict.
split – Dataset split key (e.g. "train").
pack_size – Target packed sequence length.
max_packs – Optional cap on number of packs to create.
padding_idx – Token ID for padding.
drop_long_samples – If True, silently drop samples longer than pack_size; otherwise raise ValueError.

Returns:

A HuggingFace Dataset of packed samples.
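The `drop_long_samples` behavior described above can be illustrated with a small pre-filter. The helper `select_packable` is hypothetical and only mirrors the documented contract (drop when True, raise `ValueError` otherwise); the real function may validate samples at a different point or with a different message.

```python
def select_packable(lengths, pack_size, drop_long_samples=False):
    """Return indices of samples that fit in a pack (hypothetical helper)."""
    kept = []
    for idx, length in enumerate(lengths):
        if length > pack_size:
            if drop_long_samples:
                continue  # silently skip the oversized sample
            raise ValueError(
                f"sample {idx} has length {length}, "
                f"which exceeds pack_size {pack_size}"
            )
        kept.append(idx)
    return kept


kept = select_packable([3, 9, 8], pack_size=8, drop_long_samples=True)
print(kept)  # [0, 2]
```

With the default `drop_long_samples=False`, the same input raises `ValueError`, which surfaces tokenization or truncation problems early instead of quietly shrinking the dataset.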
