nemo_automodel.components.datasets.llm.neat_packing

Neat packing (greedy knapsack) for LLM datasets.

This module provides an alternative packing strategy: a greedy knapsack algorithm backed by a min-heap, which fills bins more fully than the sequential first-fit approach in packed_sequence.py.

The packed output uses an indexed attention mask (1, 2, 3, … per sub-sequence, 0 for padding) and position IDs that restart at each sub-sequence, so it works with eager and SDPA attention backends and has no Transformer Engine dependency.
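To make the mask format concrete, the sketch below expands an indexed attention mask into a boolean block-causal mask of the kind an eager or SDPA backend could consume. The helper name `expand_indexed_mask` is hypothetical and not part of this module; how the library itself performs this expansion is not documented here, so this only illustrates the format.

```python
def expand_indexed_mask(indexed_mask: list[int]) -> list[list[bool]]:
    """Expand an indexed mask (1, 2, 3, ... per sub-sequence, 0 = padding)
    into a 2D boolean mask where entry [q][k] is True when query position q
    may attend to key position k."""
    n = len(indexed_mask)
    return [
        [
            indexed_mask[q] != 0                    # padding queries attend to nothing
            and indexed_mask[q] == indexed_mask[k]  # stay inside the same sub-sequence
            and k <= q                              # causal: no attention to later tokens
            for k in range(n)
        ]
        for q in range(n)
    ]


# Two sub-sequences (lengths 3 and 2) followed by one padding position.
indexed = [1, 1, 1, 2, 2, 0]
mask = expand_indexed_mask(indexed)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each sub-sequence forms its own lower-triangular block, so tokens never attend across sample boundaries, and the padding row stays empty.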

Module Contents

Functions

| Name | Description |
| --- | --- |
| `_build_packed_sample` | Concatenate multiple samples into a single packed sample. |
| `greedy_knapsack` | Bin-pack sample indices using a greedy knapsack (min-heap) algorithm. |
| `neat_pack_dataset` | Pack a dataset using greedy knapsack for better utilization. |

Data

CROSS_ENTROPY_IGNORE_IDX

logger

API

nemo_automodel.components.datasets.llm.neat_packing._build_packed_sample(
    samples: list[dict],
    pack_size: int,
    padding_idx: int
) -> dict

Concatenate multiple samples into a single packed sample.

Parameters:

samples

list[dict]

List of sample dicts, each with input_ids and labels (already autoregressive-shifted by the dataset).

pack_size

int

Target packed sequence length (pad to this).

padding_idx

int

Token ID used for padding input_ids.

Returns: dict

Dict with input_ids, labels, attention_mask,
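The concatenation step can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the module's code: the function name `build_packed_sample_sketch` is hypothetical, the `position_ids` key name is assumed from the module overview (the return description above is cut off), and padding labels with `CROSS_ENTROPY_IGNORE_IDX` and padding position IDs with 0 are assumptions as well.

```python
CROSS_ENTROPY_IGNORE_IDX = -100  # same value as the module-level constant


def build_packed_sample_sketch(samples: list[dict], pack_size: int, padding_idx: int) -> dict:
    """Concatenate samples and pad the result to pack_size (illustrative only)."""
    input_ids, labels, attention_mask, position_ids = [], [], [], []
    for seq_index, sample in enumerate(samples, start=1):
        n = len(sample["input_ids"])
        input_ids.extend(sample["input_ids"])
        labels.extend(sample["labels"])
        attention_mask.extend([seq_index] * n)  # 1, 2, 3, ... per sub-sequence
        position_ids.extend(range(n))           # positions restart at 0 per sub-sequence

    pad = pack_size - len(input_ids)
    if pad < 0:
        raise ValueError("samples do not fit into pack_size")

    input_ids.extend([padding_idx] * pad)
    labels.extend([CROSS_ENTROPY_IGNORE_IDX] * pad)  # assumption: padding is excluded from the loss
    attention_mask.extend([0] * pad)                 # 0 marks padding
    position_ids.extend([0] * pad)                   # assumption: padding positions set to 0
    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": attention_mask,
        "position_ids": position_ids,  # assumed key name
    }


packed = build_packed_sample_sketch(
    [
        {"input_ids": [11, 12, 13], "labels": [12, 13, -100]},
        {"input_ids": [21, 22], "labels": [22, -100]},
    ],
    pack_size=7,
    padding_idx=0,
)
print(packed["attention_mask"])
print(packed["position_ids"])
```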

nemo_automodel.components.datasets.llm.neat_packing.greedy_knapsack(
    lengths: list[int],
    max_length: int
) -> list[list[int]]

Bin-pack sample indices using a greedy knapsack (min-heap) algorithm.

Samples are sorted by length in descending order. Each sample is assigned to the bin with the smallest current total that can still accommodate it. If no bin fits, a new bin is created.

Parameters:

lengths

list[int]

Length of each sample.

max_length

int

Maximum capacity of each bin.

Returns: list[list[int]]

A list of bins, where each bin is a list of sample indices.
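The procedure described above can be sketched with Python's `heapq`. This is a minimal illustration, not the module's implementation: the name `greedy_knapsack_sketch` is hypothetical, ties are broken by bin creation order, and every length is assumed to be at most `max_length`. Because the heap is keyed on each bin's current total, the top of the heap is the emptiest bin; if the sample does not fit there, it fits nowhere and a new bin is opened.

```python
import heapq


def greedy_knapsack_sketch(lengths: list[int], max_length: int) -> list[list[int]]:
    """Assign sample indices to bins, longest samples first (illustrative only).

    Precondition (assumption): every length is <= max_length.
    """
    # Visit samples from longest to shortest; sorted() is stable, so equal
    # lengths keep their original relative order.
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])

    bins: list[list[int]] = []
    heap: list[tuple[int, int]] = []  # (current total, bin id), smallest total on top

    for i in order:
        if heap and heap[0][0] + lengths[i] <= max_length:
            # The emptiest bin has room: take it, add the sample, push it back.
            total, bin_id = heapq.heappop(heap)
            bins[bin_id].append(i)
            heapq.heappush(heap, (total + lengths[i], bin_id))
        else:
            # Not even the emptiest bin fits this sample: open a new bin.
            bins.append([i])
            heapq.heappush(heap, (lengths[i], len(bins) - 1))
    return bins


lengths = [4, 3, 2, 5, 1]
bins = greedy_knapsack_sketch(lengths, max_length=6)
print(bins)  # [[3], [0, 4], [1, 2]]
```

Sorting in descending order places the hard-to-fit long samples first and lets shorter ones fill the remaining gaps; each placement costs one heap pop and one push.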

nemo_automodel.components.datasets.llm.neat_packing.neat_pack_dataset(
    dataset,
    split: str,
    pack_size: int,
    max_packs: int | None = None,
    padding_idx: int = 0,
    drop_long_samples: bool = False
) -> datasets.Dataset

Pack a dataset using greedy knapsack for better utilization.

Parameters:

dataset

HuggingFace dataset or dataset dict.

split

str

Dataset split key (e.g. "train").

pack_size

int

Target packed sequence length.

max_packs

int | None. Defaults to None

Optional cap on number of packs to create.

padding_idx

int. Defaults to 0

Token ID for padding.

drop_long_samples

bool. Defaults to False

If True, silently drop samples longer than pack_size; otherwise raise ValueError.

Returns: Dataset

A HuggingFace Dataset of packed samples.
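The `drop_long_samples` behavior can be illustrated with a small stand-alone filter. The helper `select_packable` below is hypothetical and only mirrors the documented semantics (drop when True, raise `ValueError` otherwise); the error message and the point in the pipeline where the library performs this check are assumptions.

```python
def select_packable(lengths: list[int], pack_size: int, drop_long_samples: bool = False) -> list[int]:
    """Return indices of samples that fit into pack_size (illustrative only)."""
    too_long = [i for i, n in enumerate(lengths) if n > pack_size]
    if too_long and not drop_long_samples:
        # Documented behavior: oversized samples are an error unless dropping is enabled.
        raise ValueError(f"{len(too_long)} sample(s) exceed pack_size={pack_size}")
    # A sample exactly pack_size long still fits; only strictly longer ones are removed.
    return [i for i, n in enumerate(lengths) if n <= pack_size]


kept = select_packable([3, 9, 8], pack_size=8, drop_long_samples=True)
print(kept)  # [0, 2]
```

With the real function, the equivalent call would take the form `neat_pack_dataset(dataset, split="train", pack_size=8, drop_long_samples=True)`, where `dataset` is a tokenized HuggingFace dataset whose samples carry `input_ids` and `labels`.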

nemo_automodel.components.datasets.llm.neat_packing.CROSS_ENTROPY_IGNORE_IDX = -100

nemo_automodel.components.datasets.llm.neat_packing.logger = logging.getLogger(__name__)