nemo_automodel.components.datasets.llm.neat_packing


Neat packing (greedy knapsack) for LLM datasets.

This module provides an alternative packing strategy that uses a greedy knapsack algorithm (min-heap) for better bin-packing utilization compared to the sequential first-fit approach in packed_sequence.py.

The packed output uses an indexed attention mask (1, 2, 3, … for successive sub-sequences, 0 for padding) and position IDs that reset at the start of each sub-sequence, so it works with the eager and SDPA attention backends and does not depend on Transformer Engine.
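To make the mask and position-ID layout concrete, here is a minimal sketch in plain Python. The helper names `packed_mask_and_positions` and `block_causal_mask` are hypothetical and not part of this module. Two things are assumptions: that padding positions get position ID 0, and that an attention backend would expand the indexed mask into a block-diagonal causal mask as shown.

```python
def packed_mask_and_positions(lengths, pack_size):
    """Build an indexed attention mask and reset position IDs for one pack."""
    if sum(lengths) > pack_size:
        raise ValueError("sub-sequences do not fit in pack_size")
    attention_mask, position_ids = [], []
    for seq_index, length in enumerate(lengths, start=1):
        attention_mask.extend([seq_index] * length)  # 1, 2, 3, ... per sub-sequence
        position_ids.extend(range(length))           # positions restart at 0
    pad = pack_size - len(attention_mask)
    attention_mask.extend([0] * pad)                 # 0 marks padding
    position_ids.extend([0] * pad)                   # assumption: padding uses position 0
    return attention_mask, position_ids


def block_causal_mask(attention_mask):
    """Assumed expansion: a query may attend to earlier keys in the same sub-sequence."""
    n = len(attention_mask)
    return [
        [
            attention_mask[q] != 0
            and attention_mask[q] == attention_mask[k]
            and k <= q
            for k in range(n)
        ]
        for q in range(n)
    ]


mask, pos = packed_mask_and_positions([3, 2], pack_size=8)
print(mask)  # [1, 1, 1, 2, 2, 0, 0, 0]
print(pos)   # [0, 1, 2, 0, 1, 0, 0, 0]
allowed = block_causal_mask(mask)
```

In this sketch, token 3 (the first token of the second sub-sequence) cannot attend to token 2 (the last token of the first), which is the isolation between packed samples that the indexed mask is meant to encode.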

Module Contents

Functions

Name | Description
_build_packed_sample | Concatenate multiple samples into a single packed sample.
greedy_knapsack | Bin-pack sample indices using a greedy knapsack (min-heap) algorithm.
neat_pack_dataset | Pack a dataset using greedy knapsack for better utilization.

Data

CROSS_ENTROPY_IGNORE_IDX

logger

API

nemo_automodel.components.datasets.llm.neat_packing._build_packed_sample(
samples: list[dict],
pack_size: int,
padding_idx: int
) -> dict

Concatenate multiple samples into a single packed sample.

Parameters:

samples
list[dict]

List of sample dicts, each with input_ids and labels (already autoregressive-shifted by the dataset).

pack_size
int

Target packed sequence length (pad to this).

padding_idx
int

Token ID used for padding input_ids.

Returns: dict

Dict with input_ids, labels, attention_mask,
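As a rough illustration of the concatenate-then-pad step, the sketch below builds only the three keys named above. The function name `build_packed_sample_sketch` is hypothetical. Padding `labels` with the ignore index and raising when the samples exceed `pack_size` are assumptions, not confirmed behavior of `_build_packed_sample`.

```python
CROSS_ENTROPY_IGNORE_IDX = -100  # same value as the module-level constant


def build_packed_sample_sketch(samples, pack_size, padding_idx):
    """Concatenate samples and pad to pack_size (illustrative only)."""
    input_ids, labels, attention_mask = [], [], []
    for seq_index, sample in enumerate(samples, start=1):
        input_ids.extend(sample["input_ids"])
        labels.extend(sample["labels"])
        # Indexed mask: every token of the i-th sample gets the value i.
        attention_mask.extend([seq_index] * len(sample["input_ids"]))

    pad = pack_size - len(input_ids)
    if pad < 0:
        # Assumption: an over-full pack is treated as an error.
        raise ValueError("samples exceed pack_size")

    input_ids.extend([padding_idx] * pad)
    labels.extend([CROSS_ENTROPY_IGNORE_IDX] * pad)  # assumption: padding is ignored by the loss
    attention_mask.extend([0] * pad)
    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": attention_mask,
    }


packed = build_packed_sample_sketch(
    [
        {"input_ids": [11, 12, 13], "labels": [12, 13, -100]},
        {"input_ids": [21, 22], "labels": [22, -100]},
    ],
    pack_size=8,
    padding_idx=0,
)
print(packed["attention_mask"])  # [1, 1, 1, 2, 2, 0, 0, 0]
```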

nemo_automodel.components.datasets.llm.neat_packing.greedy_knapsack(
lengths: list[int],
max_length: int
) -> list[list[int]]

Bin-pack sample indices using a greedy knapsack (min-heap) algorithm.

Samples are sorted by length in descending order. Each sample is assigned to the bin with the smallest current total that can still accommodate it. If no bin fits, a new bin is created.

Parameters:

lengths
list[int]

Length of each sample.

max_length
int

Maximum capacity of each bin.

Returns: list[list[int]]

A list of bins, where each bin is a list of sample indices.
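The description above can be sketched with Python's `heapq`. This is an illustrative reimplementation under stated assumptions, not the module's source: the name `greedy_knapsack_sketch` is hypothetical, and raising on a sample longer than `max_length` is an assumption. Because the heap's root is the bin with the smallest total, it is also the bin with the most free space, so if the sample does not fit there it fits in no existing bin.

```python
import heapq


def greedy_knapsack_sketch(lengths, max_length):
    """Assign sample indices to bins, longest samples first."""
    # Indices sorted by sample length, descending.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    heap = []  # min-heap of (current_total, bin_id)
    bins = []  # bins[bin_id] -> list of sample indices
    for idx in order:
        length = lengths[idx]
        if length > max_length:
            # Assumption: oversized samples are rejected here.
            raise ValueError(f"sample {idx} exceeds max_length")
        if heap and heap[0][0] + length <= max_length:
            # The least-filled bin has room: place the sample there.
            total, bin_id = heapq.heappop(heap)
            bins[bin_id].append(idx)
            heapq.heappush(heap, (total + length, bin_id))
        else:
            # No existing bin can hold the sample: open a new one.
            bins.append([idx])
            heapq.heappush(heap, (length, len(bins) - 1))
    return bins


bins = greedy_knapsack_sketch([5, 3, 4, 2, 6], max_length=8)
print(bins)  # [[4], [0, 3], [2, 1]]
```

Here the five samples (20 tokens in total) land in three bins with totals 6, 7, and 7, which matches the minimum possible number of bins for a capacity of 8.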

nemo_automodel.components.datasets.llm.neat_packing.neat_pack_dataset(
dataset,
split: str,
pack_size: int,
max_packs: int | None = None,
padding_idx: int = 0,
drop_long_samples: bool = False
) -> datasets.Dataset

Pack a dataset using greedy knapsack for better utilization.

Parameters:

dataset

HuggingFace dataset or dataset dict.

split
str

Dataset split key (e.g. "train").

pack_size
int

Target packed sequence length.

max_packs
int | None, defaults to None

Optional cap on number of packs to create.

padding_idx
int, defaults to 0

Token ID for padding.

drop_long_samples
bool, defaults to False

If True, silently drop samples longer than pack_size; otherwise raise ValueError.

Returns: Dataset

A HuggingFace Dataset of packed samples.
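The `drop_long_samples` contract described above can be illustrated with a small stand-alone filter. The helper `select_packable` is hypothetical and only mirrors the documented behavior (drop silently when `True`, raise `ValueError` otherwise). How `neat_pack_dataset` performs this check internally is not shown here.

```python
def select_packable(lengths, pack_size, drop_long_samples=False):
    """Return indices of samples that fit in a pack of size pack_size."""
    keep = []
    for idx, length in enumerate(lengths):
        if length > pack_size:
            if drop_long_samples:
                continue  # silently skip the oversized sample
            raise ValueError(
                f"sample {idx} has length {length}, which exceeds pack_size {pack_size}"
            )
        keep.append(idx)
    return keep


kept = select_packable([3, 10, 5], pack_size=8, drop_long_samples=True)
print(kept)  # [0, 2]
```

With the default `drop_long_samples=False`, the same input raises `ValueError` because the second sample (length 10) cannot fit in a pack of size 8.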

nemo_automodel.components.datasets.llm.neat_packing.CROSS_ENTROPY_IGNORE_IDX = -100
nemo_automodel.components.datasets.llm.neat_packing.logger = logging.getLogger(__name__)