nemo_automodel.components.datasets.llm.neat_packing
Neat packing (greedy knapsack) for LLM datasets.
This module provides an alternative packing strategy that uses a greedy
knapsack algorithm (min-heap) for better bin-packing utilization compared
to the sequential first-fit approach in packed_sequence.py.
The packed output uses an indexed attention mask (1, 2, 3, … per sub-sequence, 0 for padding) and position IDs that restart at each sub-sequence, so it works with eager and SDPA attention backends without a Transformer Engine dependency.
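To illustrate what an indexed attention mask encodes, the following minimal sketch expands one into a boolean "may attend" matrix in which each token sees only earlier tokens of its own sub-sequence. The function name expand_indexed_mask_sketch is hypothetical, and the expansion shown is an assumption about how such a mask is typically consumed, not the module's actual code path.

```python
def expand_indexed_mask_sketch(indexed_mask):
    """Expand an indexed mask into a block-diagonal causal boolean matrix.

    Hypothetical helper for illustration only. Entry [q][k] is True when
    query position q may attend to key position k.
    """
    n = len(indexed_mask)
    return [
        [
            indexed_mask[q] != 0                    # padding attends to nothing
            and indexed_mask[q] == indexed_mask[k]  # same sub-sequence only
            and k <= q                              # causal: no looking ahead
            for k in range(n)
        ]
        for q in range(n)
    ]


# Two sub-sequences of length 2 followed by one padding token.
allowed = expand_indexed_mask_sketch([1, 1, 2, 2, 0])
```

Here the first token of the second sub-sequence (row 2) can attend only to itself, which is what keeps packed samples from leaking into one another.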
Module Contents
Functions
Data
API
Concatenate multiple samples into a single packed sample.
Parameters:
List of sample dicts, each with input_ids and labels
(already autoregressive-shifted by the dataset).
Target packed sequence length (pad to this).
Token ID used for padding input_ids.
Returns: dict
Dict with input_ids, labels, attention_mask,
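The concatenation step described above can be sketched roughly as follows. This is an illustrative approximation, not the module's implementation: the function name concat_samples_sketch, the position_ids key, the -100 label padding value, and the choice to pad position IDs with 0 are all assumptions.

```python
def concat_samples_sketch(samples, pack_size, padding_idx=0, ignore_index=-100):
    """Concatenate samples into one packed sample (illustrative sketch).

    Assumptions: the "position_ids" key, ignore_index=-100 for padded
    labels, and 0 for padded position IDs are not confirmed by the source.
    """
    input_ids, labels, attention_mask, position_ids = [], [], [], []
    for seq_no, sample in enumerate(samples, start=1):
        n = len(sample["input_ids"])
        input_ids.extend(sample["input_ids"])
        labels.extend(sample["labels"])
        attention_mask.extend([seq_no] * n)  # indexed mask: 1, 2, 3, ...
        position_ids.extend(range(n))        # positions restart per sample

    pad = pack_size - len(input_ids)
    if pad < 0:
        raise ValueError("samples do not fit in pack_size")

    input_ids.extend([padding_idx] * pad)
    labels.extend([ignore_index] * pad)
    attention_mask.extend([0] * pad)         # 0 marks padding
    position_ids.extend([0] * pad)
    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": attention_mask,
        "position_ids": position_ids,
    }


packed = concat_samples_sketch(
    [
        {"input_ids": [11, 12, 13], "labels": [12, 13, -100]},
        {"input_ids": [21, 22], "labels": [22, -100]},
    ],
    pack_size=8,
)
```

With these two samples and a pack size of 8, the last three positions are padding and carry mask index 0.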
Bin-pack sample indices using a greedy knapsack (min-heap) algorithm.
Samples are sorted by length in descending order. Each sample is assigned to the bin with the smallest current total that can still accommodate it. If no bin fits, a new bin is created.
Parameters:
Length of each sample.
Maximum capacity of each bin.
Returns: list[list[int]]
A list of bins, where each bin is a list of sample indices.
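A minimal sketch of the greedy knapsack described above, using heapq as the min-heap. The function name greedy_knapsack_sketch and the decision to raise ValueError for an oversized sample are assumptions made for illustration. Because the heap's top is the bin with the smallest total, checking only that bin is enough: if the emptiest bin cannot take the sample, no bin can.

```python
import heapq


def greedy_knapsack_sketch(lengths, max_length):
    """Assign sample indices to bins (illustrative sketch, not the module's code)."""
    # Visit samples from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

    heap = []  # entries are (current_total, bin_id); smallest total on top
    bins = []  # bins[bin_id] is a list of sample indices
    for idx in order:
        n = lengths[idx]
        if n > max_length:
            # Assumption: oversized samples are rejected here.
            raise ValueError(f"sample {idx} has length {n} > {max_length}")
        if heap and heap[0][0] + n <= max_length:
            # The least-full bin has room: place the sample there.
            total, bin_id = heapq.heappop(heap)
            bins[bin_id].append(idx)
            heapq.heappush(heap, (total + n, bin_id))
        else:
            # Not even the least-full bin fits it: open a new bin.
            bins.append([idx])
            heapq.heappush(heap, (n, len(bins) - 1))
    return bins


lengths = [5, 3, 4, 2, 6]
bins = greedy_knapsack_sketch(lengths, max_length=8)
# bins == [[4], [0, 3], [2, 1]], with totals 6, 7 and 7
```

Each placement costs O(log B) for B open bins, so the whole pass is dominated by the initial sort.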
Pack a dataset using greedy knapsack for better utilization.
Parameters:
HuggingFace dataset or dataset dict.
Dataset split key (e.g. "train").
Target packed sequence length.
Optional cap on number of packs to create.
Token ID for padding.
If True, silently drop samples longer than
pack_size; otherwise raise ValueError.
Returns: Dataset
A HuggingFace Dataset of packed samples.
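The handling of over-long samples described in the parameters above can be sketched as a pre-filter that runs before bin-packing. The function name select_packable_sketch and its parameter names are hypothetical; only the drop-or-raise behaviour comes from the description.

```python
def select_packable_sketch(lengths, pack_size, drop_long=False):
    """Return indices of samples that fit in a pack (illustrative sketch)."""
    kept = []
    for i, n in enumerate(lengths):
        if n > pack_size:
            if drop_long:
                continue  # silently skip the over-long sample
            raise ValueError(
                f"sample {i} has length {n}, exceeding pack_size {pack_size}"
            )
        kept.append(i)
    return kept


kept = select_packable_sketch([3, 9, 5], pack_size=8, drop_long=True)
# kept == [0, 2]
```

Raising by default makes silent data loss an explicit opt-in, which is presumably why the flag exists.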