nemo_automodel.components.datasets.llm.neat_packing#
Neat packing (greedy knapsack) for LLM datasets.
This module provides an alternative packing strategy that uses a greedy
knapsack algorithm (min-heap) for better bin-packing utilization compared
to the sequential first-fit approach in packed_sequence.py.
The packed output uses an indexed attention mask (1, 2, 3, … per sub-sequence, 0 for padding) and reset position IDs so that it works with eager / SDPA attention backends, with no Transformer Engine dependency.
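To make the mask convention concrete, here is a minimal sketch, using plain Python lists rather than tensors, of how an indexed mask could be expanded into a block-diagonal causal mask for an eager or SDPA backend. The helper name `expand_indexed_mask` is hypothetical and not part of this module, and the position ID assigned to padding slots is an assumption; the module's actual handling may differ.

```python
def expand_indexed_mask(indexed_mask: list[int]) -> list[list[bool]]:
    """Expand an indexed mask into a [seq_len, seq_len] boolean mask.

    Position i may attend to position j only when both carry the same
    non-zero sub-sequence index and j <= i (causal ordering).
    """
    n = len(indexed_mask)
    return [
        [
            indexed_mask[i] != 0
            and indexed_mask[i] == indexed_mask[j]
            and j <= i
            for j in range(n)
        ]
        for i in range(n)
    ]


# Two sub-sequences (lengths 3 and 2) followed by one padding slot.
indexed_mask = [1, 1, 1, 2, 2, 0]
# Positions restart at 0 for each sub-sequence; 0 for padding is an assumption.
position_ids = [0, 1, 2, 0, 1, 0]

allowed = expand_indexed_mask(indexed_mask)
```

In this sketch, tokens of the second sub-sequence cannot attend to the first, and the padding slot attends to nothing.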
Module Contents#
Functions#
greedy_knapsack | Bin-pack sample indices using a greedy knapsack (min-heap) algorithm.
_build_packed_sample | Concatenate multiple samples into a single packed sample.
neat_pack_dataset | Pack a dataset using greedy knapsack for better utilization.
Data#
API#
- nemo_automodel.components.datasets.llm.neat_packing.logger#
'getLogger(…)'
- nemo_automodel.components.datasets.llm.neat_packing.CROSS_ENTROPY_IGNORE_IDX#
None
- nemo_automodel.components.datasets.llm.neat_packing.greedy_knapsack(
    lengths: list[int],
    max_length: int,
  )
Bin-pack sample indices using a greedy knapsack (min-heap) algorithm.
Samples are sorted by length in descending order. Each sample is assigned to the bin with the smallest current total that can still accommodate it. If no bin fits, a new bin is created.
- Parameters:
lengths – Length of each sample.
max_length – Maximum capacity of each bin.
- Returns:
A list of bins, where each bin is a list of sample indices.
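The description above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the module's source: the name `greedy_knapsack_sketch` is hypothetical, and raising `ValueError` for a sample longer than `max_length` is an assumption. Because the heap top is the bin with the smallest current total, if that bin cannot take the sample, no bin can, so a new one is opened.

```python
import heapq


def greedy_knapsack_sketch(lengths: list[int], max_length: int) -> list[list[int]]:
    """Group sample indices into bins whose total length is <= max_length."""
    # Visit samples from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

    bins: list[list[int]] = []
    heap: list[tuple[int, int]] = []  # (current total, bin id), smallest total on top

    for idx in order:
        length = lengths[idx]
        if length > max_length:
            # Assumption: overlong samples are rejected here.
            raise ValueError(f"sample {idx} has length {length} > {max_length}")

        if heap and heap[0][0] + length <= max_length:
            # The emptiest bin has room: place the sample there.
            total, bin_id = heapq.heappop(heap)
            bins[bin_id].append(idx)
            heapq.heappush(heap, (total + length, bin_id))
        else:
            # Even the emptiest bin is too full, so open a new bin.
            bins.append([idx])
            heapq.heappush(heap, (length, len(bins) - 1))

    return bins


lengths = [5, 3, 4, 2, 6]
bins = greedy_knapsack_sketch(lengths, max_length=8)
# -> [[4], [0, 3], [2, 1]]  (bin totals 6, 7, 7)
```

Each placement costs O(log n) heap work, so the sort dominates at O(n log n) overall.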
- nemo_automodel.components.datasets.llm.neat_packing._build_packed_sample(
    samples: list[dict],
    pack_size: int,
    padding_idx: int,
  )
Concatenate multiple samples into a single packed sample.
- Parameters:
samples – List of sample dicts, each with input_ids and labels (already autoregressive-shifted by the dataset).
pack_size – Target packed sequence length (pad to this).
padding_idx – Token ID used for padding input_ids.
- Returns:
Dict with input_ids, labels, attention_mask, position_ids, all tensors of shape [pack_size].
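As a rough illustration of the packed layout, the sketch below concatenates samples using plain Python lists instead of tensors. The name `build_packed_sample_sketch` is hypothetical. The label padding value `IGNORE_IDX = -100` is an assumption (it is PyTorch's default cross-entropy `ignore_index`; the page does not show the value of `CROSS_ENTROPY_IGNORE_IDX`), as are the position ID of 0 for padding and the `ValueError` on overflow.

```python
IGNORE_IDX = -100  # assumption: PyTorch's default cross-entropy ignore_index


def build_packed_sample_sketch(
    samples: list[dict], pack_size: int, padding_idx: int
) -> dict:
    """Concatenate samples and pad every field to pack_size."""
    input_ids: list[int] = []
    labels: list[int] = []
    attention_mask: list[int] = []
    position_ids: list[int] = []

    for seq_index, sample in enumerate(samples, start=1):
        n = len(sample["input_ids"])
        input_ids.extend(sample["input_ids"])
        labels.extend(sample["labels"])
        attention_mask.extend([seq_index] * n)  # 1, 2, 3, ... per sub-sequence
        position_ids.extend(range(n))           # positions restart at 0

    pad = pack_size - len(input_ids)
    if pad < 0:
        raise ValueError("samples do not fit in pack_size")

    input_ids.extend([padding_idx] * pad)
    labels.extend([IGNORE_IDX] * pad)  # padding contributes no loss
    attention_mask.extend([0] * pad)   # 0 marks padding
    position_ids.extend([0] * pad)     # assumption: padding positions are 0

    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": attention_mask,
        "position_ids": position_ids,
    }


packed = build_packed_sample_sketch(
    [
        {"input_ids": [11, 12, 13], "labels": [12, 13, 14]},
        {"input_ids": [21, 22], "labels": [22, 23]},
    ],
    pack_size=6,
    padding_idx=0,
)
```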
- nemo_automodel.components.datasets.llm.neat_packing.neat_pack_dataset(
    dataset,
    split: str,
    pack_size: int,
    max_packs: int | None = None,
    padding_idx: int = 0,
    drop_long_samples: bool = False,
  )
Pack a dataset using greedy knapsack for better utilization.
- Parameters:
dataset – HuggingFace dataset or dataset dict.
split – Dataset split key (e.g. "train").
pack_size – Target packed sequence length.
max_packs – Optional cap on number of packs to create.
padding_idx – Token ID for padding.
drop_long_samples – If True, silently drop samples longer than pack_size; otherwise raise ValueError.
- Returns:
A HuggingFace Dataset of packed samples.
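The drop_long_samples behavior described in the parameters can be pictured with the small sketch below. The helper `select_packable` is hypothetical and only mirrors the documented policy (drop silently when the flag is True, otherwise raise ValueError); it is not the function's actual implementation.

```python
def select_packable(
    lengths: list[int], pack_size: int, drop_long_samples: bool = False
) -> list[int]:
    """Return indices of samples that fit into a pack of size pack_size."""
    keep: list[int] = []
    for idx, length in enumerate(lengths):
        if length <= pack_size:
            keep.append(idx)
        elif not drop_long_samples:
            # Documented default: an overlong sample is an error.
            raise ValueError(
                f"sample {idx} has length {length}, longer than pack_size={pack_size}"
            )
        # Otherwise the overlong sample is skipped without a message.
    return keep


kept = select_packable([3, 9, 5], pack_size=8, drop_long_samples=True)
```

With the default `drop_long_samples=False`, the same input would raise `ValueError` because the second sample (length 9) exceeds the pack size of 8.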