nemo_automodel.components.datasets.llm.neat_packing#

Neat packing (greedy knapsack) for LLM datasets.

This module provides an alternative packing strategy that uses a greedy knapsack algorithm (min-heap) for better bin-packing utilization compared to the sequential first-fit approach in packed_sequence.py.

The packed output uses an indexed attention mask (1, 2, 3, … per sub-sequence, 0 for padding) and reset position IDs so that it works with eager / SDPA attention backends, with no Transformer Engine dependency.
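To make the indexed mask concrete, here is a minimal sketch of how such a mask could be expanded into a per-token "may attend" matrix. The helper name `expand_indexed_mask` is hypothetical and not part of this module. The sketch assumes causal attention within each sub-sequence and a position ID of 0 on padding slots, neither of which is stated above.

```python
def expand_indexed_mask(indexed_mask: list[int]) -> list[list[bool]]:
    """Expand an indexed mask into a square boolean matrix (hypothetical helper).

    Query q may attend to key k only if both tokens carry the same non-zero
    sub-sequence index and k is not in the future (causal, an assumption).
    """
    n = len(indexed_mask)
    return [
        [
            indexed_mask[q] != 0
            and indexed_mask[q] == indexed_mask[k]
            and k <= q
            for k in range(n)
        ]
        for q in range(n)
    ]


# Two sub-sequences (lengths 3 and 2) packed into 7 slots, then 2 padding slots.
indexed_mask = [1, 1, 1, 2, 2, 0, 0]
# Position IDs restart at 0 for each sub-sequence; 0 on padding is an assumption.
position_ids = [0, 1, 2, 0, 1, 0, 0]

allowed = expand_indexed_mask(indexed_mask)
```

In this example, token 3 (the first token of the second sub-sequence) cannot attend to token 2 (the last token of the first), so packed samples do not leak context into each other.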

Module Contents#

Functions#

greedy_knapsack

Bin-pack sample indices using a greedy knapsack (min-heap) algorithm.

_build_packed_sample

Concatenate multiple samples into a single packed sample.

neat_pack_dataset

Pack a dataset using greedy knapsack for better utilization.

Data#

API#

nemo_automodel.components.datasets.llm.neat_packing.logger#

‘getLogger(…)’

nemo_automodel.components.datasets.llm.neat_packing.CROSS_ENTROPY_IGNORE_IDX#

None

nemo_automodel.components.datasets.llm.neat_packing.greedy_knapsack(
lengths: list[int],
max_length: int,
) → list[list[int]]#

Bin-pack sample indices using a greedy knapsack (min-heap) algorithm.

Samples are sorted by length in descending order. Each sample is assigned to the bin with the smallest current total that can still accommodate it. If no bin fits, a new bin is created.

Parameters:
  • lengths – Length of each sample.

  • max_length – Maximum capacity of each bin.

Returns:

A list of bins, where each bin is a list of sample indices.
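The procedure described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the module's actual source: the name `greedy_knapsack_sketch` is hypothetical, and raising `ValueError` for a sample longer than `max_length` is an assumption of the sketch.

```python
import heapq


def greedy_knapsack_sketch(lengths: list[int], max_length: int) -> list[list[int]]:
    """Greedy bin-packing with a min-heap keyed on each bin's current total."""
    # Visit sample indices from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

    bins: list[list[int]] = []
    heap: list[tuple[int, int]] = []  # entries are (current total, bin id)

    for idx in order:
        length = lengths[idx]
        if length > max_length:
            # Assumption of this sketch: oversized samples are rejected here.
            raise ValueError(f"sample {idx} has length {length} > {max_length}")

        # The heap root is the least-full bin. If the sample does not fit
        # there, it cannot fit in any fuller bin, so a new bin is opened.
        if heap and heap[0][0] + length <= max_length:
            total, bin_id = heapq.heappop(heap)
            bins[bin_id].append(idx)
            heapq.heappush(heap, (total + length, bin_id))
        else:
            bins.append([idx])
            heapq.heappush(heap, (length, len(bins) - 1))

    return bins


bins = greedy_knapsack_sketch([5, 3, 4, 2, 6], max_length=8)
# bins == [[4], [0, 3], [2, 1]], with totals 6, 7, and 7
```

Checking only the heap root keeps each assignment at O(log n), at the cost of sometimes opening a new bin when a tighter search strategy might have found a fit elsewhere.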

nemo_automodel.components.datasets.llm.neat_packing._build_packed_sample(
samples: list[dict],
pack_size: int,
padding_idx: int,
) → dict#

Concatenate multiple samples into a single packed sample.

Parameters:
  • samples – List of sample dicts, each with input_ids and labels (already autoregressive-shifted by the dataset).

  • pack_size – Target packed sequence length (pad to this).

  • padding_idx – Token ID used for padding input_ids.

Returns:

Dict with input_ids, labels, attention_mask, position_ids, all tensors of shape [pack_size].
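As a rough illustration of the output layout, the sketch below builds the four fields with plain Python lists instead of tensors. The function name `build_packed_sample_sketch` is hypothetical. Padding labels with -100 and padding position IDs with 0 are assumptions: this page does not show the value of `CROSS_ENTROPY_IGNORE_IDX` or the padding position ID.

```python
IGNORE_IDX = -100  # assumption: stand-in for CROSS_ENTROPY_IGNORE_IDX


def build_packed_sample_sketch(
    samples: list[dict], pack_size: int, padding_idx: int
) -> dict:
    """Concatenate samples and pad every field to pack_size (lists, not tensors)."""
    input_ids: list[int] = []
    labels: list[int] = []
    attention_mask: list[int] = []
    position_ids: list[int] = []

    for seq_id, sample in enumerate(samples, start=1):
        n = len(sample["input_ids"])
        input_ids.extend(sample["input_ids"])
        labels.extend(sample["labels"])
        attention_mask.extend([seq_id] * n)  # 1, 2, 3, ... per sub-sequence
        position_ids.extend(range(n))        # positions restart at 0

    pad = pack_size - len(input_ids)
    if pad < 0:
        raise ValueError("samples do not fit in pack_size")

    input_ids.extend([padding_idx] * pad)
    labels.extend([IGNORE_IDX] * pad)   # padding contributes no loss (assumption)
    attention_mask.extend([0] * pad)    # 0 marks padding
    position_ids.extend([0] * pad)      # assumption: padding positions are 0

    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": attention_mask,
        "position_ids": position_ids,
    }


packed = build_packed_sample_sketch(
    [
        {"input_ids": [11, 12, 13], "labels": [12, 13, IGNORE_IDX]},
        {"input_ids": [21, 22], "labels": [22, IGNORE_IDX]},
    ],
    pack_size=7,
    padding_idx=0,
)
```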

nemo_automodel.components.datasets.llm.neat_packing.neat_pack_dataset(
dataset,
split: str,
pack_size: int,
max_packs: int | None = None,
padding_idx: int = 0,
drop_long_samples: bool = False,
) → datasets.Dataset#

Pack a dataset using greedy knapsack for better utilization.

Parameters:
  • dataset – HuggingFace dataset or dataset dict.

  • split – Dataset split key (e.g. "train").

  • pack_size – Target packed sequence length.

  • max_packs – Optional cap on number of packs to create.

  • padding_idx – Token ID for padding.

  • drop_long_samples – If True, silently drop samples longer than pack_size; otherwise raise ValueError.

Returns:

A HuggingFace Dataset of packed samples.
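The `drop_long_samples` behaviour described above can be illustrated with a small standalone filter. The helper `select_packable_indices` is hypothetical and only mirrors the documented contract (drop silently when True, raise `ValueError` otherwise). It makes no claim about how the library implements the check.

```python
def select_packable_indices(
    lengths: list[int], pack_size: int, drop_long_samples: bool = False
) -> list[int]:
    """Return indices of samples that fit in one pack (hypothetical helper)."""
    kept: list[int] = []
    for idx, length in enumerate(lengths):
        if length > pack_size:
            if drop_long_samples:
                continue  # silently skip the oversized sample
            raise ValueError(
                f"sample {idx} has length {length}, which exceeds pack_size {pack_size}"
            )
        kept.append(idx)
    return kept


kept = select_packable_indices([3, 9, 5], pack_size=8, drop_long_samples=True)
# kept == [0, 2]; with drop_long_samples=False the same call raises ValueError
```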