
# nemo_automodel.components.datasets.llm.neat_packing

Neat packing (greedy knapsack) for LLM datasets.

This module provides an alternative packing strategy that uses a greedy
knapsack algorithm (min-heap) for better bin-packing utilization compared
to the sequential first-fit approach in `packed_sequence.py`.

The packed output uses an **indexed attention mask** (1, 2, 3, ... per
sub-sequence, 0 for padding) and **reset position IDs**, so it works with
eager / SDPA attention backends and does not depend on Transformer Engine.
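As a rough illustration of the layout described above, the following sketch builds an indexed attention mask and reset position IDs from a list of sub-sequence lengths. The helper name `build_mask_and_positions` is hypothetical and not part of this module, and the choice of `0` as the position ID for padding is an assumption.

```python
def build_mask_and_positions(lengths, pack_size):
    """Hypothetical helper: return (attention_mask, position_ids) for one pack."""
    if sum(lengths) > pack_size:
        raise ValueError("sub-sequences do not fit in pack_size")
    attention_mask, position_ids = [], []
    # Sub-sequences are numbered from 1; 0 is reserved for padding.
    for index, length in enumerate(lengths, start=1):
        attention_mask.extend([index] * length)
        # Position IDs restart at 0 for every sub-sequence.
        position_ids.extend(range(length))
    pad = pack_size - len(attention_mask)
    attention_mask.extend([0] * pad)
    # Assumption: padding positions are filled with 0.
    position_ids.extend([0] * pad)
    return attention_mask, position_ids


mask, pos = build_mask_and_positions([3, 2], pack_size=8)
print(mask)  # [1, 1, 1, 2, 2, 0, 0, 0]
print(pos)   # [0, 1, 2, 0, 1, 0, 0, 0]
```

Tokens that share a non-zero mask index belong to the same sub-sequence, which is the information an attention backend needs to keep packed samples from attending to each other.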

## Module Contents

### Functions

| Name                                                                                                | Description                                                           |
| --------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| [`_build_packed_sample`](#nemo_automodel-components-datasets-llm-neat_packing-_build_packed_sample) | Concatenate multiple samples into a single packed sample.             |
| [`greedy_knapsack`](#nemo_automodel-components-datasets-llm-neat_packing-greedy_knapsack)           | Bin-pack sample indices using a greedy knapsack (min-heap) algorithm. |
| [`neat_pack_dataset`](#nemo_automodel-components-datasets-llm-neat_packing-neat_pack_dataset)       | Pack a dataset using greedy knapsack for better utilization.          |

### Data

[`CROSS_ENTROPY_IGNORE_IDX`](#nemo_automodel-components-datasets-llm-neat_packing-CROSS_ENTROPY_IGNORE_IDX)

[`logger`](#nemo_automodel-components-datasets-llm-neat_packing-logger)

### API

```python
nemo_automodel.components.datasets.llm.neat_packing._build_packed_sample(
    samples: list[dict],
    pack_size: int,
    padding_idx: int
) -> dict
```

Concatenate multiple samples into a single packed sample.

**Parameters:**

`samples`: List of sample dicts, each with `input_ids` and `labels`
(already autoregressive-shifted by the dataset).

`pack_size`: Target packed sequence length; the packed sample is padded to this length.

`padding_idx`: Token ID used for padding `input_ids`.

**Returns:** `dict`

Dict with `input_ids`, `labels`, `attention_mask`,
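A minimal sketch of the concatenation step, assuming that padded label positions are filled with `CROSS_ENTROPY_IGNORE_IDX` and that an over-full pack is an error. The function name `build_packed_sample_sketch` is hypothetical, and only the three keys named above are produced; the actual implementation may differ.

```python
CROSS_ENTROPY_IGNORE_IDX = -100  # same value as the module-level constant


def build_packed_sample_sketch(samples, pack_size, padding_idx):
    """Hypothetical re-implementation of the concatenate-and-pad step."""
    input_ids, labels, attention_mask = [], [], []
    for index, sample in enumerate(samples, start=1):
        input_ids.extend(sample["input_ids"])
        labels.extend(sample["labels"])
        # Indexed mask: every token of the k-th sample gets the value k.
        attention_mask.extend([index] * len(sample["input_ids"]))
    pad = pack_size - len(input_ids)
    if pad < 0:
        # Assumption: samples that exceed pack_size are rejected here.
        raise ValueError("samples exceed pack_size")
    input_ids.extend([padding_idx] * pad)
    # Assumption: padded labels are ignored by the loss.
    labels.extend([CROSS_ENTROPY_IGNORE_IDX] * pad)
    attention_mask.extend([0] * pad)
    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": attention_mask,
    }


packed = build_packed_sample_sketch(
    [
        {"input_ids": [11, 12, 13], "labels": [12, 13, -100]},
        {"input_ids": [21, 22], "labels": [22, -100]},
    ],
    pack_size=6,
    padding_idx=0,
)
print(packed["input_ids"])       # [11, 12, 13, 21, 22, 0]
print(packed["labels"])          # [12, 13, -100, 22, -100, -100]
print(packed["attention_mask"])  # [1, 1, 1, 2, 2, 0]
```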

```python
nemo_automodel.components.datasets.llm.neat_packing.greedy_knapsack(
    lengths: list[int],
    max_length: int
) -> list[list[int]]
```

Bin-pack sample indices using a greedy knapsack (min-heap) algorithm.

Samples are sorted by length in descending order.  Each sample is
assigned to the bin with the smallest current total that can still
accommodate it.  If no bin fits, a new bin is created.

**Parameters:**

`lengths`: Length of each sample.

`max_length`: Maximum capacity of each bin.

**Returns:** `list[list[int]]`

A list of bins, where each bin is a list of sample indices.
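The procedure described above can be sketched with Python's `heapq` as follows. This is an illustrative re-implementation under stated assumptions, not the module's source: the name `greedy_knapsack_sketch` is hypothetical, and raising `ValueError` for a sample longer than `max_length` is an assumption. Because the heap root is always the least-full bin, a sample that does not fit there cannot fit in any other bin, so a single comparison against the root decides whether a new bin is needed.

```python
import heapq


def greedy_knapsack_sketch(lengths, max_length):
    """Hypothetical sketch: return bins of sample indices, each bin <= max_length."""
    # Visit sample indices from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    heap = []  # min-heap of (current_total, bin_id)
    bins = []
    for i in order:
        length = lengths[i]
        if length > max_length:
            # Assumption: over-long samples are an error in this sketch.
            raise ValueError(f"sample {i} has length {length} > {max_length}")
        if heap and heap[0][0] + length <= max_length:
            # The least-full bin can take the sample.
            total, bin_id = heapq.heappop(heap)
            bins[bin_id].append(i)
            heapq.heappush(heap, (total + length, bin_id))
        else:
            # No existing bin has room, so open a new one.
            bins.append([i])
            heapq.heappush(heap, (length, len(bins) - 1))
    return bins


print(greedy_knapsack_sketch([5, 3, 4, 2, 6], max_length=8))
# [[4], [0, 3], [2, 1]]
```

In this example the bin totals are 6, 7, and 7, so 20 tokens are placed in three bins of capacity 8.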

```python
nemo_automodel.components.datasets.llm.neat_packing.neat_pack_dataset(
    dataset,
    split: str,
    pack_size: int,
    max_packs: int | None = None,
    padding_idx: int = 0,
    drop_long_samples: bool = False
) -> datasets.Dataset
```

Pack a dataset using greedy knapsack for better utilization.

**Parameters:**

`dataset`: HuggingFace dataset or dataset dict.

`split`: Dataset split key (e.g. `"train"`).

`pack_size`: Target packed sequence length.

`max_packs`: Optional cap on the number of packs to create.

`padding_idx`: Token ID for padding.

`drop_long_samples`: If `True`, silently drop samples longer than
`pack_size`; otherwise raise `ValueError`.

**Returns:** `datasets.Dataset`

A HuggingFace `Dataset` of packed samples.
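The `drop_long_samples` behavior documented above can be pictured with the sketch below. The helper `filter_long_samples` is hypothetical and operates on plain lengths rather than a dataset; where the real function performs this check is not stated here.

```python
def filter_long_samples(lengths, pack_size, drop_long_samples=False):
    """Hypothetical helper: return indices of samples that fit in pack_size."""
    kept = []
    for i, length in enumerate(lengths):
        if length > pack_size:
            if drop_long_samples:
                continue  # silently skip the over-long sample
            raise ValueError(
                f"sample {i} has length {length}, which exceeds pack_size={pack_size}"
            )
        kept.append(i)
    return kept


print(filter_long_samples([3, 10, 5], pack_size=8, drop_long_samples=True))
# [0, 2]
```

With `drop_long_samples=False`, the same input raises `ValueError` instead of returning a shortened list.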

```python
nemo_automodel.components.datasets.llm.neat_packing.CROSS_ENTROPY_IGNORE_IDX = -100
```

```python
nemo_automodel.components.datasets.llm.neat_packing.logger = logging.getLogger(__name__)
```