# nemo_automodel.components.datasets.vlm.neat_packing_vlm

Neat packing for VLM (vision-language model) pre-tokenized datasets.

Packing is split into two phases:

1. **Plan** (instant) — scan raw dataset for estimated token lengths,
   run `greedy_knapsack` to assign samples to bins.  No tokenization,
   no media loading.
2. **Materialize** (lazy, in `__getitem__`) — when the DataLoader
   requests pack *k*, load + tokenize + shift + concat the samples
   assigned to bin *k*.  Runs in DataLoader worker processes, fully
   parallel.

This keeps the packing setup O(N) and lightweight, while the expensive
tokenization + media loading is distributed across `num_workers`.
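
The plan/materialize split described above can be sketched in plain Python. This is a minimal illustration, not the module's implementation: `plan_bins`, `LazyPacks`, and the `tokenize` callback are hypothetical names, and the planner is a simple first-fit-decreasing pass over estimated lengths.

```python
from typing import Callable


def plan_bins(estimated_lengths: list[int], pack_size: int) -> list[list[int]]:
    """Phase 1 (plan): first-fit decreasing over estimated lengths only."""
    order = sorted(range(len(estimated_lengths)), key=lambda i: -estimated_lengths[i])
    bins: list[list[int]] = []
    remaining: list[int] = []  # free capacity left in each bin
    for idx in order:
        length = estimated_lengths[idx]
        for b, free in enumerate(remaining):
            if length <= free:
                bins[b].append(idx)
                remaining[b] -= length
                break
        else:
            # No existing bin has room: open a new one.
            bins.append([idx])
            remaining.append(pack_size - length)
    return bins


class LazyPacks:
    """Phase 2 (materialize): tokenize only when a pack is requested."""

    def __init__(self, bins: list[list[int]], tokenize: Callable[[int], list[int]]):
        self.bins = bins          # only index lists are stored up front
        self.tokenize = tokenize  # hypothetical per-sample tokenizer

    def __len__(self) -> int:
        return len(self.bins)

    def __getitem__(self, pack_idx: int) -> list[int]:
        packed: list[int] = []
        for sample_idx in self.bins[pack_idx]:
            packed.extend(self.tokenize(sample_idx))
        return packed
```

Because `__getitem__` does the expensive work, wrapping such a dataset in a `DataLoader` with `num_workers > 0` spreads tokenization across worker processes, while construction only touches integer lengths.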

## Module Contents

### Classes

| Name                                                                                                    | Description                                                |
| ------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------- |
| [`PackedDatasetWrapper`](#nemo_automodel-components-datasets-vlm-neat_packing_vlm-PackedDatasetWrapper) | A Dataset that materializes packs lazily in `__getitem__`. |

### Functions

| Name                                                                                                                  | Description                                                             |
| --------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- |
| [`_build_packed_vlm_sample`](#nemo_automodel-components-datasets-vlm-neat_packing_vlm-_build_packed_vlm_sample)       | Concatenate multiple shifted VLM samples into one packed sample.        |
| [`_compute_mrope_position_ids`](#nemo_automodel-components-datasets-vlm-neat_packing_vlm-_compute_mrope_position_ids) | Compute mRoPE 3D position IDs for a single sample.                      |
| [`_estimate_image_tokens`](#nemo_automodel-components-datasets-vlm-neat_packing_vlm-_estimate_image_tokens)           | Estimate token count for one image from its `[height, width]` metadata. |
| [`_estimate_sample_length`](#nemo_automodel-components-datasets-vlm-neat_packing_vlm-_estimate_sample_length)         | Estimate token count from raw conversation without tokenization.        |
| [`_estimate_video_tokens`](#nemo_automodel-components-datasets-vlm-neat_packing_vlm-_estimate_video_tokens)           | Estimate token count for one video from its `[total_frames, height, width, fps, duration]` metadata. |
| [`_shift_sample`](#nemo_automodel-components-datasets-vlm-neat_packing_vlm-_shift_sample)                             | Apply per-sample autoregressive shift before concatenation.             |
| [`greedy_knapsack_vt_balanced`](#nemo_automodel-components-datasets-vlm-neat_packing_vlm-greedy_knapsack_vt_balanced) | Pack samples with standard FFD, then interleave bins by VT for balance. |
| [`neat_pack_dataset_vlm`](#nemo_automodel-components-datasets-vlm-neat_packing_vlm-neat_pack_dataset_vlm)             | Create a lazily-packed VLM dataset.                                     |

### Data

[`MEDIA_KEYS`](#nemo_automodel-components-datasets-vlm-neat_packing_vlm-MEDIA_KEYS)

[`logger`](#nemo_automodel-components-datasets-vlm-neat_packing_vlm-logger)

### API

```python
class nemo_automodel.components.datasets.vlm.neat_packing_vlm.PackedDatasetWrapper(
    inner_dataset,
    bins: list[list[int]],
    pack_size: int,
    padding_idx: int = 0,
    get_rope_index: typing.Callable | None = None,
    max_retries: int = 10
)
```

**Bases:** `Dataset`

A Dataset that materializes packs lazily in `__getitem__`.

The constructor only stores bin assignments (which sample indices go
into each pack).  The actual tokenization, media loading, shift, and
concatenation happen when a pack is requested — inside DataLoader
worker processes, fully parallel.

**Parameters:**

**inner_dataset:** The `PreTokenizedDatasetWrapper` that tokenizes
individual samples.

**bins:** List of bins from `greedy_knapsack`, where each bin is a
list of sample indices into `inner_dataset`.

**pack_size:** Target packed sequence length (after shift).

**padding_idx:** Token ID for padding.

**get_rope_index:** Optional `model.get_rope_index` for mRoPE.

**max_retries:** Max retries when a sample fails to tokenize.

```python
nemo_automodel.components.datasets.vlm.neat_packing_vlm.PackedDatasetWrapper.__getitem__(
    pack_idx: int
) -> dict
```

Materialize one pack: tokenize + shift + concat all samples in the bin.

```python
nemo_automodel.components.datasets.vlm.neat_packing_vlm.PackedDatasetWrapper.__len__()
```

```python
nemo_automodel.components.datasets.vlm.neat_packing_vlm.PackedDatasetWrapper.robust_collate(
    collate_fn
)
```

Wrap `collate_fn` with retry logic, delegating to the inner dataset.

```python
nemo_automodel.components.datasets.vlm.neat_packing_vlm._build_packed_vlm_sample(
    samples: list[dict],
    pack_size: int,
    padding_idx: int,
    has_mrope: bool = False
) -> dict
```

Concatenate multiple shifted VLM samples into one packed sample.
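
As a rough illustration of what concatenating shifted samples into a fixed-size pack can look like, the sketch below uses plain lists. It is not the module's code: the function name `pack_shifted_samples`, the `-100` ignore label, and the per-sample index stored in `attention_mask` (1, 2, 3, ... with 0 for padding) are all assumptions, and media tensors and mRoPE position IDs are left out.

```python
IGNORE_INDEX = -100  # assumption: label value ignored by the loss


def pack_shifted_samples(samples: list[dict], pack_size: int, padding_idx: int) -> dict:
    """Concatenate already-shifted samples and pad to pack_size."""
    input_ids: list[int] = []
    labels: list[int] = []
    attention_mask: list[int] = []
    for seq_no, sample in enumerate(samples, start=1):
        n = len(sample["input_ids"])
        input_ids.extend(sample["input_ids"])
        labels.extend(sample["labels"])
        # Assumption: tag each token with its sample number so attention
        # can later be restricted to tokens of the same sample.
        attention_mask.extend([seq_no] * n)

    if len(input_ids) > pack_size:
        raise ValueError(f"pack has {len(input_ids)} tokens, limit is {pack_size}")

    pad = pack_size - len(input_ids)
    return {
        "input_ids": input_ids + [padding_idx] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }
```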

```python
nemo_automodel.components.datasets.vlm.neat_packing_vlm._compute_mrope_position_ids(
    sample: dict,
    get_rope_index: typing.Callable
) -> torch.Tensor | None
```

Compute mRoPE 3D position IDs for a single sample.

Returns `[3, seq_len]` or `None` if not applicable.

```python
nemo_automodel.components.datasets.vlm.neat_packing_vlm._estimate_image_tokens(
    img_meta,
    image_cfg: dict
) -> int
```

Estimate token count for one image from its `[height, width]` metadata.

```python
nemo_automodel.components.datasets.vlm.neat_packing_vlm._estimate_sample_length(
    example: dict,
    image_cfg: dict | None = None,
    video_cfg: dict | None = None,
    return_media_tokens: bool = False
) -> int | tuple[int, int]
```

Estimate token count from raw conversation without tokenization.

Uses pre-computed `_text_tokens` (from `precompute_tokens.py`) when
available, otherwise falls back to `chars // 3`.  Media tokens are
estimated via `smart_resize` when processor configs are provided,
otherwise falls back to 500 per media item.

**Parameters:**

**return_media_tokens:** If True, return `(total_tokens, media_tokens)`
instead of just `total_tokens`.
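
The fallback heuristics described above (`chars // 3` for text, 500 tokens per media item) can be sketched as follows. The function `estimate_length` and its flat arguments are hypothetical simplifications; the real function takes a raw conversation dict and can use `smart_resize` when processor configs are available, which this sketch does not model.

```python
from typing import Optional, Union

DEFAULT_MEDIA_TOKENS = 500  # fallback per media item, as stated above


def estimate_length(
    text: str,
    num_media: int,
    text_tokens: Optional[int] = None,
    return_media_tokens: bool = False,
) -> Union[int, tuple[int, int]]:
    """Estimate total tokens without running a tokenizer."""
    # Prefer a precomputed text token count; otherwise use chars // 3.
    text_est = text_tokens if text_tokens is not None else len(text) // 3
    media_est = DEFAULT_MEDIA_TOKENS * num_media
    total = text_est + media_est
    return (total, media_est) if return_media_tokens else total
```

The estimate is deliberately cheap: it only needs string lengths and media counts, so it can run over the whole raw dataset during the planning phase.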

```python
nemo_automodel.components.datasets.vlm.neat_packing_vlm._estimate_video_tokens(
    vid_meta,
    video_cfg: dict
) -> int
```

Estimate token count for one video from its
`[total_frames, height, width, fps, duration]` metadata.

```python
nemo_automodel.components.datasets.vlm.neat_packing_vlm._shift_sample(
    sample: dict,
    has_mrope: bool = False
) -> dict
```

Apply per-sample autoregressive shift before concatenation.
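
Shifting each sample before concatenation keeps the last token of one sample from being trained to predict the first token of the next. The sketch below shows that idea on plain lists; `shift_sample` is a hypothetical stand-in, and it assumes `input_ids` and `labels` have equal length and ignores mRoPE position IDs and media fields.

```python
def shift_sample(sample: dict) -> dict:
    """Drop the last input token and the first label so that
    labels[t] is the token that follows input_ids[t]."""
    return {
        "input_ids": sample["input_ids"][:-1],
        "labels": sample["labels"][1:],
    }
```

After this shift a sample of length `n` contributes `n - 1` positions to the pack, which is consistent with `pack_size` being described as the packed length "after shift".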

```python
nemo_automodel.components.datasets.vlm.neat_packing_vlm.greedy_knapsack_vt_balanced(
    lengths: list[int],
    max_length: int,
    visual_tokens: list[int]
) -> list[list[int]]
```

Pack samples with standard FFD, then interleave bins by VT for balance.

Uses the standard greedy knapsack (first-fit decreasing, FFD) for packing
efficiency, then reorders bins so that consecutive packs have similar
visual token (VT) counts.  This ensures data-parallel ranks in the same
training step process packs with comparable ViT workload, reducing
straggler effects.

**Parameters:**

**lengths:** Total token length (text + media) per sample.

**max_length:** Maximum capacity per pack.

**visual_tokens:** Number of media tokens per sample.

**Returns:** `list[list[int]]`

A list of bins, where each bin is a list of sample indices.
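
A minimal sketch of the two steps, under stated assumptions: the packing step is a plain first-fit-decreasing pass, and the reordering step simply sorts bins by their summed visual tokens so that neighbouring packs are similar. The actual interleaving strategy in the library may differ, and `ffd_then_order_by_vt` is a hypothetical name.

```python
def ffd_then_order_by_vt(
    lengths: list[int], max_length: int, visual_tokens: list[int]
) -> list[list[int]]:
    """Pack with first-fit decreasing, then order bins by visual-token load."""
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    bins: list[list[int]] = []
    remaining: list[int] = []  # free capacity per bin
    for idx in order:
        for b, free in enumerate(remaining):
            if lengths[idx] <= free:
                bins[b].append(idx)
                remaining[b] -= lengths[idx]
                break
        else:
            bins.append([idx])
            remaining.append(max_length - lengths[idx])

    # Assumption: sorting by total VT places packs with similar vision
    # workload next to each other, so ranks that take consecutive packs
    # in one step see comparable load.
    return sorted(bins, key=lambda bin_: -sum(visual_tokens[i] for i in bin_))
```

Reordering whole bins leaves their contents untouched, so the packing efficiency of the first step is preserved.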

```python
nemo_automodel.components.datasets.vlm.neat_packing_vlm.neat_pack_dataset_vlm(
    dataset,
    pack_size: int,
    padding_idx: int = 0,
    drop_long_samples: bool = False,
    max_packs: int | None = None,
    get_rope_index: typing.Callable | None = None,
    ds_raw = None,
    packing_ratio: float = 1.0,
    processor = None,
    balance_media_tokens: bool = True
) -> nemo_automodel.components.datasets.vlm.neat_packing_vlm.PackedDatasetWrapper
```

Create a lazily-packed VLM dataset.

1. Estimates token lengths from `ds_raw` (no tokenization).
2. Runs knapsack to assign samples to bins.  When
   `balance_media_tokens=True` (default), uses a two-phase
   algorithm that balances visual token counts across packs,
   reducing ViT compute/memory imbalance and straggler effects.
3. Returns a `PackedDatasetWrapper` whose `__getitem__` tokenizes
   and builds packs on-the-fly in DataLoader workers.

**Parameters:**

**dataset:** `PreTokenizedDatasetWrapper` for per-sample tokenization.

**pack_size:** Target packed sequence length (after shift).

**padding_idx:** Token ID for padding.

**drop_long_samples:** Drop samples whose estimated length exceeds
`pack_size`.

**max_packs:** Optional cap on number of packs.

**get_rope_index:** Optional `model.get_rope_index` for mRoPE.

**ds_raw:** Raw dataset (conversations) for fast length estimation.
Falls back to `len(dataset)` if not provided.

**packing_ratio:** Fill ratio for knapsack bins (default 1.0).
E.g. `0.9` means knapsack only fills bins to `pack_size * 0.9`,
leaving 10% headroom to absorb estimation errors.  This reduces
overflow drops at `__getitem__` time.  The actual `pack_size`
is still used as the hard limit.

**processor:** Optional HuggingFace processor (e.g. `Qwen2VLProcessor`).
Used to extract `image_processor` / `video_processor` configs
for accurate media token estimation via `smart_resize`.

**balance_media_tokens:** If True (default), use VT-balanced knapsack
that distributes visual tokens evenly across packs.  Falls back
to standard knapsack if no media tokens are detected.

**Returns:** `PackedDatasetWrapper`

A `PackedDatasetWrapper` (torch Dataset).
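
The effect of `packing_ratio` can be seen with a little arithmetic. The sketch below is illustrative only: `plan_capacity` is a hypothetical helper, and truncating with `int()` is an assumption about rounding.

```python
def plan_capacity(pack_size: int, packing_ratio: float = 1.0) -> int:
    """Capacity the planner fills to; pack_size stays the hard limit."""
    return int(pack_size * packing_ratio)


pack_size = 4096
capacity = plan_capacity(pack_size, packing_ratio=0.9)

# Two samples whose estimates fit the reduced planning capacity...
estimated = [1800, 1800]
fits_plan = sum(estimated) <= capacity

# ...and whose true tokenized lengths run higher than estimated,
# yet still fit under the real pack_size thanks to the headroom.
actual = [1950, 2000]
fits_pack = sum(actual) <= pack_size
```

With `packing_ratio=1.0` the same two samples could be planned alongside further samples up to the full 4096 tokens, and any underestimate would then push the materialized pack over the hard limit.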

```python
nemo_automodel.components.datasets.vlm.neat_packing_vlm.MEDIA_KEYS = ('pixel_values', 'image_grid_thw', 'image_position_ids', 'pixel_values_videos', ...
```

```python
nemo_automodel.components.datasets.vlm.neat_packing_vlm.logger = logging.getLogger(__name__)
```