nemo_automodel.components.datasets.vlm.neat_packing_vlm

View as Markdown

Neat packing for VLM (vision-language model) pre-tokenized datasets.

Packing is split into two phases:

  1. Plan (instant) — scan raw dataset for estimated token lengths, run greedy_knapsack to assign samples to bins. No tokenization, no media loading.
  2. Materialize (lazy, in __getitem__) — when the DataLoader requests pack k, load + tokenize + shift + concat the samples assigned to bin k. Runs in DataLoader worker processes, fully parallel.

This keeps the packing setup O(N) and lightweight, while the expensive tokenization + media loading is distributed across num_workers.

Module Contents

Classes

NameDescription
PackedDatasetWrapperA Dataset that materializes packs lazily in __getitem__.

Functions

NameDescription
_build_packed_vlm_sampleConcatenate multiple shifted VLM samples into one packed sample.
_compute_mrope_position_idsCompute mRoPE 3D position IDs for a single sample.
_estimate_image_tokensEstimate token count for one image from its [height, width] metadata.
_estimate_sample_lengthEstimate token count from raw conversation without tokenization.
_estimate_video_tokensEstimate token count for one video from its
_shift_sampleApply per-sample autoregressive shift before concatenation.
greedy_knapsack_vt_balancedPack samples with standard FFD, then interleave bins by VT for balance.
neat_pack_dataset_vlmCreate a lazily-packed VLM dataset.

Data

MEDIA_KEYS

logger

API

class nemo_automodel.components.datasets.vlm.neat_packing_vlm.PackedDatasetWrapper(
inner_dataset,
bins: list[list[int]],
pack_size: int,
padding_idx: int = 0,
get_rope_index: typing.Callable | None = None,
max_retries: int = 10
)

Bases: Dataset

A Dataset that materializes packs lazily in __getitem__.

The constructor only stores bin assignments (which sample indices go into each pack). The actual tokenization, media loading, shift, and concatenation happen when a pack is requested — inside DataLoader worker processes, fully parallel.

Parameters:

inner_dataset

The PreTokenizedDatasetWrapper that tokenizes individual samples.

bins
list[list[int]]

List of bins from greedy_knapsack, where each bin is a list of sample indices into inner_dataset.

pack_size
int

Target packed sequence length (after shift).

padding_idx
intDefaults to 0

Token ID for padding.

get_rope_index
Callable | NoneDefaults to None

Optional model.get_rope_index for mRoPE.

max_retries
intDefaults to 10

Max retries when a sample fails to tokenize.

has_mrope
= get_rope_index is not None
nemo_automodel.components.datasets.vlm.neat_packing_vlm.PackedDatasetWrapper.__getitem__(
pack_idx: int
) -> dict

Materialize one pack: tokenize + shift + concat all samples in the bin.

nemo_automodel.components.datasets.vlm.neat_packing_vlm.PackedDatasetWrapper.__len__()
nemo_automodel.components.datasets.vlm.neat_packing_vlm.PackedDatasetWrapper.robust_collate(
collate_fn
)

Wrap collate_fn with retry logic, delegating to inner dataset.

nemo_automodel.components.datasets.vlm.neat_packing_vlm._build_packed_vlm_sample(
samples: list[dict],
pack_size: int,
padding_idx: int,
has_mrope: bool = False
) -> dict

Concatenate multiple shifted VLM samples into one packed sample.

nemo_automodel.components.datasets.vlm.neat_packing_vlm._compute_mrope_position_ids(
sample: dict,
get_rope_index: typing.Callable
) -> torch.Tensor | None

Compute mRoPE 3D position IDs for a single sample.

Returns [3, seq_len] or None if not applicable.

nemo_automodel.components.datasets.vlm.neat_packing_vlm._estimate_image_tokens(
img_meta,
image_cfg: dict
) -> int

Estimate token count for one image from its [height, width] metadata.

nemo_automodel.components.datasets.vlm.neat_packing_vlm._estimate_sample_length(
example: dict,
image_cfg: dict | None = None,
video_cfg: dict | None = None,
return_media_tokens: bool = False
) -> int | tuple[int, int]

Estimate token count from raw conversation without tokenization.

Uses pre-computed _text_tokens (from precompute_tokens.py) when available, otherwise falls back to chars // 3. Media tokens are estimated via smart_resize when processor configs are provided, otherwise falls back to 500 per media item.

Parameters:

return_media_tokens
boolDefaults to False

If True, return (total_tokens, media_tokens) instead of just total_tokens.

nemo_automodel.components.datasets.vlm.neat_packing_vlm._estimate_video_tokens(
vid_meta,
video_cfg: dict
) -> int

Estimate token count for one video from its [total_frames, height, width, fps, duration] metadata.

nemo_automodel.components.datasets.vlm.neat_packing_vlm._shift_sample(
sample: dict,
has_mrope: bool = False
) -> dict

Apply per-sample autoregressive shift before concatenation.

nemo_automodel.components.datasets.vlm.neat_packing_vlm.greedy_knapsack_vt_balanced(
lengths: list[int],
max_length: int,
visual_tokens: list[int]
) -> list[list[int]]

Pack samples with standard FFD, then interleave bins by VT for balance.

Uses the standard greedy knapsack (FFD) for optimal packing efficiency, then reorders bins so that consecutive packs have similar visual token counts. This ensures data-parallel ranks in the same training step process packs with comparable VIT workload, reducing straggler effects.

Parameters:

lengths
list[int]

Total token length (text + media) per sample.

max_length
int

Maximum capacity per pack.

visual_tokens
list[int]

Number of media tokens per sample.

Returns: list[list[int]]

A list of bins, where each bin is a list of sample indices.

nemo_automodel.components.datasets.vlm.neat_packing_vlm.neat_pack_dataset_vlm(
dataset,
pack_size: int,
padding_idx: int = 0,
drop_long_samples: bool = False,
max_packs: int | None = None,
get_rope_index: typing.Callable | None = None,
ds_raw = None,
packing_ratio: float = 1.0,
processor = None,
balance_media_tokens: bool = True
) -> nemo_automodel.components.datasets.vlm.neat_packing_vlm.PackedDatasetWrapper

Create a lazily-packed VLM dataset.

  1. Estimates token lengths from ds_raw (no tokenization).
  2. Runs knapsack to assign samples to bins. When balance_media_tokens=True (default), uses a two-phase algorithm that balances visual token counts across packs, reducing VIT compute/memory imbalance and straggler effects.
  3. Returns a PackedDatasetWrapper whose __getitem__ tokenizes and builds packs on-the-fly in DataLoader workers.

Parameters:

dataset

PreTokenizedDatasetWrapper for per-sample tokenization.

pack_size
int

Target packed sequence length (after shift).

padding_idx
intDefaults to 0

Token ID for padding.

drop_long_samples
boolDefaults to False

Drop samples whose estimated length exceeds pack_size.

max_packs
int | NoneDefaults to None

Optional cap on number of packs.

get_rope_index
Callable | NoneDefaults to None

Optional model.get_rope_index for mRoPE.

ds_raw
Defaults to None

Raw dataset (conversations) for fast length estimation. Falls back to len(dataset) if not provided.

packing_ratio
floatDefaults to 1.0

Fill ratio for knapsack bins (default 1.0). E.g. 0.9 means knapsack only fills bins to pack_size * 0.9, leaving 10% headroom to absorb estimation errors. This reduces overflow drops at __getitem__ time. The actual pack_size is still used as the hard limit.

processor
Defaults to None

Optional HuggingFace processor (e.g. Qwen2VLProcessor). Used to extract image_processor / video_processor configs for accurate media token estimation via smart_resize.

balance_media_tokens
boolDefaults to True

If True (default), use VT-balanced knapsack that distributes visual tokens evenly across packs. Falls back to standard knapsack if no media tokens are detected.

Returns: PackedDatasetWrapper

A PackedDatasetWrapper (torch Dataset).

nemo_automodel.components.datasets.vlm.neat_packing_vlm.MEDIA_KEYS = ('pixel_values', 'image_grid_thw', 'image_position_ids', 'pixel_values_videos', ...
nemo_automodel.components.datasets.vlm.neat_packing_vlm.logger = logging.getLogger(__name__)