nemo_automodel.components.datasets.vlm.neat_packing_vlm#
Neat packing for VLM (vision-language model) pre-tokenized datasets.
Packing is split into two phases:

1. Plan (instant): scan the raw dataset for estimated token lengths, then run greedy_knapsack to assign samples to bins. No tokenization, no media loading.
2. Materialize (lazy, in __getitem__): when the DataLoader requests pack k, load + tokenize + shift + concat the samples assigned to bin k. Runs in DataLoader worker processes, fully parallel.
This keeps the packing setup O(N) and lightweight, while the expensive
tokenization + media loading is distributed across num_workers.
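The two-phase split can be sketched as follows. This is a simplified illustration, not the module's actual implementation: the function names and the plain first-fit-decreasing loop are assumptions.

```python
def plan(estimated_lengths: list[int], pack_size: int) -> list[list[int]]:
    """Phase 1 (instant): assign sample indices to bins using only cheap
    length estimates; no tokenization, no media loading."""
    bins: list[list[int]] = []   # sample indices per pack
    loads: list[int] = []        # current token load per pack
    # First-fit decreasing: place the longest samples first.
    order = sorted(range(len(estimated_lengths)),
                   key=lambda i: estimated_lengths[i], reverse=True)
    for idx in order:
        length = estimated_lengths[idx]
        for b in range(len(bins)):
            if loads[b] + length <= pack_size:
                bins[b].append(idx)
                loads[b] += length
                break
        else:
            bins.append([idx])
            loads.append(length)
    return bins


def materialize(bin_indices: list[int], tokenize) -> list[dict]:
    """Phase 2 (lazy, per __getitem__): tokenize only the samples assigned
    to the requested pack; runs inside DataLoader worker processes."""
    return [tokenize(i) for i in bin_indices]
```

The planning phase touches only integer length estimates, so it stays O(N) regardless of how expensive tokenization and media decoding are.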
Module Contents#
Classes#
PackedDatasetWrapper – A Dataset that materializes packs lazily in __getitem__.
Functions#
greedy_knapsack_vt_balanced – Pack samples with standard FFD, then interleave bins by VT for balance.
_estimate_image_tokens – Estimate token count for one image from its [height, width] metadata.
_estimate_video_tokens – Estimate token count for one video from its [total_frames, height, width, fps, duration] metadata.
_estimate_sample_length – Estimate token count from raw conversation without tokenization.
_compute_mrope_position_ids – Compute mRoPE 3D position IDs for a single sample.
_shift_sample – Apply per-sample autoregressive shift before concatenation.
_build_packed_vlm_sample – Concatenate multiple shifted VLM samples into one packed sample.
neat_pack_dataset_vlm – Create a lazily-packed VLM dataset.
Data#
API#
- nemo_automodel.components.datasets.vlm.neat_packing_vlm.logger#
'getLogger(…)'
- nemo_automodel.components.datasets.vlm.neat_packing_vlm.MEDIA_KEYS#
('pixel_values', 'image_grid_thw', 'pixel_values_videos', 'video_grid_thw', 'second_per_grid_ts')
- nemo_automodel.components.datasets.vlm.neat_packing_vlm.greedy_knapsack_vt_balanced(
- lengths: list[int],
- max_length: int,
- visual_tokens: list[int],
- )#
Pack samples with standard FFD, then interleave bins by VT for balance.
Uses the standard greedy knapsack (FFD) for high packing efficiency, then reorders bins so that consecutive packs have similar visual token counts. This ensures data-parallel ranks in the same training step process packs with comparable VIT workload, reducing straggler effects.
- Parameters:
lengths – Total token length (text + media) per sample.
max_length – Maximum capacity per pack.
visual_tokens – Number of media tokens per sample.
- Returns:
A list of bins, where each bin is a list of sample indices.
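One plausible way to realize the reordering phase is shown below; the exact ordering the real function uses may differ, so treat this as a sketch of the idea.

```python
def reorder_bins_by_visual_tokens(bins: list[list[int]],
                                  visual_tokens: list[int]) -> list[list[int]]:
    """Reorder packed bins so consecutive packs carry similar visual-token
    (VT) loads. Sorting by per-bin VT total means adjacent packs, which land
    on data-parallel ranks in the same step, see comparable VIT workloads."""
    vt_per_bin = [sum(visual_tokens[i] for i in b) for b in bins]
    order = sorted(range(len(bins)), key=lambda b: vt_per_bin[b])
    return [bins[b] for b in order]
```

Note that reordering leaves each bin's contents, and therefore the packing efficiency of the FFD phase, untouched.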
- nemo_automodel.components.datasets.vlm.neat_packing_vlm._estimate_image_tokens(img_meta, image_cfg: dict) -> int#
Estimate token count for one image from its [height, width] metadata.
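Image token estimation in the smart_resize style can be sketched as follows. The patch_size, merge_size, and pixel-budget defaults are Qwen2-VL-style assumptions for illustration, not values read from this module's config.

```python
import math

def estimate_image_tokens(height: int, width: int, patch_size: int = 14,
                          merge_size: int = 2, min_pixels: int = 56 * 56,
                          max_pixels: int = 14 * 14 * 4 * 1280) -> int:
    """smart_resize-style estimate: round (height, width) to multiples of
    patch_size * merge_size, clamp the area into the pixel budget, then
    count merged patches."""
    factor = patch_size * merge_size
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Shrink to fit the area budget, keeping multiples of `factor`.
        beta = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / beta / factor) * factor
        w = math.floor(width / beta / factor) * factor
    elif h * w < min_pixels:
        # Grow to meet the minimum area.
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    # merge_size**2 raw patches collapse into one token.
    return (h // patch_size) * (w // patch_size) // (merge_size ** 2)
```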
- nemo_automodel.components.datasets.vlm.neat_packing_vlm._estimate_video_tokens(vid_meta, video_cfg: dict) -> int#
Estimate token count for one video from its [total_frames, height, width, fps, duration] metadata.
- nemo_automodel.components.datasets.vlm.neat_packing_vlm._estimate_sample_length(
- example: dict,
- image_cfg: dict | None = None,
- video_cfg: dict | None = None,
- return_media_tokens: bool = False,
- )#
Estimate token count from raw conversation without tokenization.
Uses pre-computed _text_tokens (from precompute_tokens.py) when available, otherwise falls back to chars // 3. Media tokens are estimated via smart_resize when processor configs are provided, otherwise fall back to 500 per media item.
- Parameters:
return_media_tokens – If True, return (total_tokens, media_tokens) instead of just total_tokens.
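The fallback path can be sketched as below. The field names ("conversations", "content", "images", "videos") are illustrative assumptions; only _text_tokens, the chars // 3 heuristic, and the 500-token media fallback come from the docstring above.

```python
def estimate_sample_length(example: dict, media_token_fallback: int = 500) -> int:
    """Fallback-style length estimate: use precomputed _text_tokens when
    present, else chars // 3; charge a flat per-item cost for media when no
    processor config is available. Field names are illustrative."""
    if "_text_tokens" in example:
        text_tokens = example["_text_tokens"]
    else:
        chars = sum(len(turn.get("content", ""))
                    for turn in example.get("conversations", []))
        text_tokens = chars // 3
    n_media = len(example.get("images", [])) + len(example.get("videos", []))
    return text_tokens + media_token_fallback * n_media
```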
- nemo_automodel.components.datasets.vlm.neat_packing_vlm._compute_mrope_position_ids(
- sample: dict,
- get_rope_index: Callable,
- )#
Compute mRoPE 3D position IDs for a single sample.
Returns [3, seq_len] or None if not applicable.
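For a text-only sample, the three mRoPE axes degenerate to ordinary 1D positions. The minimal sketch below shows only that case; real models compute distinct temporal, height, and width indices for vision tokens, which is what get_rope_index provides.

```python
def mrope_positions_text_only(seq_len: int) -> list[list[int]]:
    """[3, seq_len] position grid for a text-only sample: the temporal,
    height, and width axes all collapse to the usual 0..seq_len-1 range."""
    return [list(range(seq_len)) for _ in range(3)]
```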
- nemo_automodel.components.datasets.vlm.neat_packing_vlm._shift_sample(sample: dict, has_mrope: bool = False) dict#
Apply per-sample autoregressive shift before concatenation.
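The shift can be illustrated as below (a sketch; the real function also handles mRoPE position IDs via has_mrope). Shifting before concatenation guarantees that the last token of one sample never predicts the first token of the next sample in the pack.

```python
def shift_sample(sample: dict) -> dict:
    """Autoregressive shift applied per sample, before packing: drop the
    last input token and the first label so labels[i] is the target for
    input_ids[i], and no label crosses a pack boundary."""
    return {"input_ids": sample["input_ids"][:-1],
            "labels": sample["labels"][1:]}
```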
- nemo_automodel.components.datasets.vlm.neat_packing_vlm._build_packed_vlm_sample(
- samples: list[dict],
- pack_size: int,
- padding_idx: int,
- has_mrope: bool = False,
- )#
Concatenate multiple shifted VLM samples into one packed sample.
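A minimal sketch of the concatenate-and-pad step, for already-shifted text fields only: the ignore_index value of -100 is an assumption (the common HuggingFace/PyTorch cross-entropy convention), and media tensors and mRoPE handling are omitted.

```python
def build_packed_sample(samples: list[dict], pack_size: int,
                        padding_idx: int = 0, ignore_index: int = -100) -> dict:
    """Concatenate already-shifted samples, then right-pad to pack_size:
    padding_idx for input_ids, ignore_index for labels so the pad
    positions contribute no loss."""
    input_ids: list[int] = []
    labels: list[int] = []
    for s in samples:
        input_ids += s["input_ids"]
        labels += s["labels"]
    pad = pack_size - len(input_ids)
    assert pad >= 0, "bin overflowed pack_size"
    return {"input_ids": input_ids + [padding_idx] * pad,
            "labels": labels + [ignore_index] * pad}
```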
- class nemo_automodel.components.datasets.vlm.neat_packing_vlm.PackedDatasetWrapper(
- inner_dataset,
- bins: list[list[int]],
- pack_size: int,
- padding_idx: int = 0,
- get_rope_index: Callable | None = None,
- max_retries: int = 10,
- )#
Bases: torch.utils.data.Dataset

A Dataset that materializes packs lazily in __getitem__.

The constructor only stores bin assignments (which sample indices go into each pack). The actual tokenization, media loading, shift, and concatenation happen only when a pack is requested, inside DataLoader worker processes, fully in parallel.
- Parameters:
inner_dataset – The PreTokenizedDatasetWrapper that tokenizes individual samples.
bins – List of bins from greedy_knapsack, where each bin is a list of sample indices into inner_dataset.
pack_size – Target packed sequence length (after shift).
padding_idx – Token ID for padding.
get_rope_index – Optional model.get_rope_index for mRoPE.
max_retries – Max retries when a sample fails to tokenize.
Initialization
- __len__()#
- __getitem__(pack_idx: int) -> dict#
Materialize one pack: tokenize + shift + concat all samples in the bin.
- robust_collate(collate_fn)#
Wrap collate_fn with retry logic, delegating to inner dataset.
- nemo_automodel.components.datasets.vlm.neat_packing_vlm.neat_pack_dataset_vlm(
- dataset,
- pack_size: int,
- padding_idx: int = 0,
- drop_long_samples: bool = False,
- max_packs: int | None = None,
- get_rope_index: Callable | None = None,
- ds_raw=None,
- packing_ratio: float = 1.0,
- processor=None,
- balance_media_tokens: bool = True,
- )#
Create a lazily-packed VLM dataset.
1. Estimates token lengths from ds_raw (no tokenization).
2. Runs knapsack to assign samples to bins. When balance_media_tokens=True (default), uses a two-phase algorithm that balances visual token counts across packs, reducing VIT compute/memory imbalance and straggler effects.
3. Returns a PackedDatasetWrapper whose __getitem__ tokenizes and builds packs on-the-fly in DataLoader workers.
- Parameters:
dataset – PreTokenizedDatasetWrapper for per-sample tokenization.
pack_size – Target packed sequence length (after shift).
padding_idx – Token ID for padding.
drop_long_samples – Drop samples whose estimated length exceeds pack_size.
max_packs – Optional cap on number of packs.
get_rope_index – Optional model.get_rope_index for mRoPE.
ds_raw – Raw dataset (conversations) for fast length estimation. Falls back to len(dataset) if not provided.
packing_ratio – Fill ratio for knapsack bins (default 1.0). E.g. 0.9 means knapsack only fills bins to pack_size * 0.9, leaving 10% headroom to absorb estimation errors. This reduces overflow drops at __getitem__ time. The actual pack_size is still used as the hard limit.
processor – Optional HuggingFace processor (e.g. Qwen2VLProcessor). Used to extract image_processor / video_processor configs for accurate media token estimation via smart_resize.
balance_media_tokens – If True (default), use a VT-balanced knapsack that distributes visual tokens evenly across packs. Falls back to the standard knapsack if no media tokens are detected.
- Returns:
A PackedDatasetWrapper (torch Dataset).