nemo_automodel.components.datasets.vlm.samplers#

Module Contents#

Classes#

LengthGroupedSampler

Sampler that groups samples by total token count for balanced distributed training.

Functions#

_smart_resize_image

Compute the resized (height, width) for an image, matching transformers.models.qwen2_vl.image_processing_qwen2_vl.smart_resize.

_smart_resize_video

Compute the resized (height, width) for a video, matching transformers.models.qwen3_vl.video_processing_qwen3_vl.smart_resize.

Data#

API#

nemo_automodel.components.datasets.vlm.samplers.logger#

'getLogger(…)'

nemo_automodel.components.datasets.vlm.samplers._smart_resize_image(
height,
width,
factor=28,
min_pixels=56 * 56,
max_pixels=14 * 14 * 4 * 1280,
)#

Compute the resized (height, width) for an image, matching transformers.models.qwen2_vl.image_processing_qwen2_vl.smart_resize.
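The round-then-clamp behavior of this kind of smart resize can be sketched as follows (a simplified reimplementation for illustration, not this module's actual code; the real transformers implementation may differ in rounding details):

```python
import math

def smart_resize_sketch(height, width, factor=28,
                        min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    """Round (height, width) to multiples of `factor`, keeping the total
    pixel count within [min_pixels, max_pixels] while roughly preserving
    the aspect ratio."""
    # Round each side to the nearest multiple of the patch factor.
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Scale both sides down so the area fits under max_pixels.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Scale both sides up so the area reaches at least min_pixels.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar
```

The defaults mirror the signature above: `factor=28` matches the vision patch size, and the clamping bounds keep very small or very large images within the processor's supported token budget.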

nemo_automodel.components.datasets.vlm.samplers._smart_resize_video(
num_frames,
height,
width,
temporal_factor=2,
factor=32,
min_pixels=128 * 128,
max_pixels=16 * 16 * 2 * 2 * 2 * 6144,
)#

Compute the resized (height, width) for a video, matching transformers.models.qwen3_vl.video_processing_qwen3_vl.smart_resize.
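The video variant adds a temporal dimension; a hedged sketch of the analogous logic (illustrative only — the return shape and rounding of the real transformers helper may differ):

```python
import math

def smart_resize_video_sketch(num_frames, height, width, temporal_factor=2,
                              factor=32, min_pixels=128 * 128,
                              max_pixels=16 * 16 * 2 * 2 * 2 * 6144):
    """Illustrative video resize: frames padded up to a multiple of the
    temporal patch size, spatial sides rounded and clamped as for images."""
    # Round the frame count up to a multiple of the temporal patch size.
    t_bar = math.ceil(num_frames / temporal_factor) * temporal_factor
    # Spatial sides follow the same round-then-clamp scheme as the image path.
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return t_bar, h_bar, w_bar
```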

class nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler(
dataset,
seed=42,
processor=None,
max_length=None,
batch_size=1,
)#

Bases: torch.utils.data.Sampler

Sampler that groups samples by total token count for balanced distributed training.

With shard_data=True each rank owns a different subset of data. This sampler sorts every rank’s indices by total tokens (text_tokens + media_tokens, descending). All ranks share the same seed + epoch so position N on every rank corresponds to a sample of similar length, keeping cross-rank padding minimal.

Per-epoch randomness is achieved by rotating the sorted order by a deterministic random offset (same on every rank).
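The rotation scheme described above can be sketched as follows (illustrative names; the sampler's real implementation is not shown in this reference):

```python
import random

def epoch_order(sorted_indices, seed, epoch):
    """Rotate a length-sorted index list by a deterministic per-epoch offset.

    Every rank calls this with the same seed and epoch, so the offset is
    identical everywhere and position N on each rank still lines up with
    samples of similar length."""
    n = len(sorted_indices)
    # Same seed + epoch on every rank -> same offset on every rank.
    offset = random.Random(seed + epoch).randrange(n)
    return sorted_indices[offset:] + sorted_indices[:offset]
```

Because a rotation is a permutation, every sample is still visited exactly once per epoch, while the descending-length ordering (and therefore the cross-rank length alignment) is preserved.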

Parameters:
  • dataset – The dataset to sample from.

  • seed – Base random seed (same value on every rank).

  • processor – Optional HuggingFace processor (e.g. Qwen2VLProcessor). Used to read image_processor / video_processor attributes for accurate media token estimation via smart_resize.

Initialization

static _get_raw_samples(dataset)#

Unwrap dataset wrappers to get the underlying list for direct access.

_compute_or_load_lengths(dataset)#

Compute token lengths with direct list access for speed.

static _extract_image_config(processor)#
static _extract_video_config(processor)#
_estimate_image_tokens(img_meta)#

Estimate token count for one image from its [height, width] metadata.
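One plausible way such an estimate maps a resized image to a token count, assuming Qwen2-VL-style patching (patch size equal to the resize factor, 2x2 spatial merge — both assumptions; the sampler's actual heuristic is not shown here):

```python
def estimate_image_tokens_sketch(resized_height, resized_width,
                                 patch_size=28, merge_size=2):
    """Token count for an already smart-resized image: one token per
    merge_size x merge_size block of patch_size x patch_size patches."""
    grid_h = resized_height // patch_size
    grid_w = resized_width // patch_size
    # Integer division assumes the sides are already multiples of patch_size.
    return (grid_h * grid_w) // (merge_size ** 2)
```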

_estimate_video_tokens(vid_meta)#

Estimate token count for one video from its [total_frames, height, width, fps, duration] metadata.

_estimate_tokens(example)#

Return (text_tokens, media_tokens) for one example.

Uses pre-computed _text_tokens / _media_tokens when available (written by scripts/precompute_tokens.py). Otherwise falls back to heuristic estimation.

set_epoch(epoch)#

Set the epoch for deterministic shuffling (standard PyTorch pattern).

__iter__()#
__len__()#