nemo_automodel.components.datasets.vlm.samplers

Module Contents

Classes

Name	Description
`LengthGroupedSampler`	Sampler that groups samples by total token count for balanced distributed training.

Functions

Name	Description
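`_smart_resize_image`	Compute the resized (height, width) for an image, matching transformers.models.qwen2_vl.image_processing_qwen2_vl.smart_resize.
`_smart_resize_video`	Compute the resized (height, width) for a video, matching transformers.models.qwen3_vl.video_processing_qwen3_vl.smart_resize.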

Data

logger

API

class nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler(
    dataset,
    seed = 42,
    processor = None,
    max_length = None,
    batch_size = 1
)

Bases: Sampler

Sampler that groups samples by total token count for balanced distributed training.

With shard_data=True, each rank owns a different subset of the data. This sampler sorts each rank’s indices by total tokens (text_tokens + media_tokens) in descending order. All ranks share the same seed and epoch, so position N on every rank corresponds to a sample of similar length, which keeps cross-rank padding minimal.

Per-epoch randomness is achieved by rotating the sorted order by a deterministic random offset (same on every rank).
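The sort-then-rotate idea described above can be sketched in a few lines. This is an illustration only, not the module's implementation: the helper name `epoch_order`, the way the offset is drawn (an RNG seeded with `seed + epoch`), and the sample lengths are all assumptions made for the example.

```python
import random


def epoch_order(lengths, seed=42, epoch=0):
    """Sort indices by length (descending), then rotate by a seeded offset.

    Hypothetical helper: the real sampler may derive its offset differently.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    if not order:
        return []
    # Assumption: every rank seeds the RNG with seed + epoch, so all ranks
    # draw the same offset without communicating.
    offset = random.Random(seed + epoch).randrange(len(order))
    return order[offset:] + order[:offset]


# Two ranks holding different shards of the same size (made-up token counts).
rank0_lengths = [900, 120, 450, 3000, 60, 700]
rank1_lengths = [880, 100, 500, 2900, 75, 650]

for epoch in range(2):
    order0 = epoch_order(rank0_lengths, epoch=epoch)
    order1 = epoch_order(rank1_lengths, epoch=epoch)
    # Position N on each rank pairs samples of similar length.
    pairs = [(rank0_lengths[a], rank1_lengths[b]) for a, b in zip(order0, order1)]
    print(epoch, pairs)
```

Because both ranks sort by length and apply the same offset, the long samples line up with long samples and the short ones with short ones, whichever epoch is used.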

Parameters:

dataset

The dataset to sample from.

seed

Defaults to 42

Base random seed (same value on every rank).

processor

Defaults to None

Optional Hugging Face processor (e.g. Qwen2VLProcessor). Its image_processor / video_processor attributes are read to estimate media token counts accurately via smart_resize.

_image_cfg

_video_cfg

batch_size

= max(1, batch_size)

epoch

= 0

lengths

sorted_indices

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler.__iter__()

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler.__len__()

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler._compute_or_load_lengths(
    dataset
)

Compute token lengths with direct list access for speed.

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler._estimate_image_tokens(
    img_meta
)

Estimate token count for one image from its [height, width] metadata.
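As a rough illustration of how a token count can follow from image dimensions, the sketch below divides an already-resized image into patches and merges neighbouring patches. The function name and the `patch_size=14` and `merge_size=2` defaults are assumptions chosen to agree with the default `factor = 28` of `_smart_resize_image`; the actual method may read these values from the processor.

```python
def estimate_image_tokens(resized_height, resized_width, patch_size=14, merge_size=2):
    """Hypothetical estimate: patches per side, then merge_size x merge_size pooling.

    Assumes the dimensions are already multiples of patch_size * merge_size.
    """
    grid_h = resized_height // patch_size
    grid_w = resized_width // patch_size
    return (grid_h * grid_w) // (merge_size ** 2)


print(estimate_image_tokens(224, 224))  # 16 x 16 patches, merged 2 x 2 -> 64
```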

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler._estimate_tokens(
    example
)

Return (text_tokens, media_tokens) for one example.

Uses pre-computed _text_tokens / _media_tokens when available (written by scripts/precompute_tokens.py). Otherwise falls back to heuristic estimation.
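The lookup-then-fallback behaviour can be pictured as follows. This is a sketch under stated assumptions: examples are treated as plain dicts, the `text` key is hypothetical, and the characters-per-token fallback is a stand-in, since the heuristic the module uses is not described here.

```python
def estimate_tokens(example, chars_per_token=4):
    """Return (text_tokens, media_tokens), preferring pre-computed values."""
    # Pre-computed counts win when both fields are present.
    if "_text_tokens" in example and "_media_tokens" in example:
        return example["_text_tokens"], example["_media_tokens"]
    # Stand-in heuristic (an assumption, not the module's): about 4 chars per token.
    text = example.get("text", "")
    text_tokens = max(1, len(text) // chars_per_token)
    media_tokens = 0  # media estimation omitted in this sketch
    return text_tokens, media_tokens


print(estimate_tokens({"_text_tokens": 120, "_media_tokens": 256}))
print(estimate_tokens({"text": "Describe the image in detail."}))
```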

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler._estimate_video_tokens(
    vid_meta
)

Estimate token count for one video from its [total_frames, height, width, fps, duration] metadata.

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler._extract_image_config(
    processor
)

staticmethod

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler._extract_video_config(
    processor
)

staticmethod

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler._get_raw_samples(
    dataset
)

staticmethod

Unwrap dataset wrappers to get the underlying list for direct access.

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler.set_epoch(
    epoch
)

Set the epoch for deterministic shuffling (standard PyTorch pattern).
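The pattern referred to here is to call `set_epoch` at the start of every epoch, before iterating. The sketch below uses a hypothetical stand-in class, `ToySampler`, with a plain seeded shuffle so that it runs without PyTorch; it shows only the calling convention, not how `LengthGroupedSampler` orders its indices.

```python
import random


class ToySampler:
    """Hypothetical stand-in that reorders deterministically from seed + epoch."""

    def __init__(self, num_samples, seed=42):
        self.num_samples = num_samples
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch

    def __iter__(self):
        order = list(range(self.num_samples))
        random.Random(self.seed + self.epoch).shuffle(order)
        return iter(order)

    def __len__(self):
        return self.num_samples


sampler = ToySampler(6)
for epoch in range(3):
    sampler.set_epoch(epoch)  # without this call, every epoch reuses epoch 0's order
    print(epoch, list(sampler))
```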

nemo_automodel.components.datasets.vlm.samplers._smart_resize_image(
    height,
    width,
    factor = 28,
    min_pixels = 56 * 56,
    max_pixels = 14 * 14 * 4 * 1280
)

Compute the resized (height, width) for an image, matching transformers.models.qwen2_vl.image_processing_qwen2_vl.smart_resize.
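A minimal sketch of the round-and-clamp scheme such a resize function typically follows, using the default values from the signature above. It is written from the general shape of the Qwen2-VL approach rather than copied from either library, and it omits input validation (for example aspect-ratio checks), so treat the details as assumptions.

```python
import math


def smart_resize_image(height, width, factor=28,
                       min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    """Round both sides to multiples of `factor`, keeping the area within bounds."""
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Too large: scale down, rounding toward zero so the budget holds.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = max(factor, math.floor(height / beta / factor) * factor)
        w_bar = max(factor, math.floor(width / beta / factor) * factor)
    elif h_bar * w_bar < min_pixels:
        # Too small: scale up, rounding away from zero so the minimum holds.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


print(smart_resize_image(224, 224))    # already a multiple of 28 and within bounds
print(smart_resize_image(2000, 3000))  # scaled down to fit max_pixels
```

Rounding to a multiple of `factor` keeps the resized image divisible into whole patches, which is what makes a token estimate from the resized dimensions straightforward.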

nemo_automodel.components.datasets.vlm.samplers._smart_resize_video(
    num_frames,
    height,
    width,
    temporal_factor = 2,
    factor = 32,
    min_pixels = 128 * 128,
    max_pixels = 16 * 16 * 2 * 2 * 2 * 6144
)

Compute the resized (height, width) for a video, matching transformers.models.qwen3_vl.video_processing_qwen3_vl.smart_resize.

nemo_automodel.components.datasets.vlm.samplers.logger = logging.getLogger(__name__)