nemo_automodel.components.datasets.vlm.samplers

Module Contents

Classes

Name | Description
LengthGroupedSampler | Sampler that groups samples by total token count for balanced distributed training.

Functions

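Name | Description
_smart_resize_image | Compute the resized (height, width) for an image, matching transformers.models.qwen2_vl.image_processing_qwen2_vl.smart_resize.
_smart_resize_video | Compute the resized (height, width) for a video, matching transformers.models.qwen3_vl.video_processing_qwen3_vl.smart_resize.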

Data

logger

API

class nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler(
dataset,
seed = 42,
processor = None,
max_length = None,
batch_size = 1
)

Bases: Sampler

Sampler that groups samples by total token count for balanced distributed training.

With shard_data=True, each rank owns a different subset of the data. This sampler sorts each rank’s indices by total tokens (text_tokens + media_tokens) in descending order. All ranks share the same seed + epoch, so position N on every rank corresponds to a sample of similar length, which keeps cross-rank padding minimal.

Per-epoch randomness is achieved by rotating the sorted order by a deterministic random offset (same on every rank).
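The sort-then-rotate idea described above can be sketched in a few lines. This is a minimal illustration and not the library's code: the helper name `rotated_length_order`, the use of `random.Random(seed + epoch)` to draw the offset, and the assumption that all ranks hold shards of equal size are all assumptions made for the example.

```python
import random

def rotated_length_order(lengths, seed=42, epoch=0):
    """Return indices sorted by length (descending), rotated by a seeded offset.

    Hypothetical helper for illustration only.
    """
    # Longest samples first.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    if not order:
        return order
    # Every rank seeds the RNG identically, so every rank draws the same offset
    # (assuming equal shard sizes).
    offset = random.Random(seed + epoch).randrange(len(order))
    return order[offset:] + order[:offset]

# Two ranks holding different shards of the same size.
rank0 = [120, 900, 430, 75]
rank1 = [880, 60, 410, 150]
order0 = rotated_length_order(rank0, epoch=3)
order1 = rotated_length_order(rank1, epoch=3)
for pos, (i, j) in enumerate(zip(order0, order1)):
    # Position `pos` holds the sample with the same length rank on both shards.
    print(pos, rank0[i], rank1[j])
```

Because the rotation preserves the sorted sequence (apart from the wrap-around point), the sample at a given position has the same length rank on every shard, which is what keeps padding across ranks small.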

Parameters:

dataset

The dataset to sample from.

seed
Defaults to 42

Base random seed (same value on every rank).

processor
Defaults to None

Optional HuggingFace processor (e.g. Qwen2VLProcessor). Used to read image_processor / video_processor attributes for accurate media token estimation via smart_resize.

_image_cfg

_video_cfg

batch_size = max(1, batch_size)

epoch = 0

lengths

sorted_indices

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler.__iter__()

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler.__len__()

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler._compute_or_load_lengths(
dataset
)

Compute token lengths with direct list access for speed.

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler._estimate_image_tokens(
img_meta
)

Estimate token count for one image from its [height, width] metadata.
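As a rough illustration of how a token count can follow from [height, width] metadata, the sketch below counts one token per merged patch cell. The function name, the `patch_size=14` and `merge_size=2` defaults, and the omission of any pixel budget are assumptions for the example; the actual method reads its configuration from the processor and may differ.

```python
def estimate_image_tokens(img_meta, patch_size=14, merge_size=2):
    """Rough token count for an image given [height, width] metadata.

    Hypothetical helper: assumes one token per (patch_size * merge_size)-pixel cell.
    """
    height, width = img_meta
    factor = patch_size * merge_size  # pixels per merged token along one side
    # Snap each side to the nearest whole number of cells, keeping at least one.
    grid_h = max(1, round(height / factor))
    grid_w = max(1, round(width / factor))
    return grid_h * grid_w

print(estimate_image_tokens([448, 672]))  # 16 * 24 = 384
```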

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler._estimate_tokens(
example
)

Return (text_tokens, media_tokens) for one example.

Uses pre-computed _text_tokens / _media_tokens when available (written by scripts/precompute_tokens.py). Otherwise falls back to heuristic estimation.
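The "pre-computed values first, heuristic otherwise" pattern might look like the sketch below. Only the `_text_tokens` and `_media_tokens` keys come from the description above; the `text` and `images` keys, the characters-per-token ratio, and the flat per-image cost are hypothetical placeholders, since the real fallback heuristic is not documented here.

```python
def estimate_tokens(example, chars_per_token=4, tokens_per_image=256):
    """Return (text_tokens, media_tokens) for one example dict.

    Illustrative only: the fallback heuristic below is an assumption.
    """
    text = example.get("_text_tokens")
    media = example.get("_media_tokens")
    if text is not None and media is not None:
        # Pre-computed counts are available, so use them directly.
        return int(text), int(media)
    # Hypothetical fallback: a crude character-based text estimate and a
    # flat cost per image.
    text_est = max(1, len(example.get("text", "")) // chars_per_token)
    media_est = tokens_per_image * len(example.get("images", []))
    return text_est, media_est

print(estimate_tokens({"_text_tokens": 10, "_media_tokens": 20}))  # (10, 20)
print(estimate_tokens({"text": "a" * 40, "images": ["img0", "img1"]}))  # (10, 512)
```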

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler._estimate_video_tokens(
vid_meta
)

Estimate token count for one video from its [total_frames, height, width, fps, duration] metadata.

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler._extract_image_config(
processor
)
staticmethod

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler._extract_video_config(
processor
)
staticmethod

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler._get_raw_samples(
dataset
)
staticmethod

Unwrap dataset wrappers to get the underlying list for direct access.

nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler.set_epoch(
epoch
)

Set the epoch for deterministic shuffling (standard PyTorch pattern).
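The pattern referred to here is the one used by torch.utils.data.distributed.DistributedSampler: the training loop calls set_epoch before iterating each epoch so that the ordering changes between epochs but stays reproducible. The toy class below is a stand-in written for illustration (it is not LengthGroupedSampler and uses a plain shuffle rather than a length-based order).

```python
import random

class ToyEpochSampler:
    """Minimal stand-in sampler showing the set_epoch pattern (hypothetical)."""

    def __init__(self, n, seed=42):
        self.n = n
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # Stored only; the ordering is recomputed on the next __iter__ call.
        self.epoch = epoch

    def __iter__(self):
        # Same seed and epoch give the same order on every process.
        rng = random.Random(self.seed + self.epoch)
        order = list(range(self.n))
        rng.shuffle(order)
        return iter(order)

    def __len__(self):
        return self.n

sampler = ToyEpochSampler(6)
for epoch in range(2):
    sampler.set_epoch(epoch)  # call before iterating the epoch
    print(epoch, list(sampler))
```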

nemo_automodel.components.datasets.vlm.samplers._smart_resize_image(
height,
width,
factor = 28,
min_pixels = 56 * 56,
max_pixels = 14 * 14 * 4 * 1280
)

Compute the resized (height, width) for an image, matching transformers.models.qwen2_vl.image_processing_qwen2_vl.smart_resize.
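For orientation, the sketch below follows the general shape of this kind of resize: round each side to a multiple of `factor`, then rescale if the area falls outside the pixel budget. It is a simplified reading and not a copy of either implementation; the name `smart_resize_image_sketch` is hypothetical, and input validation (for example, extreme aspect ratios) is omitted.

```python
import math

def smart_resize_image_sketch(height, width, factor=28,
                              min_pixels=56 * 56,
                              max_pixels=14 * 14 * 4 * 1280):
    """Return (height, width) snapped to multiples of `factor` within a pixel budget."""
    # Round each side to the nearest multiple of `factor`.
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

print(smart_resize_image_sketch(224, 224))  # (224, 224): already aligned and within budget
print(smart_resize_image_sketch(4000, 3000))  # scaled down to fit max_pixels
```

With the defaults, max_pixels evaluates to 1,003,520 and min_pixels to 3,136, so most ordinary photos are scaled down while tiny thumbnails are scaled up.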

nemo_automodel.components.datasets.vlm.samplers._smart_resize_video(
num_frames,
height,
width,
temporal_factor = 2,
factor = 32,
min_pixels = 128 * 128,
max_pixels = 16 * 16 * 2 * 2 * 2 * 6144
)

Compute the resized (height, width) for a video, matching transformers.models.qwen3_vl.video_processing_qwen3_vl.smart_resize.
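The video variant can be sketched the same way, under the assumption that the pixel budget applies to the whole clip (frames × height × width) rather than to a single frame. That assumption, the name `smart_resize_video_sketch`, and the omission of input validation are choices made for this example and should be checked against the actual implementation.

```python
import math

def smart_resize_video_sketch(num_frames, height, width,
                              temporal_factor=2, factor=32,
                              min_pixels=128 * 128,
                              max_pixels=16 * 16 * 2 * 2 * 2 * 6144):
    """Return (height, width) for a clip, budgeting pixels over all frames."""
    # Round spatial sides to multiples of `factor`, frames to `temporal_factor`.
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    t_bar = round(num_frames / temporal_factor) * temporal_factor
    if t_bar * h_bar * w_bar > max_pixels:
        # Clip volume too large: shrink the spatial sides, rounding down.
        beta = math.sqrt((num_frames * height * width) / max_pixels)
        h_bar = max(factor, math.floor(height / beta / factor) * factor)
        w_bar = max(factor, math.floor(width / beta / factor) * factor)
    elif t_bar * h_bar * w_bar < min_pixels:
        # Clip volume too small: grow the spatial sides, rounding up.
        beta = math.sqrt(min_pixels / (num_frames * height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

print(smart_resize_video_sketch(16, 1080, 1920))  # scaled down to fit the clip budget
print(smart_resize_video_sketch(2, 64, 64))       # (96, 96): scaled up to reach min_pixels
```

With the defaults, max_pixels evaluates to 12,582,912, so a longer clip leaves fewer pixels per frame under this reading.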

nemo_automodel.components.datasets.vlm.samplers.logger = logging.getLogger(__name__)