nemo_automodel.components.datasets.vlm.samplers#

Module Contents#

Classes#

LengthGroupedSampler

Sampler that groups samples by total token count for balanced distributed training.

Functions#

_smart_resize_image

Compute the resized (height, width) for an image, matching transformers.models.qwen2_vl.image_processing_qwen2_vl.smart_resize.

_smart_resize_video

Compute the resized (height, width) for a video, matching transformers.models.qwen3_vl.video_processing_qwen3_vl.smart_resize.

Data#

API#

nemo_automodel.components.datasets.vlm.samplers.logger#

'getLogger(…)'

nemo_automodel.components.datasets.vlm.samplers._smart_resize_image(
height,
width,
factor=28,
min_pixels=56 * 56,
max_pixels=14 * 14 * 4 * 1280,
)#

Compute the resized (height, width) for an image, matching transformers.models.qwen2_vl.image_processing_qwen2_vl.smart_resize.
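The round-then-clamp behavior of this kind of smart resize can be sketched as follows (a simplified reimplementation for illustration, not this module's actual code; the real transformers implementation may differ in rounding details):

```python
import math

def smart_resize_sketch(height, width, factor=28,
                        min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    """Round (height, width) to multiples of `factor`, keeping the total
    pixel count within [min_pixels, max_pixels] while roughly preserving
    the aspect ratio."""
    # Round each side to the nearest multiple of the patch factor.
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Scale both sides down so the area fits under max_pixels.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Scale both sides up so the area reaches at least min_pixels.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar
```

The defaults mirror the signature above: `factor=28` matches the vision patch size, and the clamping bounds keep very small or very large images within the processor's supported token budget.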

nemo_automodel.components.datasets.vlm.samplers._smart_resize_video(
num_frames,
height,
width,
temporal_factor=2,
factor=32,
min_pixels=128 * 128,
max_pixels=16 * 16 * 2 * 2 * 2 * 6144,
)#

Compute the resized (height, width) for a video, matching transformers.models.qwen3_vl.video_processing_qwen3_vl.smart_resize.
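The video variant adds a temporal dimension; a hedged sketch of the analogous logic (illustrative only — the return shape and rounding of the real transformers helper may differ):

```python
import math

def smart_resize_video_sketch(num_frames, height, width, temporal_factor=2,
                              factor=32, min_pixels=128 * 128,
                              max_pixels=16 * 16 * 2 * 2 * 2 * 6144):
    """Illustrative video resize: frames padded up to a multiple of the
    temporal patch size, spatial sides rounded and clamped as for images."""
    # Round the frame count up to a multiple of the temporal patch size.
    t_bar = math.ceil(num_frames / temporal_factor) * temporal_factor
    # Spatial sides follow the same round-then-clamp scheme as the image path.
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return t_bar, h_bar, w_bar
```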

class nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler(
dataset,
seed=42,
processor=None,
max_length=None,
batch_size=1,
)#

Bases: torch.utils.data.Sampler

Sampler that groups samples by total token count for balanced distributed training.

With shard_data=True each rank owns a different subset of data. This sampler sorts every rank’s indices by total tokens (text_tokens + media_tokens, descending). All ranks share the same seed + epoch so position N on every rank corresponds to a sample of similar length, keeping cross-rank padding minimal.

Per-epoch randomness is achieved by rotating the sorted order by a deterministic random offset (same on every rank).
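The rotation scheme described above can be sketched as follows (illustrative names; the sampler's real implementation is not shown in this reference):

```python
import random

def epoch_order(sorted_indices, seed, epoch):
    """Rotate a length-sorted index list by a deterministic per-epoch offset.

    Every rank calls this with the same seed and epoch, so the offset is
    identical everywhere and position N on each rank still lines up with
    samples of similar length."""
    n = len(sorted_indices)
    # Same seed + epoch on every rank -> same offset on every rank.
    offset = random.Random(seed + epoch).randrange(n)
    return sorted_indices[offset:] + sorted_indices[:offset]
```

Because a rotation is a permutation, every sample is still visited exactly once per epoch, while the descending-length ordering (and therefore the cross-rank length alignment) is preserved.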

Parameters:
  • dataset – The dataset to sample from.

  • seed – Base random seed (same value on every rank).

  • processor – Optional HuggingFace processor (e.g. Qwen2VLProcessor). Used to read image_processor / video_processor attributes for accurate media token estimation via smart_resize.

Initialization

static _get_raw_samples(dataset)#

Unwrap dataset wrappers to get the underlying list for direct access.

_compute_or_load_lengths(dataset)#

Compute token lengths with direct list access for speed.

static _extract_image_config(processor)#
static _extract_video_config(processor)#
_estimate_image_tokens(img_meta)#

Estimate token count for one image from its [height, width] metadata.
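One plausible way such an estimate maps a resized image to a token count, assuming Qwen2-VL-style patching (patch size equal to the resize factor, 2x2 spatial merge — both assumptions; the sampler's actual heuristic is not shown here):

```python
def estimate_image_tokens_sketch(resized_height, resized_width,
                                 patch_size=28, merge_size=2):
    """Token count for an already smart-resized image: one token per
    merge_size x merge_size block of patch_size x patch_size patches."""
    grid_h = resized_height // patch_size
    grid_w = resized_width // patch_size
    # Integer division assumes the sides are already multiples of patch_size.
    return (grid_h * grid_w) // (merge_size ** 2)
```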

_estimate_video_tokens(vid_meta)#

Estimate token count for one video from its [total_frames, height, width, fps, duration] metadata.

_estimate_tokens(example)#

Return (text_tokens, media_tokens) for one example.

Uses pre-computed _text_tokens / _media_tokens when available (written by scripts/precompute_tokens.py). Otherwise falls back to heuristic estimation.

set_epoch(epoch)#

Set the epoch for deterministic shuffling (standard PyTorch pattern).

__iter__()#
__len__()#