nemo_automodel.components.datasets.vlm.samplers#
Module Contents#
Classes#
LengthGroupedSampler: Sampler that groups samples by total token count for balanced distributed training.
Functions#
_smart_resize_image: Compute the resized (height, width) for an image, matching transformers.models.qwen2_vl.image_processing_qwen2_vl.smart_resize.
_smart_resize_video: Compute the resized (height, width) for a video, matching transformers.models.qwen3_vl.video_processing_qwen3_vl.smart_resize.
Data#
API#
- nemo_automodel.components.datasets.vlm.samplers.logger#
getLogger(…)
- nemo_automodel.components.datasets.vlm.samplers._smart_resize_image(height, width, factor=28, min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280)#
Compute the resized (height, width) for an image, matching transformers.models.qwen2_vl.image_processing_qwen2_vl.smart_resize.
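The defaults above point at the Qwen2-VL resizing scheme: round each side to a multiple of factor, then rescale so the pixel count lands inside the [min_pixels, max_pixels] budget. A minimal sketch of that logic, assuming it mirrors transformers' smart_resize; the function name here is illustrative, not the module's actual symbol:

```python
import math


def smart_resize_image(height, width, factor=28, min_pixels=56 * 56,
                       max_pixels=14 * 14 * 4 * 1280):
    """Round (height, width) to multiples of `factor`, rescaling so the
    total pixel count stays within [min_pixels, max_pixels]."""
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides, flooring onto the factor grid.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: enlarge both sides, ceiling onto the factor grid.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar
```

Because both sides end up as multiples of factor, the result maps cleanly onto a fixed patch grid, which is what makes the downstream token-count estimate exact.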
- nemo_automodel.components.datasets.vlm.samplers._smart_resize_video(num_frames, height, width, temporal_factor=2, factor=32, min_pixels=128 * 128, max_pixels=16 * 16 * 2 * 2 * 2 * 6144)#
Compute the resized (height, width) for a video, matching transformers.models.qwen3_vl.video_processing_qwen3_vl.smart_resize.
- class nemo_automodel.components.datasets.vlm.samplers.LengthGroupedSampler(dataset, seed=42, processor=None, max_length=None, batch_size=1)#
Bases: torch.utils.data.Sampler
Sampler that groups samples by total token count for balanced distributed training.
With shard_data=True, each rank owns a different subset of the data. This sampler sorts every rank's indices by total tokens (text_tokens + media_tokens, descending). All ranks share the same seed + epoch, so position N on every rank corresponds to a sample of similar length, keeping cross-rank padding minimal. Per-epoch randomness is achieved by rotating the sorted order by a deterministic random offset (the same on every rank).
- Parameters:
dataset – The dataset to sample from.
seed – Base random seed (same value on every rank).
processor – Optional HuggingFace processor (e.g. Qwen2VLProcessor). Used to read the image_processor/video_processor attributes for accurate media token estimation via smart_resize.
Initialization
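The ordering scheme described above (sort indices by length descending, then rotate by a seeded per-epoch offset shared across ranks) can be illustrated with a toy stand-in. This is a sketch of the idea only, not the actual LengthGroupedSampler implementation; the class name and its lengths argument are hypothetical:

```python
import torch
from torch.utils.data import Sampler


class ToyLengthGroupedSampler(Sampler):
    """Toy illustration: deterministic length-sorted order, rotated per epoch."""

    def __init__(self, lengths, seed=42):
        self.lengths = lengths  # precomputed token count per sample
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # Standard PyTorch pattern: call on every rank before each epoch.
        self.epoch = epoch

    def __iter__(self):
        # Sort by length descending, so position N holds a similar-length
        # sample on every rank.
        order = sorted(range(len(self.lengths)), key=lambda i: -self.lengths[i])
        # seed + epoch is identical across ranks, so the rotation offset
        # (and therefore the full ordering) matches everywhere.
        g = torch.Generator().manual_seed(self.seed + self.epoch)
        offset = int(torch.randint(len(order), (1,), generator=g))
        return iter(order[offset:] + order[:offset])

    def __len__(self):
        return len(self.lengths)
```

Because the offset depends only on seed + epoch, two ranks calling set_epoch with the same epoch produce identical orderings, which is what keeps cross-rank padding aligned.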
- static _get_raw_samples(dataset)#
Unwrap dataset wrappers to get the underlying list for direct access.
- _compute_or_load_lengths(dataset)#
Compute token lengths with direct list access for speed.
- static _extract_image_config(processor)#
- static _extract_video_config(processor)#
- _estimate_image_tokens(img_meta)#
Estimate token count for one image from its [height, width] metadata.
- _estimate_video_tokens(vid_meta)#
Estimate token count for one video from its [total_frames, height, width, fps, duration] metadata.
- _estimate_tokens(example)#
Return
(text_tokens, media_tokens)for one example.Uses pre-computed
_text_tokens/_media_tokenswhen available (written byscripts/precompute_tokens.py). Otherwise falls back to heuristic estimation.
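The precomputed-vs-heuristic fallback can be sketched as follows. The _text_tokens/_media_tokens field names come from the docstring above; the chars-per-token heuristic, the text field name, and the estimate_tokens name are illustrative assumptions, not the module's actual estimator:

```python
def estimate_tokens(example, chars_per_token=4):
    """Return (text_tokens, media_tokens) for one example dict."""
    # Prefer lengths precomputed by scripts/precompute_tokens.py.
    if "_text_tokens" in example and "_media_tokens" in example:
        return example["_text_tokens"], example["_media_tokens"]
    # Fallback heuristic (assumption): rough chars-per-token estimate,
    # counting no media tokens.
    text = str(example.get("text", ""))
    return max(1, len(text) // chars_per_token), 0
```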
- set_epoch(epoch)#
Set the epoch for deterministic shuffling (standard PyTorch pattern).
- __iter__()#
- __len__()#