nemo_automodel.components.datasets.vlm.collate_fns#

Module Contents#

Functions#

_find_pattern_indices

_extract_assistant_text

_decode_single_token

Decode a single token id across tokenizer implementations.

build_labels

Construct label and optional loss-mask tensors aligned to assistant responses.

phi4_mm_collate_fn

Collate function for Phi-4 MM model audio input.

qwen2_5_collate_fn

Collate function for Qwen2.5 VL model.

qwen3_omni_collate_fn

Collate function for Qwen3 Omni processors.

kimi_vl_collate_fn

Collate function for KimiVL processors.

_expand_image_tokens

Expand single image placeholder tokens to the correct number based on grid_thws.

kimi_k25_vl_collate_fn

Collate function for Kimi K2.5 VL processors with pre-expanded image tokens.

nemotron_parse_collate_fn

Collate function for NVIDIA Nemotron-Parse models.

default_collate_fn

Default collate function for multimodal VLM datasets.

Data#

API#

nemo_automodel.components.datasets.vlm.collate_fns.logger#

getLogger(...)

nemo_automodel.components.datasets.vlm.collate_fns._find_pattern_indices(
template,
pattern,
search_start_index=0,
allow_first_token_mismatch=False,
)#
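The source does not document this helper, but its signature suggests a token-subsequence search used to locate response spans inside a tokenized chat template. A minimal sketch of that behavior, assuming it returns the start and end indices of `pattern` inside `template` (the return convention and the exact matching rules here are assumptions):

```python
import torch

def find_pattern_indices_sketch(template, pattern, search_start_index=0,
                                allow_first_token_mismatch=False):
    # Slide `pattern` over `template`, starting at `search_start_index`.
    # Some chat templates re-tokenize the first response token differently
    # (e.g. a leading-space variant), hence the optional first-token slack.
    template = torch.as_tensor(template)
    pattern = torch.as_tensor(pattern)
    for i in range(search_start_index, len(template) - len(pattern) + 1):
        window = template[i : i + len(pattern)]
        match = window == pattern
        if match.all() or (allow_first_token_mismatch and match[1:].all()):
            return i, i + len(pattern)
    return -1, -1  # pattern not found
```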
nemo_automodel.components.datasets.vlm.collate_fns._extract_assistant_text(message: Dict[str, Any]) → str#
nemo_automodel.components.datasets.vlm.collate_fns._decode_single_token(tokenizer, token_id: int) → str#

Decode a single token id across tokenizer implementations.

Some tokenizers accept an int token id, while others require a sequence of ids (e.g., List[int]). We try the common forms in order.
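A minimal sketch of the fallback order described above; the exact exception handling in the real helper is an assumption:

```python
def decode_single_token_sketch(tokenizer, token_id: int) -> str:
    # Try the plain int form first, then fall back to a one-element list,
    # since tokenizer implementations disagree on the accepted input type.
    try:
        return tokenizer.decode(token_id)
    except (TypeError, ValueError):
        return tokenizer.decode([token_id])
```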

nemo_automodel.components.datasets.vlm.collate_fns.build_labels(
input_ids_batch: torch.Tensor,
conversations: Sequence[Sequence[Dict[str, Any]]],
processor,
) → torch.Tensor#

Construct label and optional loss-mask tensors aligned to assistant responses.
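The docstring implies the standard supervised-fine-tuning labeling scheme: tokens outside assistant responses are ignored by the loss. A simplified sketch, assuming the common -100 ignore index and that assistant spans have already been located by token-pattern search (the span-finding step and variable names are hypothetical):

```python
import torch

IGNORE_INDEX = -100  # standard ignore index for torch.nn.CrossEntropyLoss

def build_labels_sketch(input_ids_batch, assistant_spans_per_sample):
    # assistant_spans_per_sample: for each sample, a list of (start, end)
    # index pairs covering the assistant-response tokens.
    labels = torch.full_like(input_ids_batch, IGNORE_INDEX)
    for row, spans in enumerate(assistant_spans_per_sample):
        for start, end in spans:
            # Only assistant tokens contribute to the loss.
            labels[row, start:end] = input_ids_batch[row, start:end]
    return labels
```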

nemo_automodel.components.datasets.vlm.collate_fns.phi4_mm_collate_fn(examples, processor)#

Collate function for Phi-4 MM model audio input.

nemo_automodel.components.datasets.vlm.collate_fns.qwen2_5_collate_fn(
examples: list,
processor,
) → dict[str, torch.Tensor]#

Collate function for Qwen2.5 VL model.

nemo_automodel.components.datasets.vlm.collate_fns.qwen3_omni_collate_fn(
examples: Sequence[Dict[str, Any]],
processor,
use_audio_in_video: bool = False,
) → Dict[str, torch.Tensor]#

Collate function for Qwen3 Omni processors.

nemo_automodel.components.datasets.vlm.collate_fns.kimi_vl_collate_fn(
examples: Sequence[Dict[str, Any]],
processor,
max_length: Optional[int] = None,
) → Dict[str, torch.Tensor]#

Collate function for KimiVL processors.

nemo_automodel.components.datasets.vlm.collate_fns._expand_image_tokens(
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
grid_thws: torch.Tensor,
media_token_id: int,
merge_kernel_size: Tuple[int, int] = (2, 2),
) → Tuple[torch.Tensor, torch.Tensor]#

Expand single image placeholder tokens to the correct number based on grid_thws.

For pipeline parallelism (PP), this ensures the sequence length is fixed BEFORE the model forward pass, eliminating dynamic sequence expansion inside the model; a sketch of the expansion follows the parameter list.

Assumes 1 image per sample (1 placeholder per sequence).

Parameters:
  • input_ids – (seq_len,) tensor with 1 media_token_id placeholder

  • attention_mask – (seq_len,) tensor

  • grid_thws – (1, 3) tensor with [t, h, w] for the single image

  • media_token_id – Token ID of the image placeholder

  • merge_kernel_size – Vision tower’s patch merge kernel, default (2, 2)

Returns:

  • expanded_input_ids – Input IDs with the placeholder expanded to N tokens

  • expanded_attention_mask – Attention mask expanded accordingly

Return type:

Tuple[torch.Tensor, torch.Tensor]
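A sketch of the expansion arithmetic described above, assuming one placeholder per sequence and N = t * (h // kh) * (w // kw) under the merge kernel (the exact formula is inferred from the Kimi K2.5 docstring below, which uses (h//2)*(w//2)):

```python
import torch

def expand_image_tokens_sketch(input_ids, attention_mask, grid_thws,
                               media_token_id, merge_kernel_size=(2, 2)):
    t, h, w = grid_thws[0].tolist()
    kh, kw = merge_kernel_size
    n = t * (h // kh) * (w // kw)  # tokens the vision tower will emit
    pos = (input_ids == media_token_id).nonzero(as_tuple=True)[0][0].item()
    # Splice n copies of the placeholder in, so the sequence length is
    # already final before the forward pass (required for pipeline parallelism).
    expanded_ids = torch.cat([
        input_ids[:pos],
        input_ids.new_full((n,), media_token_id),
        input_ids[pos + 1:],
    ])
    expanded_mask = torch.cat([
        attention_mask[:pos],
        attention_mask.new_ones(n),
        attention_mask[pos + 1:],
    ])
    return expanded_ids, expanded_mask
```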

nemo_automodel.components.datasets.vlm.collate_fns.kimi_k25_vl_collate_fn(
examples: Sequence[Dict[str, Any]],
processor,
max_length: Optional[int] = None,
) → Dict[str, torch.Tensor]#

Collate function for Kimi K2.5 VL processors with pre-expanded image tokens.

For pipeline parallelism, this function:

  1. Processes each sample to get input_ids with 1 placeholder per image

  2. Pre-expands the placeholder to N tokens (N = (h//2)*(w//2) from grid_thws)

  3. Pads all sequences to a fixed max_length

This ensures the model forward pass does not change the sequence length dynamically; a padding sketch follows below.
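A condensed sketch of step 3 for a single sample, assuming right-padding with the tokenizer's pad token (function and variable names are illustrative; step 2 follows the expansion sketch above):

```python
import torch

def pad_to_fixed_length(input_ids, attention_mask, max_length, pad_token_id):
    # Right-pad to a fixed max_length so every micro-batch in the
    # pipeline sees the same sequence length.
    pad = max_length - input_ids.size(0)
    if pad < 0:
        raise ValueError("sequence longer than max_length after expansion")
    input_ids = torch.cat([input_ids, input_ids.new_full((pad,), pad_token_id)])
    attention_mask = torch.cat([attention_mask, attention_mask.new_zeros(pad)])
    return input_ids, attention_mask
```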

nemo_automodel.components.datasets.vlm.collate_fns.nemotron_parse_collate_fn(
examples: Sequence[Dict[str, Any]],
processor,
task_prompt: str = '</s><s><predict_bbox><predict_classes><output_markdown>',
) → Dict[str, torch.Tensor]#

Collate function for NVIDIA Nemotron-Parse models.

The Nemotron-Parse processor does not expose a chat template, so we build the prompt + answer string manually, mask the prompt tokens in the labels, and leave the image preprocessing to the processor.
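A simplified sketch of that manual prompt construction and prompt-token masking, assuming a HuggingFace-style tokenizer and the default task prompt; variable names are illustrative:

```python
import torch

IGNORE_INDEX = -100

def nemotron_parse_labels_sketch(tokenizer, task_prompt, answer):
    # No chat template: build "prompt + answer" by plain concatenation.
    prompt_ids = tokenizer(task_prompt, add_special_tokens=False)["input_ids"]
    answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]
    input_ids = torch.tensor(prompt_ids + answer_ids)
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = IGNORE_INDEX  # mask the prompt tokens
    return input_ids, labels
```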

nemo_automodel.components.datasets.vlm.collate_fns.default_collate_fn(
examples: Sequence[Dict[str, Any]],
processor,
max_length: Optional[int] = None,
) → Dict[str, torch.Tensor]#

Default collate function for multimodal VLM datasets.
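Typical usage binds the processor (and optionally max_length) with functools.partial before handing the collate function to a DataLoader, since only the examples argument is supplied per batch; `dataset` and `processor` below are placeholders:

```python
from functools import partial

from torch.utils.data import DataLoader

from nemo_automodel.components.datasets.vlm.collate_fns import default_collate_fn

# `dataset` is your multimodal dataset; `processor` is the model's
# HuggingFace processor instance.
loader = DataLoader(
    dataset,
    batch_size=4,
    collate_fn=partial(default_collate_fn, processor=processor, max_length=4096),
)
```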

nemo_automodel.components.datasets.vlm.collate_fns.COLLATE_FNS#

None
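Given the functions above, COLLATE_FNS presumably acts as a registry mapping processor or model identifiers to their collate functions. A hypothetical dispatch sketch; the key scheme (processor class name) and the dict-like interface are assumptions:

```python
from nemo_automodel.components.datasets.vlm.collate_fns import (
    COLLATE_FNS,
    default_collate_fn,
)

# Hypothetical dispatch: fall back to the default when the processor
# class has no specialized collate function registered.
collate_fn = COLLATE_FNS.get(type(processor).__name__, default_collate_fn)
batch = collate_fn(examples, processor)
```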