nemo_automodel.components.datasets.vlm.collate_fns#

Module Contents#

Functions#

_find_pattern_indices

_extract_assistant_text

_decode_single_token

Decode a single token id across tokenizer implementations.

build_labels

Construct label and optional loss-mask tensors aligned to assistant responses.

phi4_mm_collate_fn

Collate function for Phi-4 MM model audio input.

qwen2_5_collate_fn

Collate function for Qwen2.5 VL model.

qwen3_omni_collate_fn

Collate function for Qwen3 Omni processors.

kimi_vl_collate_fn

Collate function for KimiVL processors.

_expand_image_tokens

Expand single image placeholder tokens to the correct number based on grid_thws.

kimi_k25_vl_collate_fn

Collate function for Kimi K2.5 VL processors with pre-expanded image tokens.

nemotron_parse_collate_fn

Collate function for NVIDIA Nemotron-Parse models.

default_collate_fn

Default collate function for multimodal VLM datasets.

Data#

API#

nemo_automodel.components.datasets.vlm.collate_fns.logger#

getLogger(...)

nemo_automodel.components.datasets.vlm.collate_fns._find_pattern_indices(
template,
pattern,
search_start_index=0,
allow_first_token_mismatch=False,
)#
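The source does not document this helper, but its signature suggests a token-subsequence search used to locate response spans inside a tokenized chat template. A minimal sketch of that behavior, assuming it returns the start and end indices of `pattern` inside `template` (the return convention and the exact matching rules here are assumptions):

```python
import torch

def find_pattern_indices_sketch(template, pattern, search_start_index=0,
                                allow_first_token_mismatch=False):
    # Slide `pattern` over `template`, starting at `search_start_index`.
    # Some chat templates re-tokenize the first response token differently
    # (e.g. a leading-space variant), hence the optional first-token slack.
    template = torch.as_tensor(template)
    pattern = torch.as_tensor(pattern)
    for i in range(search_start_index, len(template) - len(pattern) + 1):
        window = template[i : i + len(pattern)]
        match = window == pattern
        if match.all() or (allow_first_token_mismatch and match[1:].all()):
            return i, i + len(pattern)
    return -1, -1  # pattern not found
```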
nemo_automodel.components.datasets.vlm.collate_fns._extract_assistant_text(message: Dict[str, Any]) → str#
nemo_automodel.components.datasets.vlm.collate_fns._decode_single_token(tokenizer, token_id: int) → str#

Decode a single token id across tokenizer implementations.

Some tokenizers accept an int token id, while others require a sequence of ids (e.g., List[int]). We try the common forms in order.
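A minimal sketch of the fallback order described above; the exact exception handling in the real helper is an assumption:

```python
def decode_single_token_sketch(tokenizer, token_id: int) -> str:
    # Try the plain int form first, then fall back to a one-element list,
    # since tokenizer implementations disagree on the accepted input type.
    try:
        return tokenizer.decode(token_id)
    except (TypeError, ValueError):
        return tokenizer.decode([token_id])
```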

nemo_automodel.components.datasets.vlm.collate_fns.build_labels(
input_ids_batch: torch.Tensor,
conversations: Sequence[Sequence[Dict[str, Any]]],
processor,
) → torch.Tensor#

Construct label and optional loss-mask tensors aligned to assistant responses.
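The docstring implies the standard supervised-fine-tuning labeling scheme: tokens outside assistant responses are ignored by the loss. A simplified sketch, assuming the common -100 ignore index and that assistant spans have already been located by token-pattern search (the span-finding step and variable names are hypothetical):

```python
import torch

IGNORE_INDEX = -100  # standard ignore index for torch.nn.CrossEntropyLoss

def build_labels_sketch(input_ids_batch, assistant_spans_per_sample):
    # assistant_spans_per_sample: for each sample, a list of (start, end)
    # index pairs covering the assistant-response tokens.
    labels = torch.full_like(input_ids_batch, IGNORE_INDEX)
    for row, spans in enumerate(assistant_spans_per_sample):
        for start, end in spans:
            # Only assistant tokens contribute to the loss.
            labels[row, start:end] = input_ids_batch[row, start:end]
    return labels
```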

nemo_automodel.components.datasets.vlm.collate_fns.phi4_mm_collate_fn(examples, processor)#

Collate function for Phi-4 MM model audio input.

nemo_automodel.components.datasets.vlm.collate_fns.qwen2_5_collate_fn(
examples: list,
processor,
) → dict[str, torch.Tensor]#

Collate function for Qwen2.5 VL model.

nemo_automodel.components.datasets.vlm.collate_fns.qwen3_omni_collate_fn(
examples: Sequence[Dict[str, Any]],
processor,
use_audio_in_video: bool = False,
) → Dict[str, torch.Tensor]#

Collate function for Qwen3 Omni processors.

nemo_automodel.components.datasets.vlm.collate_fns.kimi_vl_collate_fn(
examples: Sequence[Dict[str, Any]],
processor,
max_length: Optional[int] = None,
) → Dict[str, torch.Tensor]#

Collate function for KimiVL processors.

nemo_automodel.components.datasets.vlm.collate_fns._expand_image_tokens(
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
grid_thws: torch.Tensor,
media_token_id: int,
merge_kernel_size: Tuple[int, int] = (2, 2),
) → Tuple[torch.Tensor, torch.Tensor]#

Expand single image placeholder tokens to the correct number based on grid_thws.

For pipeline parallelism (PP), this ensures the sequence length is fixed BEFORE the model forward pass, eliminating dynamic sequence expansion inside the model; a sketch of the expansion follows the parameter list.

Assumes 1 image per sample (1 placeholder per sequence).

Parameters:
  • input_ids – (seq_len,) tensor with 1 media_token_id placeholder

  • attention_mask – (seq_len,) tensor

  • grid_thws – (1, 3) tensor with [t, h, w] for the single image

  • media_token_id – Token ID of the image placeholder

  • merge_kernel_size – Vision tower’s patch merge kernel, default (2, 2)

Returns:

  • expanded_input_ids – Input IDs with the placeholder expanded to N tokens

  • expanded_attention_mask – Attention mask expanded accordingly

Return type:

Tuple[torch.Tensor, torch.Tensor]
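A sketch of the expansion arithmetic described above, assuming one placeholder per sequence and N = t * (h // kh) * (w // kw) under the merge kernel (the exact formula is inferred from the Kimi K2.5 docstring below, which uses (h//2)*(w//2)):

```python
import torch

def expand_image_tokens_sketch(input_ids, attention_mask, grid_thws,
                               media_token_id, merge_kernel_size=(2, 2)):
    t, h, w = grid_thws[0].tolist()
    kh, kw = merge_kernel_size
    n = t * (h // kh) * (w // kw)  # tokens the vision tower will emit
    pos = (input_ids == media_token_id).nonzero(as_tuple=True)[0][0].item()
    # Splice n copies of the placeholder in, so the sequence length is
    # already final before the forward pass (required for pipeline parallelism).
    expanded_ids = torch.cat([
        input_ids[:pos],
        input_ids.new_full((n,), media_token_id),
        input_ids[pos + 1:],
    ])
    expanded_mask = torch.cat([
        attention_mask[:pos],
        attention_mask.new_ones(n),
        attention_mask[pos + 1:],
    ])
    return expanded_ids, expanded_mask
```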

nemo_automodel.components.datasets.vlm.collate_fns.kimi_k25_vl_collate_fn(
examples: Sequence[Dict[str, Any]],
processor,
max_length: Optional[int] = None,
) → Dict[str, torch.Tensor]#

Collate function for Kimi K2.5 VL processors with pre-expanded image tokens.

For pipeline parallelism, this function:

  1. Processes each sample to get input_ids with 1 placeholder per image

  2. Pre-expands the placeholder to N tokens (N = (h//2)*(w//2) from grid_thws)

  3. Pads all sequences to a fixed max_length

This ensures the model forward pass does not change the sequence length dynamically; a padding sketch follows below.
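A condensed sketch of step 3 for a single sample, assuming right-padding with the tokenizer's pad token (function and variable names are illustrative; step 2 follows the expansion sketch above):

```python
import torch

def pad_to_fixed_length(input_ids, attention_mask, max_length, pad_token_id):
    # Right-pad to a fixed max_length so every micro-batch in the
    # pipeline sees the same sequence length.
    pad = max_length - input_ids.size(0)
    if pad < 0:
        raise ValueError("sequence longer than max_length after expansion")
    input_ids = torch.cat([input_ids, input_ids.new_full((pad,), pad_token_id)])
    attention_mask = torch.cat([attention_mask, attention_mask.new_zeros(pad)])
    return input_ids, attention_mask
```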

nemo_automodel.components.datasets.vlm.collate_fns.nemotron_parse_collate_fn(
examples: Sequence[Dict[str, Any]],
processor,
task_prompt: str = '</s><s><predict_bbox><predict_classes><output_markdown>',
) → Dict[str, torch.Tensor]#

Collate function for NVIDIA Nemotron-Parse models.

The Nemotron-Parse processor does not expose a chat template, so we build the prompt + answer string manually, mask the prompt tokens in the labels, and leave the image preprocessing to the processor.
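A simplified sketch of that manual prompt construction and prompt-token masking, assuming a HuggingFace-style tokenizer and the default task prompt; variable names are illustrative:

```python
import torch

IGNORE_INDEX = -100

def nemotron_parse_labels_sketch(tokenizer, task_prompt, answer):
    # No chat template: build "prompt + answer" by plain concatenation.
    prompt_ids = tokenizer(task_prompt, add_special_tokens=False)["input_ids"]
    answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]
    input_ids = torch.tensor(prompt_ids + answer_ids)
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = IGNORE_INDEX  # mask the prompt tokens
    return input_ids, labels
```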

nemo_automodel.components.datasets.vlm.collate_fns.default_collate_fn(
examples: Sequence[Dict[str, Any]],
processor,
max_length: Optional[int] = None,
) → Dict[str, torch.Tensor]#

Default collate function for multimodal VLM datasets.
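Typical usage binds the processor (and optionally max_length) with functools.partial before handing the collate function to a DataLoader, since only the examples argument is supplied per batch; `dataset` and `processor` below are placeholders:

```python
from functools import partial

from torch.utils.data import DataLoader

from nemo_automodel.components.datasets.vlm.collate_fns import default_collate_fn

# `dataset` is your multimodal dataset; `processor` is the model's
# HuggingFace processor instance.
loader = DataLoader(
    dataset,
    batch_size=4,
    collate_fn=partial(default_collate_fn, processor=processor, max_length=4096),
)
```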

nemo_automodel.components.datasets.vlm.collate_fns.COLLATE_FNS#

None
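Given the functions above, COLLATE_FNS presumably acts as a registry mapping processor or model identifiers to their collate functions. A hypothetical dispatch sketch; the key scheme (processor class name) and the dict-like interface are assumptions:

```python
from nemo_automodel.components.datasets.vlm.collate_fns import (
    COLLATE_FNS,
    default_collate_fn,
)

# Hypothetical dispatch: fall back to the default when the processor
# class has no specialized collate function registered.
collate_fn = COLLATE_FNS.get(type(processor).__name__, default_collate_fn)
batch = collate_fn(examples, processor)
```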