nemo_automodel.components.datasets.vlm.collate_fns#
Module Contents#
Functions#
| Function | Description |
|---|---|
| `_decode_single_token` | Decode a single token id across tokenizer implementations. |
| `build_labels` | Construct label and optional loss-mask tensors aligned to assistant responses. |
| `phi4_mm_collate_fn` | Collate function for Phi-4 MM model audio inputs. |
| `qwen2_5_collate_fn` | Collate function for Qwen2.5 VL model. |
| `qwen3_omni_collate_fn` | Collate function for Qwen3 Omni processors. |
| `kimi_vl_collate_fn` | Collate function for KimiVL processors. |
| `_expand_image_tokens` | Expand single image placeholder tokens to the correct number based on grid_thws. |
| `kimi_k25_vl_collate_fn` | Collate function for Kimi K2.5 VL processors with pre-expanded image tokens. |
| `nemotron_parse_collate_fn` | Collate function for NVIDIA Nemotron-Parse models. |
| `default_collate_fn` | Default collate function for multimodal VLM datasets. |
Data#
`logger`, `COLLATE_FNS`
API#
- nemo_automodel.components.datasets.vlm.collate_fns.logger#
'getLogger(...)'
- nemo_automodel.components.datasets.vlm.collate_fns._find_pattern_indices(
- template,
- pattern,
- search_start_index=0,
- allow_first_token_mismatch=False,
- )#
- nemo_automodel.components.datasets.vlm.collate_fns._extract_assistant_text(message: Dict[str, Any]) -> str#
- nemo_automodel.components.datasets.vlm.collate_fns._decode_single_token(tokenizer, token_id: int) -> str#
Decode a single token id across tokenizer implementations.
Some tokenizers accept an `int` token id, while others require a sequence of ids (e.g., `List[int]`). We try the common forms in order.
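A minimal sketch of the fallback order described above, assuming a Hugging Face-style `tokenizer.decode`; the helper name and the exception types caught are illustrative, not the module's exact implementation:

```python
# Illustrative sketch (assumed error handling): try the bare int form first,
# then fall back to passing a one-element list of ids.
def decode_single_token_sketch(tokenizer, token_id: int) -> str:
    try:
        return tokenizer.decode(token_id)    # tokenizers that accept a bare int
    except (TypeError, ValueError):
        return tokenizer.decode([token_id])  # tokenizers that require List[int]
```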
- nemo_automodel.components.datasets.vlm.collate_fns.build_labels(
- input_ids_batch: torch.Tensor,
- conversations: Sequence[Sequence[Dict[str, Any]]],
- processor,
- )#
Construct label and optional loss-mask tensors aligned to assistant responses.
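The labeling strategy can be pictured with the sketch below, assuming the standard -100 ignore index and that the assistant-response token spans have already been located (for example via `_find_pattern_indices`); the function name and signature are illustrative, not the module's exact code:

```python
import torch

IGNORE_INDEX = -100  # assumption: the usual cross-entropy ignore index

def build_labels_sketch(input_ids: torch.Tensor, assistant_spans):
    """input_ids: (seq_len,) token ids; assistant_spans: list of (start, end) index pairs."""
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    loss_mask = torch.zeros_like(input_ids, dtype=torch.bool)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]  # supervise only assistant tokens
        loss_mask[start:end] = True
    return labels, loss_mask
```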
- nemo_automodel.components.datasets.vlm.collate_fns.phi4_mm_collate_fn(examples, processor)#
Collate function for Phi-4 MM model audio inputs.
- nemo_automodel.components.datasets.vlm.collate_fns.qwen2_5_collate_fn(
- examples: list,
- processor,
- )#
Collate function for Qwen2.5 VL model.
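A hedged usage sketch wiring this collate function into a PyTorch DataLoader; the checkpoint name and the dataset variable are placeholders, not values prescribed by this module:

```python
from functools import partial

from torch.utils.data import DataLoader
from transformers import AutoProcessor

from nemo_automodel.components.datasets.vlm.collate_fns import qwen2_5_collate_fn

# Placeholder checkpoint; any Qwen2.5-VL processor should work the same way.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
collate = partial(qwen2_5_collate_fn, processor=processor)

# my_vlm_dataset is a placeholder for a dataset yielding conversation examples.
# loader = DataLoader(my_vlm_dataset, batch_size=2, collate_fn=collate)
```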
- nemo_automodel.components.datasets.vlm.collate_fns.qwen3_omni_collate_fn(
- examples: Sequence[Dict[str, Any]],
- processor,
- use_audio_in_video: bool = False,
- )#
Collate function for Qwen3 Omni processors.
- nemo_automodel.components.datasets.vlm.collate_fns.kimi_vl_collate_fn(
- examples: Sequence[Dict[str, Any]],
- processor,
- max_length: Optional[int] = None,
- )#
Collate function for KimiVL processors.
- nemo_automodel.components.datasets.vlm.collate_fns._expand_image_tokens(
- input_ids: torch.Tensor,
- attention_mask: torch.Tensor,
- grid_thws: torch.Tensor,
- media_token_id: int,
- merge_kernel_size: Tuple[int, int] = (2, 2),
- )#
Expand single image placeholder tokens to the correct number based on grid_thws.
For pipeline parallelism (PP), this ensures the sequence length is fixed before the model forward pass, eliminating dynamic sequence expansion inside the model; a sketch of the expansion arithmetic follows the Returns section below.
Assumes 1 image per sample (1 placeholder per sequence).
- Parameters:
input_ids – (seq_len,) tensor with 1 media_token_id placeholder
attention_mask – (seq_len,) tensor
grid_thws – (1, 3) tensor with [t, h, w] for the single image
media_token_id – Token ID of the image placeholder
merge_kernel_size – Vision tower’s patch merge kernel, default (2, 2)
- Returns:
expanded_input_ids – Input IDs with the placeholder expanded to N tokens
expanded_attention_mask – Attention mask expanded accordingly
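A sketch of the expansion arithmetic under the stated assumptions (one placeholder per sequence, merge kernel (kh, kw)): the image occupies N = t * (h // kh) * (w // kw) positions, so the placeholder is repeated N times in both tensors. The function name is illustrative:

```python
import torch

def expand_image_tokens_sketch(input_ids, attention_mask, grid_thw, media_token_id, kernel=(2, 2)):
    # grid_thw: (1, 3) tensor holding [t, h, w] for the single image.
    t, h, w = (int(v) for v in grid_thw.reshape(-1).tolist())
    n_tokens = t * (h // kernel[0]) * (w // kernel[1])
    # A single placeholder per sequence is assumed, as in the docstring above.
    pos = int((input_ids == media_token_id).nonzero(as_tuple=True)[0].item())
    expanded_ids = torch.cat([
        input_ids[:pos],
        input_ids.new_full((n_tokens,), media_token_id),
        input_ids[pos + 1:],
    ])
    expanded_mask = torch.cat([
        attention_mask[:pos],
        attention_mask.new_ones(n_tokens),
        attention_mask[pos + 1:],
    ])
    return expanded_ids, expanded_mask
```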
- nemo_automodel.components.datasets.vlm.collate_fns.kimi_k25_vl_collate_fn(
- examples: Sequence[Dict[str, Any]],
- processor,
- max_length: Optional[int] = None,
- )#
Collate function for Kimi K2.5 VL processors with pre-expanded image tokens.
For pipeline parallelism, this function:
1. Processes each sample to get input_ids with one placeholder per image.
2. Pre-expands each placeholder to N tokens (N = (h // 2) * (w // 2) from grid_thws).
3. Pads all sequences to a fixed max_length.
This ensures the model forward pass does not change the sequence length dynamically; a sketch of the padding step follows below.
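A sketch of the fixed-length padding in step 3, assuming right padding, a pad_token_id supplied by the processor's tokenizer, and -100 as the label fill; none of these names are taken from the module itself:

```python
import torch

def pad_to_fixed_length_sketch(input_ids, attention_mask, labels, max_length, pad_token_id):
    pad = max_length - input_ids.size(0)
    if pad < 0:
        raise ValueError("sequence is longer than max_length")
    return (
        torch.cat([input_ids, input_ids.new_full((pad,), pad_token_id)]),
        torch.cat([attention_mask, attention_mask.new_zeros(pad)]),
        torch.cat([labels, labels.new_full((pad,), -100)]),  # padded positions never contribute to the loss
    )
```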
- nemo_automodel.components.datasets.vlm.collate_fns.nemotron_parse_collate_fn(
- examples: Sequence[Dict[str, Any]],
- processor,
- task_prompt: str = '</s><s><predict_bbox><predict_classes><output_markdown>',
- )#
Collate function for NVIDIA Nemotron-Parse models.
The Nemotron-Parse processor does not expose a chat template, so we build the prompt + answer string manually, mask the prompt tokens, and keep the image preprocessing handled by the processor.
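A hedged sketch of the manual prompt/answer handling described above: prompt and answer are concatenated into one string, and the label positions covering the prompt are masked so only answer tokens are supervised. The tokenizer call pattern and the -100 ignore index are assumptions for illustration:

```python
import torch

def mask_prompt_tokens_sketch(tokenizer, task_prompt: str, answer: str):
    prompt_ids = tokenizer(task_prompt, add_special_tokens=False)["input_ids"]
    full_ids = tokenizer(task_prompt + answer, add_special_tokens=False)["input_ids"]
    labels = torch.tensor(full_ids)
    labels[: len(prompt_ids)] = -100  # prompt tokens do not contribute to the loss
    return torch.tensor(full_ids), labels
```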
- nemo_automodel.components.datasets.vlm.collate_fns.default_collate_fn(
- examples: Sequence[Dict[str, Any]],
- processor,
- max_length: Optional[int] = None,
- )#
Default collate function for multimodal VLM datasets.
- nemo_automodel.components.datasets.vlm.collate_fns.COLLATE_FNS#
None
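Assuming COLLATE_FNS maps processor class names to the collate functions documented above (the exact key scheme is not stated here and is an assumption), a lookup could look like this:

```python
from nemo_automodel.components.datasets.vlm.collate_fns import COLLATE_FNS, default_collate_fn

def pick_collate_fn(processor):
    # Assumed convention: keys are processor class names, e.g. "Qwen2_5_VLProcessor".
    return COLLATE_FNS.get(type(processor).__name__, default_collate_fn)
```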