bridge.models.nemotron_omni.data.collate_fn#

Nemotron Omni collator implementations.

Module Contents#

Functions#

nemotron_omni_collate_fn

Collate function for Nemotron Omni model (vision + audio + language).

Data#

API#

bridge.models.nemotron_omni.data.collate_fn.CHATML_ASSISTANT_START#

'<|im_start|>assistant\n'

bridge.models.nemotron_omni.data.collate_fn.CHATML_TURN_END#

'<|im_end|>'

bridge.models.nemotron_omni.data.collate_fn.nemotron_omni_collate_fn(
examples: list,
processor,
start_of_response_token=None,
*,
pack_sequences: bool = False,
) → dict[str, torch.Tensor]#

Collate function for Nemotron Omni model (vision + audio + language).

Extends nemotron_nano_v2_vl_collate_fn with audio support. Each example may carry an audio_path field pointing to a 16 kHz mono WAV file. Audio is converted to mel spectrograms and added to the batch as sound_clips / sound_length tensors consumed by LLaVAModel.forward().
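Since the collator expects each audio_path to point at a 16 kHz mono WAV file, it can help to check files before they reach the data loader. The following is a minimal sketch using only the Python standard library; the helper name is_16khz_mono_wav is hypothetical and is not part of this module, and the check covers only what the wave module can read (PCM WAV).

```python
import wave


def is_16khz_mono_wav(path):
    """Return True if `path` is a PCM WAV file with one channel at 16 kHz.

    Hypothetical pre-flight helper; not part of the collate_fn module.
    """
    with wave.open(path, "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() == 16000
```

Files that fail this check would need to be resampled or downmixed beforehand; whether the collator itself performs any such conversion is not stated here.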

When pack_sequences=True, the samples in the microbatch are concatenated along the sequence dimension into a single [1, sum(L_i)] batch, and cu_seqlens, cu_seqlens_unpadded, cu_seqlens_argmin, and max_seqlen are emitted so that TE's THD attention kernels can handle per-sample masking without an attention mask. Packing is only meaningful when mbs > 1.
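To illustrate the packed layout, here is a minimal sketch of how cumulative sequence lengths mark sample boundaries in a [1, sum(L_i)] batch. It uses plain Python lists rather than torch tensors, the function name pack_token_ids and the "tokens" key are hypothetical, and it does not attempt to reproduce cu_seqlens_unpadded or cu_seqlens_argmin, whose exact semantics are not described here.

```python
from itertools import accumulate


def pack_token_ids(samples):
    """Concatenate variable-length samples into one row and record boundaries.

    Illustrative only; the real collator returns torch tensors.
    """
    lengths = [len(s) for s in samples]
    # Single row holding every token: conceptually shape [1, sum(L_i)].
    packed = [[tok for s in samples for tok in s]]
    # cu_seqlens[i] is the start offset of sample i; the last entry is the total.
    cu_seqlens = [0, *accumulate(lengths)]
    return {"tokens": packed, "cu_seqlens": cu_seqlens, "max_seqlen": max(lengths)}


batch = pack_token_ids([[11, 12, 13], [21, 22], [31, 32, 33, 34]])
print(batch["cu_seqlens"])  # [0, 3, 5, 9]
print(batch["max_seqlen"])  # 4
```

Sample i occupies positions cu_seqlens[i] up to (but not including) cu_seqlens[i + 1], which is how a variable-length attention kernel can keep samples from attending to each other without an explicit mask.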