`bridge.data.energon.nemotron_omni_task_encoder`#

Nemotron Omni Energon task encoder – extends HFTaskEncoder with audio.

Adds mel spectrogram extraction and <so_embedding> token insertion so that the training step receives sound_clips / sound_length alongside the standard vision + language tensors.

Module Contents#

Classes#

`NemotronOmniTaskSample`	Encoded sample for Nemotron Omni (vision + audio + language).
`NemotronOmniTaskBatch`	Batched format for Nemotron Omni.
`NemotronOmniTaskEncoder`	Energon task encoder for Nemotron Omni models.

Data#

logger

API#

bridge.data.energon.nemotron_omni_task_encoder.logger#: ‘getLogger(…)’

class bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskSample#

Encoded sample for Nemotron Omni (vision + audio + language).

__key__: str#: None

__subflavors__: Dict#: None

input_ids: torch.Tensor#: None

labels: torch.Tensor#: None

loss_mask: torch.Tensor#: None

visual_tensors: Dict[str, torch.Tensor]#: ‘field(…)’

num_patches: Optional[torch.Tensor]#: None

sound_clips: Optional[torch.Tensor]#: None

sound_length: Optional[torch.Tensor]#: None

imgs_sizes: Optional[torch.Tensor]#: None

num_frames: Optional[torch.Tensor]#: None

num_image_tiles: Optional[torch.Tensor]#: None

class bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskBatch#

Bases: megatron.energon.Batch

Batched format for Nemotron Omni.

__keys__: List[str]#: ‘field(…)’

__subflavors__: List[Dict]#: ‘field(…)’

input_ids: torch.Tensor#: ‘field(…)’

labels: torch.Tensor#: ‘field(…)’

loss_mask: torch.Tensor#: ‘field(…)’

attention_mask: Optional[torch.Tensor]#: ‘field(…)’

position_ids: torch.Tensor#: ‘field(…)’

visual_tensors: Dict[str, Optional[torch.Tensor]]#: ‘field(…)’

num_patches: Optional[torch.Tensor]#: None

sound_clips: Optional[torch.Tensor]#: None

sound_length: Optional[torch.Tensor]#: None

imgs_sizes: Optional[torch.Tensor]#: None

num_frames: Optional[torch.Tensor]#: None

num_image_tiles: Optional[torch.Tensor]#: None

cu_seqlens_q: Optional[torch.Tensor]#: None

cu_seqlens_kv: Optional[torch.Tensor]#: None

cu_seqlens_q_padded: Optional[torch.Tensor]#: None

cu_seqlens_kv_padded: Optional[torch.Tensor]#: None

max_seqlen_q: Optional[torch.Tensor]#: None

max_seqlen_kv: Optional[torch.Tensor]#: None

class bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskEncoder( processor, seq_length: int = 4096, max_audio_duration: float = 30.0, num_mel_bins: int = 128, visual_keys: Sequence[str] = ('pixel_values',), temporal_patch_size: int = 2, video_fps: float = 1.0, video_nframes: int = 8, use_temporal_video_embedder: bool = False, patch_dim: int = 16, pad_to_max_length: bool = False, pad_to_multiple_of: int = 128, enable_in_batch_packing: bool = False, in_batch_packing_pad_to_multiple_of: int = 1, )#

Bases: megatron.energon.DefaultTaskEncoder[megatron.bridge.data.energon.task_encoder_utils.ChatMLSample, bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskSample, bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskBatch, dict]

Energon task encoder for Nemotron Omni models.

Processes ChatML samples that may contain images, videos, and audio waveforms (decoded from .wav fields in the WebDataset shards).

Audio waveforms are converted to mel spectrograms using compute_mel_features, and <so_embedding> placeholder tokens are inserted into input_ids so that LLaVAModel.forward() can replace them with the projected sound embeddings.
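The placeholder insertion described above can be illustrated with a minimal list-based sketch. It is not the encoder's implementation: the token id, the helper name `insert_sound_placeholders`, and the use of -100 as an ignored label are all assumptions made for illustration, and the real encoder works on torch tensors with ids taken from the tokenizer.

```python
from typing import List, Tuple

# Hypothetical id for <so_embedding>; the real id comes from the tokenizer.
SOUND_TOKEN_ID = 9001
# Conventional "ignore" label value; assumed here, not confirmed by the source.
IGNORE_INDEX = -100


def insert_sound_placeholders(
    input_ids: List[int],
    labels: List[int],
    position: int,
    num_sound_tokens: int,
) -> Tuple[List[int], List[int]]:
    """Insert placeholder ids at `position` and mask the matching labels."""
    if not 0 <= position <= len(input_ids):
        raise ValueError("position out of range")
    placeholders = [SOUND_TOKEN_ID] * num_sound_tokens
    new_ids = input_ids[:position] + placeholders + input_ids[position:]
    # Placeholder positions carry no text target, so their labels are masked.
    masked = [IGNORE_INDEX] * num_sound_tokens
    new_labels = labels[:position] + masked + labels[position:]
    return new_ids, new_labels


ids, labs = insert_sound_placeholders([1, 2, 3], [1, 2, 3], position=1, num_sound_tokens=2)
```

In this sketch the model would later overwrite the embeddings at the placeholder positions with projected sound embeddings, which is why the number of placeholders has to match the number of audio embedding frames.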

Parameters:

processor – HF AutoProcessor for the Nemotron Omni model.
seq_length – Maximum sequence length after tokenization.
max_audio_duration – Maximum audio duration in seconds. Longer clips are truncated.
num_mel_bins – Number of mel frequency bins (must match the sound encoder config, typically 128 for Parakeet).
visual_keys – Processor output keys to capture as visual tensors.
pad_to_max_length – Whether collate-time padding should pad non-packed batches to seq_length when supported.
pad_to_multiple_of – Non-packed collate-time padding multiple used when pad_to_max_length is false and supported.
enable_in_batch_packing – Whether to do in-batch sequence packing.
in_batch_packing_pad_to_multiple_of – Per-sample padding multiple used only by the in-batch packed path, typically to satisfy CP/SP divisibility.
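To make the length-related parameters concrete, here is a small arithmetic sketch. The 16 kHz sample rate, the helper names, and the cap of the rounded length at seq_length are assumptions for illustration only; the source does not state them.

```python
import math
from typing import List

# Assumed sample rate; the source does not state the rate the encoder uses.
ASSUMED_SAMPLE_RATE_HZ = 16_000


def max_audio_samples(max_audio_duration: float, sample_rate: int = ASSUMED_SAMPLE_RATE_HZ) -> int:
    """Number of waveform samples kept when truncating to `max_audio_duration` seconds."""
    return int(max_audio_duration * sample_rate)


def truncate_waveform(waveform: List[float], max_audio_duration: float,
                      sample_rate: int = ASSUMED_SAMPLE_RATE_HZ) -> List[float]:
    """Drop everything past the duration limit; shorter clips are returned unchanged."""
    return waveform[: max_audio_samples(max_audio_duration, sample_rate)]


def padded_length(longest: int, seq_length: int, pad_to_max_length: bool,
                  pad_to_multiple_of: int) -> int:
    """Target length for a non-packed batch whose longest sample has `longest` tokens."""
    if pad_to_max_length:
        return seq_length
    rounded = math.ceil(longest / pad_to_multiple_of) * pad_to_multiple_of
    # Capping at seq_length is an assumption of this sketch.
    return min(rounded, seq_length)
```

With the defaults shown in the signature, a 30-second limit at the assumed rate keeps 480000 samples, and a batch whose longest sample has 300 tokens would be padded to 384 when pad_to_multiple_of is 128.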

Initialization

static _decode_video_bytes( video_bytes: bytes, nframes: int = 8, fps: float = 1.0, )#: Decode raw MP4 bytes to a list of PIL frames.

_patchify_frame( pil_img, target_h: int = 512, target_w: int = 512, ) → torch.Tensor#

Convert a PIL image to [num_patches, CPP] patches (normalized).

Matches the HF processor’s normalization (CLIP mean/std).
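The output shape can be checked with a short sketch, assuming that CPP denotes channels × patch height × patch width and that the patch size equals the constructor's patch_dim (16 by default). The mean and standard deviation below are the widely used OpenAI CLIP constants; that these exact values are the ones applied here is an assumption, and `patch_grid` and `normalize_pixel` are hypothetical helper names.

```python
from typing import Sequence, Tuple

# Widely used OpenAI CLIP normalization constants (assumed to be the ones meant).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def patch_grid(target_h: int = 512, target_w: int = 512,
               patch_dim: int = 16, channels: int = 3) -> Tuple[int, int]:
    """Return (num_patches, values_per_patch) for a non-overlapping patch grid."""
    if target_h % patch_dim or target_w % patch_dim:
        raise ValueError("image size must be divisible by patch_dim")
    num_patches = (target_h // patch_dim) * (target_w // patch_dim)
    return num_patches, channels * patch_dim * patch_dim


def normalize_pixel(rgb: Sequence[int]) -> Tuple[float, ...]:
    """Scale 0-255 RGB values to [0, 1], then apply per-channel mean/std normalization."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))


shape = patch_grid()
```

Under these assumptions a 512 × 512 frame yields a [1024, 768] tensor: a 32 × 32 grid of patches, each flattened to 3 × 16 × 16 values.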

property _tokenizer#

property _pad_token_id: int#

property _eos_token_id: int#

property _sound_token_id: int#

encode_sample( sample: megatron.bridge.data.energon.task_encoder_utils.ChatMLSample, ) → bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskSample#: Encode a single ChatML sample with optional audio into model-ready tensors.

batch( samples: List[bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskSample], ) → bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskBatch#: Pad and collate samples (the default), or pack them along the sequence dimension when enable_in_batch_packing=True. Packing emits MCore packed-sequence metadata so that TE’s THD kernels handle cross-sample masking without an attention mask.
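The packed-sequence metadata can be sketched as cumulative sequence lengths over the samples in a batch. This is an illustration under assumptions, not the method's actual code: the helper names are hypothetical, plain lists stand in for tensors, and taking the maximum over padded lengths is a choice made for this sketch.

```python
from itertools import accumulate
from typing import Dict, List, Union


def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


def packed_metadata(sample_lengths: List[int],
                    pad_to_multiple_of: int = 1) -> Dict[str, Union[List[int], int]]:
    """Cumulative boundaries for samples packed back to back along the sequence dim."""
    padded = [round_up(n, pad_to_multiple_of) for n in sample_lengths]
    return {
        # Boundaries of the real tokens of each sample.
        "cu_seqlens": [0] + list(accumulate(sample_lengths)),
        # Boundaries once each sample is padded to the requested multiple.
        "cu_seqlens_padded": [0] + list(accumulate(padded)),
        # Longest (padded) sample; using padded lengths is an assumption here.
        "max_seqlen": max(padded),
    }


meta = packed_metadata([5, 3, 6], pad_to_multiple_of=4)
```

Because each attention call is restricted to the span between two consecutive boundaries, tokens from different samples never attend to one another, which is why no explicit attention mask is needed in the packed path.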

encode_batch( batch: bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskBatch, ) → dict#: Convert batch to dict for the training step.
