bridge.data.energon.nemotron_omni_task_encoder#

Nemotron Omni Energon task encoder – extends HFEncoderVLMTaskEncoder with audio.

Adds mel spectrogram extraction and <so_embedding> token insertion so that the training step receives sound_clips / sound_length alongside the standard vision + language tensors.

Module Contents#

Classes#

NemotronOmniTaskSample

Encoded sample for Nemotron Omni (vision + audio + language).

NemotronOmniTaskBatch

Batched format for Nemotron Omni.

NemotronOmniTaskEncoder

Energon task encoder for Nemotron Omni models.

Data#

logger

API#

bridge.data.energon.nemotron_omni_task_encoder.logger#

‘getLogger(…)’

class bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskSample#

Encoded sample for Nemotron Omni (vision + audio + language).

__key__: str#

None

__subflavors__: Dict#

None

input_ids: torch.Tensor#

None

labels: torch.Tensor#

None

loss_mask: torch.Tensor#

None

visual_tensors: Dict[str, torch.Tensor]#

‘field(…)’

num_patches: Optional[torch.Tensor]#

None

sound_clips: Optional[torch.Tensor]#

None

sound_length: Optional[torch.Tensor]#

None

imgs_sizes: Optional[torch.Tensor]#

None

num_frames: Optional[torch.Tensor]#

None

num_image_tiles: Optional[torch.Tensor]#

None

class bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskBatch#

Bases: megatron.energon.Batch

Batched format for Nemotron Omni.

__keys__: List[str]#

‘field(…)’

__subflavors__: List[Dict]#

‘field(…)’

input_ids: torch.Tensor#

‘field(…)’

labels: torch.Tensor#

‘field(…)’

loss_mask: torch.Tensor#

‘field(…)’

attention_mask: Optional[torch.Tensor]#

‘field(…)’

position_ids: torch.Tensor#

‘field(…)’

visual_tensors: Dict[str, Optional[torch.Tensor]]#

‘field(…)’

num_patches: Optional[torch.Tensor]#

None

sound_clips: Optional[torch.Tensor]#

None

sound_length: Optional[torch.Tensor]#

None

imgs_sizes: Optional[torch.Tensor]#

None

num_frames: Optional[torch.Tensor]#

None

num_image_tiles: Optional[torch.Tensor]#

None

cu_seqlens: Optional[torch.Tensor]#

None

cu_seqlens_unpadded: Optional[torch.Tensor]#

None

cu_seqlens_argmin: Optional[torch.Tensor]#

None

max_seqlen: Optional[torch.Tensor]#

None

class bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskEncoder(
processor,
seq_length: int = 4096,
max_audio_duration: float = 30.0,
num_mel_bins: int = 128,
visual_keys: Sequence[str] = ('pixel_values',),
temporal_patch_size: int = 2,
video_fps: float = 1.0,
video_nframes: int = 8,
use_temporal_video_embedder: bool = False,
patch_dim: int = 16,
pack_sequences: bool = False,
)#

Bases: megatron.energon.DefaultTaskEncoder[megatron.bridge.data.energon.task_encoder_utils.ChatMLSample, bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskSample, bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskBatch, dict]

Energon task encoder for Nemotron Omni models.

Processes ChatML samples that may contain images, videos, AND audio waveforms (decoded from .wav fields in the WebDataset shards).

Audio waveforms are converted to mel spectrograms using compute_mel_features, and <so_embedding> placeholder tokens are inserted into input_ids so that LLaVAModel.forward() can replace them with the projected sound embeddings.

Parameters:
  • processor – HF AutoProcessor for the Nemotron Omni model.

  • seq_length – Maximum sequence length after tokenization.

  • max_audio_duration – Maximum audio duration in seconds. Longer clips are truncated.

  • num_mel_bins – Number of mel frequency bins (must match the sound encoder config, typically 128 for Parakeet).

  • visual_keys – Processor output keys to capture as visual tensors.
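To make the max_audio_duration behaviour concrete, the following is a minimal sketch of capping a waveform at a fixed duration. The 16 kHz sample rate and the helper name truncate_waveform are assumptions made for illustration; this reference does not state the sample rate or show the encoder's actual truncation code.

```python
from typing import List

ASSUMED_SAMPLE_RATE = 16_000  # assumption: not stated in this reference


def truncate_waveform(
    waveform: List[float],
    max_audio_duration: float = 30.0,
    sample_rate: int = ASSUMED_SAMPLE_RATE,
) -> List[float]:
    """Keep at most max_audio_duration seconds of audio (hypothetical helper)."""
    max_samples = int(max_audio_duration * sample_rate)
    return waveform[:max_samples]


# 35 seconds of silence is cut down to the 30-second cap.
clip = [0.0] * (35 * ASSUMED_SAMPLE_RATE)
truncated = truncate_waveform(clip)
print(len(truncated) / ASSUMED_SAMPLE_RATE)  # duration in seconds after truncation
```

Clips shorter than the cap pass through unchanged, so only over-long audio loses samples.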

Initialization

static _decode_video_bytes(
video_bytes: bytes,
nframes: int = 8,
fps: float = 1.0,
)#

Decode raw MP4 bytes to a list of PIL frames.

_patchify_frame(
pil_img,
target_h: int = 512,
target_w: int = 512,
) → torch.Tensor#

Convert a PIL image to [num_patches, C*P*P] patches (normalized).

Matches the HF processor’s normalization (CLIP mean/std).
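As a rough illustration of the output shape, the sketch below computes the patch grid for the default 512 x 512 target and the default patch_dim of 16. Reading the second dimension as channels x patch height x patch width is an assumption, as are the helper names patch_grid_shape and normalize_pixel. The mean/std constants are the widely used OpenAI CLIP values; whether the encoder uses exactly these is not confirmed by this reference.

```python
from typing import Tuple

# Commonly used OpenAI CLIP normalization constants (assumed to apply here).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def patch_grid_shape(
    target_h: int = 512,
    target_w: int = 512,
    patch_dim: int = 16,
    channels: int = 3,
) -> Tuple[int, int]:
    """Return (num_patches, features_per_patch) for a non-overlapping patch grid."""
    if target_h % patch_dim or target_w % patch_dim:
        raise ValueError("target size must be divisible by patch_dim")
    num_patches = (target_h // patch_dim) * (target_w // patch_dim)
    features_per_patch = channels * patch_dim * patch_dim
    return num_patches, features_per_patch


def normalize_pixel(rgb: Tuple[int, int, int]) -> Tuple[float, ...]:
    """Scale 0-255 RGB values to [0, 1], then apply per-channel mean/std."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))


shape = patch_grid_shape()
print(shape)
```

With the defaults, a 512 x 512 RGB frame yields a 32 x 32 grid of patches, each flattened to 3 * 16 * 16 values.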

property _tokenizer#

property _pad_token_id: int#

property _eos_token_id: int#

property _sound_token_id: int#

encode_sample(
sample: megatron.bridge.data.energon.task_encoder_utils.ChatMLSample,
) → bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskSample#

Encode a single ChatML sample with optional audio into model-ready tensors.
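The placeholder mechanism described for this class can be sketched as follows: a single sound marker in the token stream is expanded into one placeholder per sound embedding, and those positions are excluded from the loss. This is only an illustration under stated assumptions. The token id 32000, the helper name expand_sound_placeholder, and the one-marker-to-N expansion strategy are hypothetical; the real id comes from the tokenizer (see _sound_token_id), and the encoder's actual insertion logic is not shown in this reference.

```python
from typing import List, Tuple

SOUND_TOKEN_ID = 32000  # hypothetical id; the real one comes from the tokenizer


def expand_sound_placeholder(
    input_ids: List[int],
    loss_mask: List[int],
    sound_token_id: int,
    num_sound_embeddings: int,
) -> Tuple[List[int], List[int]]:
    """Expand each sound marker into num_sound_embeddings placeholder tokens.

    Placeholder positions get loss_mask 0 so they never contribute to the loss.
    """
    new_ids: List[int] = []
    new_mask: List[int] = []
    for tok, mask in zip(input_ids, loss_mask):
        if tok == sound_token_id:
            new_ids.extend([sound_token_id] * num_sound_embeddings)
            new_mask.extend([0] * num_sound_embeddings)
        else:
            new_ids.append(tok)
            new_mask.append(mask)
    return new_ids, new_mask


ids = [1, 5, SOUND_TOKEN_ID, 7, 2]
mask = [0, 0, 0, 1, 1]
expanded_ids, expanded_mask = expand_sound_placeholder(ids, mask, SOUND_TOKEN_ID, 3)
print(expanded_ids)
print(expanded_mask)
```

Keeping the placeholder count equal to the number of projected sound embeddings is what would let a model forward pass substitute them position by position.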

batch(
samples: List[bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskSample],
) → bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskBatch#

Pads and collates samples by default, or packs them along the sequence dimension when pack_sequences=True. Packing emits cu_seqlens, cu_seqlens_unpadded, and max_seqlen so that Transformer Engine's THD kernels can handle cross-sample masking (and context-parallel partitioning via thd_get_partitioned_indices) without an attention mask.
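The packed layout can be illustrated with a small sketch that concatenates samples and records cumulative sequence boundaries. It assumes that cu_seqlens holds cumulative padded lengths and cu_seqlens_unpadded holds cumulative original lengths; the helper name pack_samples and the pad_multiple parameter are hypothetical, and the encoder's real padding rule is not documented here.

```python
from itertools import accumulate
from typing import List, Tuple


def pack_samples(
    samples: List[List[int]],
    pad_token_id: int = 0,
    pad_multiple: int = 1,
) -> Tuple[List[int], List[int], List[int], int]:
    """Concatenate samples into one sequence and record their boundaries."""
    packed: List[int] = []
    padded_lens: List[int] = []
    unpadded_lens: List[int] = []
    for ids in samples:
        # Round each sample's length up to the next multiple of pad_multiple.
        padded_len = -(-len(ids) // pad_multiple) * pad_multiple
        packed.extend(ids + [pad_token_id] * (padded_len - len(ids)))
        padded_lens.append(padded_len)
        unpadded_lens.append(len(ids))
    cu_seqlens = [0] + list(accumulate(padded_lens))
    cu_seqlens_unpadded = [0] + list(accumulate(unpadded_lens))
    max_seqlen = max(padded_lens)
    return packed, cu_seqlens, cu_seqlens_unpadded, max_seqlen


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
packed, cu_seqlens, cu_seqlens_unpadded, max_seqlen = pack_samples(
    samples, pad_token_id=0, pad_multiple=4
)
print(cu_seqlens, cu_seqlens_unpadded, max_seqlen)
```

Because each boundary is explicit, a variable-length attention kernel can restrict attention to tokens within the same sample, which is why no dense attention mask is needed in this mode.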

encode_batch(
batch: bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskBatch,
) → dict#

Convert batch to dict for the training step.