bridge.data.energon.nemotron_omni_task_encoder#
Nemotron Omni Energon task encoder – extends HFEncoderVLMTaskEncoder with audio.
Adds mel spectrogram extraction and <so_embedding> token insertion so that
the training step receives sound_clips / sound_length alongside the
standard vision + language tensors.
Module Contents#
Classes#
NemotronOmniTaskSample | Encoded sample for Nemotron Omni (vision + audio + language).
NemotronOmniTaskBatch | Batched format for Nemotron Omni.
NemotronOmniTaskEncoder | Energon task encoder for Nemotron Omni models.
Data#
logger
API#
- bridge.data.energon.nemotron_omni_task_encoder.logger#
‘getLogger(…)’
- class bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskSample#
Encoded sample for Nemotron Omni (vision + audio + language).
- __key__: str#
None
- __subflavors__: Dict#
None
- input_ids: torch.Tensor#
None
- labels: torch.Tensor#
None
- loss_mask: torch.Tensor#
None
- visual_tensors: Dict[str, torch.Tensor]#
‘field(…)’
- num_patches: Optional[torch.Tensor]#
None
- sound_clips: Optional[torch.Tensor]#
None
- sound_length: Optional[torch.Tensor]#
None
- imgs_sizes: Optional[torch.Tensor]#
None
- num_frames: Optional[torch.Tensor]#
None
- num_image_tiles: Optional[torch.Tensor]#
None
- class bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskBatch#
Bases: megatron.energon.Batch
Batched format for Nemotron Omni.
- __keys__: List[str]#
‘field(…)’
- __subflavors__: List[Dict]#
‘field(…)’
- input_ids: torch.Tensor#
‘field(…)’
- labels: torch.Tensor#
‘field(…)’
- loss_mask: torch.Tensor#
‘field(…)’
- attention_mask: Optional[torch.Tensor]#
‘field(…)’
- position_ids: torch.Tensor#
‘field(…)’
- visual_tensors: Dict[str, Optional[torch.Tensor]]#
‘field(…)’
- num_patches: Optional[torch.Tensor]#
None
- sound_clips: Optional[torch.Tensor]#
None
- sound_length: Optional[torch.Tensor]#
None
- imgs_sizes: Optional[torch.Tensor]#
None
- num_frames: Optional[torch.Tensor]#
None
- num_image_tiles: Optional[torch.Tensor]#
None
- cu_seqlens: Optional[torch.Tensor]#
None
- cu_seqlens_unpadded: Optional[torch.Tensor]#
None
- cu_seqlens_argmin: Optional[torch.Tensor]#
None
- max_seqlen: Optional[torch.Tensor]#
None
- class bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskEncoder(
- processor,
- seq_length: int = 4096,
- max_audio_duration: float = 30.0,
- num_mel_bins: int = 128,
- visual_keys: Sequence[str] = ('pixel_values',),
- temporal_patch_size: int = 2,
- video_fps: float = 1.0,
- video_nframes: int = 8,
- use_temporal_video_embedder: bool = False,
- patch_dim: int = 16,
- pack_sequences: bool = False,
Bases: megatron.energon.DefaultTaskEncoder[megatron.bridge.data.energon.task_encoder_utils.ChatMLSample, bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskSample, bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskBatch, dict]
Energon task encoder for Nemotron Omni models.
Processes ChatML samples that may contain images, videos, AND audio waveforms (decoded from .wav fields in the WebDataset shards).
Audio waveforms are converted to mel spectrograms using compute_mel_features, and <so_embedding> placeholder tokens are inserted into input_ids so that LLaVAModel.forward() can replace them with the projected sound embeddings.
- Parameters:
processor – HF AutoProcessor for the Nemotron Omni model.
seq_length – Maximum sequence length after tokenization.
max_audio_duration – Maximum audio duration in seconds. Longer clips are truncated.
num_mel_bins – Number of mel frequency bins (must match the sound encoder config, typically 128 for Parakeet).
visual_keys – Processor output keys to capture as visual tensors.
Initialization
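The audio handling described above (truncating long clips and inserting <so_embedding> placeholders) can be illustrated with a minimal, library-free sketch. It is not the encoder's actual code: the helper names, the 16 kHz sample rate, and the idea that one placeholder is expanded into `num_sound_embeddings` copies are all assumptions made for illustration.

```python
from typing import List

SAMPLE_RATE = 16_000  # assumed sampling rate in Hz; not stated in this reference


def truncate_waveform(waveform: List[float],
                      max_audio_duration: float = 30.0,
                      sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Keep at most max_audio_duration seconds of audio samples."""
    max_samples = int(max_audio_duration * sample_rate)
    return waveform[:max_samples]


def expand_sound_placeholders(input_ids: List[int],
                              sound_token_id: int,
                              num_sound_embeddings: int) -> List[int]:
    """Replace each sound placeholder with num_sound_embeddings copies of itself.

    Hypothetical helper: the model can then overwrite one position per
    projected sound embedding.
    """
    expanded: List[int] = []
    for token in input_ids:
        if token == sound_token_id:
            expanded.extend([sound_token_id] * num_sound_embeddings)
        else:
            expanded.append(token)
    return expanded
```

For example, `expand_sound_placeholders([1, 42, 99, 2], sound_token_id=99, num_sound_embeddings=3)` gives `[1, 42, 99, 99, 99, 2]`. In the real encoder the placeholder id would come from the `_sound_token_id` property.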
- static _decode_video_bytes(
- video_bytes: bytes,
- nframes: int = 8,
- fps: float = 1.0,
Decode raw MP4 bytes to a list of PIL frames.
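How `nframes` and `fps` interact is not documented here. One plausible reading, sketched below purely as an assumption, is that frames are picked at roughly `fps` frames per second and capped at `nframes`. The function name and the `total_frames` / `native_fps` inputs are hypothetical, and actual decoding of the MP4 bytes is left out.

```python
from typing import List


def sample_frame_indices(total_frames: int,
                         native_fps: float,
                         nframes: int = 8,
                         fps: float = 1.0) -> List[int]:
    """Pick frame indices at roughly `fps` frames per second, capped at `nframes`."""
    if total_frames <= 0:
        return []
    # Number of source frames between two consecutive picks (at least 1).
    step = max(int(round(native_fps / fps)), 1)
    candidates = list(range(0, total_frames, step))
    if len(candidates) <= nframes:
        return candidates
    # Too many candidates: spread nframes picks evenly across them.
    stride = len(candidates) / nframes
    return [candidates[int(i * stride)] for i in range(nframes)]
```

Under this reading, a 3-second clip at 30 fps yields three indices (`[0, 30, 60]`), while a 20-second clip is thinned to eight evenly spaced frames.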
- _patchify_frame(
- pil_img,
- target_h: int = 512,
- target_w: int = 512,
Convert a PIL image to [num_patches, C*P*P] patches (normalized).
Matches the HF processor’s normalization (CLIP mean/std).
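The patch layout can be illustrated with a small, dependency-free sketch. It assumes non-overlapping square patches, reads the second dimension of the output as channels × patch height × patch width, and flattens each patch channel-first; the function name and the nested-list image format are hypothetical, and the real method works on a PIL image and tensors. The constants are the standard CLIP mean/std values.

```python
from typing import List

# Standard CLIP normalization constants, one per RGB channel.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

Image = List[List[List[float]]]  # [C][H][W], values already scaled to [0, 1]


def patchify(image: Image, patch_dim: int = 16) -> List[List[float]]:
    """Normalize per channel, then cut into non-overlapping square patches.

    Returns num_patches rows, each of length C * patch_dim * patch_dim.
    """
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    assert height % patch_dim == 0 and width % patch_dim == 0
    patches: List[List[float]] = []
    for top in range(0, height, patch_dim):
        for left in range(0, width, patch_dim):
            row: List[float] = []
            for c in range(channels):
                for y in range(top, top + patch_dim):
                    for x in range(left, left + patch_dim):
                        row.append((image[c][y][x] - CLIP_MEAN[c]) / CLIP_STD[c])
            patches.append(row)
    return patches
```

With the default 512 × 512 target and a patch size of 16, this layout would give 32 × 32 = 1024 patches, each of length 3 × 16 × 16 = 768.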
- property _tokenizer#
- property _pad_token_id: int#
- property _eos_token_id: int#
- property _sound_token_id: int#
- encode_sample(
- sample: megatron.bridge.data.energon.task_encoder_utils.ChatMLSample,
Encode a single ChatML sample with optional audio into model-ready tensors.
- batch( ) bridge.data.energon.nemotron_omni_task_encoder.NemotronOmniTaskBatch#
Pad-and-collate (default) or pack samples along the sequence dimension when pack_sequences=True. Packing emits cu_seqlens / cu_seqlens_unpadded / max_seqlen so TE’s THD kernels handle cross-sample masking (and CP partitioning via thd_get_partitioned_indices) without an attention mask.
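To make the packing metadata concrete, here is a minimal sketch of how cumulative sequence lengths are typically derived from per-sample lengths. The function name and the `pad_multiple` parameter are hypothetical, as is the interpretation that `cu_seqlens` counts padded lengths while `cu_seqlens_unpadded` counts real tokens; the encoder's actual padding rule is not specified in this reference.

```python
from typing import List, Tuple


def build_cu_seqlens(lengths: List[int],
                     pad_multiple: int = 1) -> Tuple[List[int], List[int], int]:
    """Return (cu_seqlens, cu_seqlens_unpadded, max_seqlen) for packed samples.

    pad_multiple is a hypothetical knob: each sample is padded up to a
    multiple of it before being concatenated along the sequence dimension.
    """
    cu_padded, cu_unpadded = [0], [0]
    max_seqlen = 0
    for length in lengths:
        padded = -(-length // pad_multiple) * pad_multiple  # round up to the multiple
        cu_padded.append(cu_padded[-1] + padded)
        cu_unpadded.append(cu_unpadded[-1] + length)
        max_seqlen = max(max_seqlen, padded)
    return cu_padded, cu_unpadded, max_seqlen
```

For samples of length 5, 3, and 7 padded to multiples of 4, this gives `cu_seqlens = [0, 8, 12, 20]`, `cu_seqlens_unpadded = [0, 5, 8, 15]`, and `max_seqlen = 8`. Consecutive entries mark where one sample ends and the next begins, which is what lets attention kernels keep samples separate without an explicit mask.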
- encode_batch( ) dict#
Convert batch to dict for the training step.