bridge.training.nemotron_omni_step
Nemotron Omni training step – extends llava_step with sound support.
Adds sound_clips and sound_length to the model forward kwargs so that
LLaVAModel processes audio embeddings alongside vision embeddings.
Module Contents
Functions
| get_batch_from_iterator | Get a batch of data from the iterator, including optional sound tensors. |
| _resolve_images | Extract images from either raw pixel_values or GenericVisualInputs container. |
| _build_vision_packed_seq_params | Build vision PackedSeqParams from per-frame (H, W). |
| get_batch | Generate a batch with vision and sound tensors. |
| forward_step | Forward training step for Nemotron Omni (vision + audio + language). |
Data
API
- bridge.training.nemotron_omni_step.logger
‘getLogger(…)’
- bridge.training.nemotron_omni_step.get_batch_from_iterator(
- data_iterator: Iterable,
- skip_getting_attention_mask_from_dataset: bool = True,
- *,
- is_first_pp_stage: bool,
- is_last_pp_stage: bool,
Get a batch of data from the iterator, including optional sound tensors.
Handles two batch formats:
HF collate path: raw pixel_values, num_patches keys
Energon path: visual_inputs (GenericVisualInputs container)
Both carry sound_clips / sound_length when audio is present.
- bridge.training.nemotron_omni_step._resolve_images(batch: dict) → torch.Tensor | None
Extract images from either raw pixel_values or GenericVisualInputs container.
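The two-path lookup described above can be sketched roughly as follows. This is an illustration only, not the module's implementation: `resolve_images_sketch` and `FakeVisualInputs` are hypothetical names, and the assumption that the container exposes a `pixel_values` attribute is not confirmed by this page. Plain lists stand in for tensors so the sketch runs without torch.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeVisualInputs:
    """Hypothetical stand-in for a GenericVisualInputs container."""

    pixel_values: Optional[Any] = None


def resolve_images_sketch(batch: dict) -> Optional[Any]:
    """Return image data from either batch layout, or None if there is none."""
    # HF collate path: pixel_values sits directly in the batch dict.
    if batch.get("pixel_values") is not None:
        return batch["pixel_values"]
    # Energon path: images are wrapped in a container under visual_inputs.
    container = batch.get("visual_inputs")
    if container is not None:
        # Assumption: the container exposes a pixel_values attribute.
        return getattr(container, "pixel_values", None)
    # Text-only (or audio-only) batch: no images to resolve.
    return None
```

Returning None for batches without images matches the `torch.Tensor | None` return annotation, so callers can skip the vision path when nothing is found.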
- bridge.training.nemotron_omni_step._VISION_PATCH_DIM
16
- bridge.training.nemotron_omni_step._build_vision_packed_seq_params(
- imgs_sizes: torch.Tensor | None,
Build vision PackedSeqParams from per-frame (H, W).
RADIO's dynamic-resolution + class-token path reads packed_seq_params.cu_seqlens_q to insert class tokens at per-image boundaries. We build cu_seqlens from the pre-grouping imgs_sizes (one entry per frame); _apply_temporal_grouping rebuilds it after tubelet fusion.
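As a rough illustration of how cumulative sequence lengths could be derived from per-frame sizes, the sketch below assumes each (H, W) frame contributes (H // 16) * (W // 16) patch tokens, with 16 mirroring `_VISION_PATCH_DIM`. That token count, the helper name `cu_seqlens_from_sizes`, and the use of plain lists instead of tensors are all assumptions; the real function returns a PackedSeqParams object and may count tokens differently (for example, around class tokens).

```python
from itertools import accumulate

PATCH_DIM = 16  # mirrors _VISION_PATCH_DIM from this module


def cu_seqlens_from_sizes(imgs_sizes):
    """Return cumulative patch counts, starting at 0, one boundary per frame."""
    # Assumption: a frame of size (H, W) yields (H // 16) * (W // 16) patch tokens.
    lengths = [(h // PATCH_DIM) * (w // PATCH_DIM) for h, w in imgs_sizes]
    # Prefix sums mark where each frame's tokens start and end in the packed sequence.
    return [0] + list(accumulate(lengths))


# Two frames: 32x32 gives 4 patches, 64x32 gives 8 patches.
print(cu_seqlens_from_sizes([(32, 32), (64, 32)]))  # [0, 4, 12]
```

Each adjacent pair of entries delimits one frame, which is the kind of boundary information a per-image class-token insertion needs.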
- bridge.training.nemotron_omni_step.get_batch(
- data_iterator: Iterable,
- cfg: megatron.bridge.training.config.ConfigContainer,
- *,
- pg_collection,
Generate a batch with vision and sound tensors.
- bridge.training.nemotron_omni_step.forward_step(
- state: megatron.bridge.training.state.GlobalState,
- data_iterator: Iterable,
- model: megatron.core.models.gpt.GPTModel,
- return_schedule_plan: bool = False,
Forward training step for Nemotron Omni (vision + audio + language).
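The module description says this step adds sound_clips and sound_length to the model forward kwargs. A minimal sketch of that idea is shown below; `add_sound_kwargs` is a hypothetical helper, and the choice to omit both keys when no audio is present is an assumption rather than documented behavior.

```python
def add_sound_kwargs(forward_kwargs: dict, batch: dict) -> dict:
    """Return a copy of forward_kwargs extended with sound tensors, if any."""
    out = dict(forward_kwargs)  # copy so the caller's dict is left untouched
    # Assumption: audio is optional, so the keys are only added when present.
    if batch.get("sound_clips") is not None:
        out["sound_clips"] = batch["sound_clips"]
        out["sound_length"] = batch.get("sound_length")
    return out
```

With this shape, batches that carry only vision and text pass through unchanged, and the model receives the audio arguments only when the batch contains them.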