bridge.training.nemotron_omni_step#

Nemotron Omni training step – extends llava_step with sound support.

Adds sound_clips and sound_length to the model forward kwargs so that LLaVAModel processes audio embeddings alongside vision embeddings.

Module Contents#

Functions#

get_batch_from_iterator

Get a batch of data from the iterator, including optional sound tensors.

_resolve_images

Extract images from either raw pixel_values or GenericVisualInputs container.

_build_vision_packed_seq_params

Build vision PackedSeqParams from per-frame (H, W).

get_batch

Generate a batch with vision and sound tensors.

forward_step

Forward training step for Nemotron Omni (vision + audio + language).

Data#

API#

bridge.training.nemotron_omni_step.logger#

‘getLogger(…)’

bridge.training.nemotron_omni_step.get_batch_from_iterator(
data_iterator: Iterable,
skip_getting_attention_mask_from_dataset: bool = True,
*,
is_first_pp_stage: bool,
is_last_pp_stage: bool,
) → dict[str, torch.Tensor]#

Get a batch of data from the iterator, including optional sound tensors.

Handles two batch formats:

  • HF collate path: raw pixel_values, num_patches keys

  • Energon path: visual_inputs (GenericVisualInputs container)

Both carry sound_clips / sound_length when audio is present.

bridge.training.nemotron_omni_step._resolve_images(batch: dict) → torch.Tensor | None#

Extract images from either raw pixel_values or GenericVisualInputs container.
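The dispatch between the two batch formats can be sketched as follows. This is an illustration only, not the module's implementation: `FakeVisualInputs` and its `pixel_values` attribute are hypothetical stand-ins for the GenericVisualInputs container, whose real fields are not documented on this page, and plain lists stand in for tensors.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeVisualInputs:
    """Hypothetical stand-in for GenericVisualInputs (real fields may differ)."""

    pixel_values: Optional[Any] = None


def resolve_images_sketch(batch: dict) -> Optional[Any]:
    """Return image data from either batch format, or None if there is none."""
    # HF collate path: images are stored directly under "pixel_values".
    if batch.get("pixel_values") is not None:
        return batch["pixel_values"]
    # Energon path: images are wrapped in a container under "visual_inputs".
    container = batch.get("visual_inputs")
    if container is not None:
        return getattr(container, "pixel_values", None)
    # Neither key is present, e.g. a text-only or audio-only batch.
    return None
```

Returning None rather than raising keeps batches without images valid, which matches the `torch.Tensor | None` return annotation.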

bridge.training.nemotron_omni_step._VISION_PATCH_DIM#

16

bridge.training.nemotron_omni_step._build_vision_packed_seq_params(
imgs_sizes: torch.Tensor | None,
) → megatron.core.packed_seq_params.PackedSeqParams | None#

Build vision PackedSeqParams from per-frame (H, W).

RADIO’s dynamic-resolution + class-token path reads packed_seq_params.cu_seqlens_q to insert class tokens at per-image boundaries. We build cu_seqlens from the pre-grouping imgs_sizes (one entry per frame); _apply_temporal_grouping rebuilds it after tubelet fusion.
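The cumulative-length bookkeeping behind cu_seqlens can be sketched in plain Python. The sketch rests on assumptions not confirmed by this page: that imgs_sizes holds pixel heights and widths, that each frame contributes (H // 16) × (W // 16) patch tokens, and that class tokens are not counted. The real function returns a PackedSeqParams built from tensors; `cu_seqlens_sketch` is a hypothetical helper name.

```python
from itertools import accumulate

PATCH_DIM = 16  # mirrors the value of _VISION_PATCH_DIM


def cu_seqlens_sketch(imgs_sizes):
    """Cumulative patch counts for a list of per-frame (H, W) sizes."""
    # Assumption: each frame yields (H // P) * (W // P) patch tokens.
    patches = [(h // PATCH_DIM) * (w // PATCH_DIM) for h, w in imgs_sizes]
    # cu_seqlens starts at 0 and has one more entry than there are frames,
    # so frame i occupies the half-open range [cu[i], cu[i + 1]).
    return [0, *accumulate(patches)]


print(cu_seqlens_sketch([(32, 32), (64, 16)]))  # [0, 4, 8]
```

Each boundary in the result marks where one frame's tokens end and the next begin, which is the information a per-image class-token insertion needs.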

bridge.training.nemotron_omni_step.get_batch(
data_iterator: Iterable,
cfg: megatron.bridge.training.config.ConfigContainer,
*,
pg_collection,
) → tuple#

Generate a batch with vision and sound tensors.

bridge.training.nemotron_omni_step.forward_step(
state: megatron.bridge.training.state.GlobalState,
data_iterator: Iterable,
model: megatron.core.models.gpt.GPTModel,
return_schedule_plan: bool = False,
) → tuple[torch.Tensor, functools.partial]#

Forward training step for Nemotron Omni (vision + audio + language).
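The return type pairs the model output with a functools.partial that the caller later invokes to compute the loss. A minimal sketch of that pattern, including the optional sound kwargs, is given below. It is not the module's code: the function names, the `input_ids` and `loss_mask` batch keys, and the loss function are all hypothetical, and lists stand in for tensors. The real forward_step takes `state`, `data_iterator`, and `model` as shown in the signature.

```python
from functools import partial


def masked_mean_loss(loss_mask, per_token_loss):
    """Hypothetical loss: mean of per-token losses where the mask is set."""
    kept = [loss for loss, m in zip(per_token_loss, loss_mask) if m]
    return sum(kept) / max(len(kept), 1)


def forward_step_sketch(batch, model):
    """Call the model, passing sound inputs only when the batch has them."""
    forward_kwargs = {"input_ids": batch["input_ids"]}
    # Sound tensors are optional, so add them only when present.
    if batch.get("sound_clips") is not None:
        forward_kwargs["sound_clips"] = batch["sound_clips"]
        forward_kwargs["sound_length"] = batch["sound_length"]
    output = model(**forward_kwargs)
    # Bind the loss mask now; the caller supplies the model output later.
    return output, partial(masked_mean_loss, batch["loss_mask"])
```

Binding the mask with partial lets the training loop compute the loss without needing access to the batch itself.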