nemo_automodel.components.flow_matching.adapters.hunyuan#

HunyuanVideo model adapter for FlowMatching Pipeline.

This adapter supports HunyuanVideo 1.5 style models with dual text encoders and image embeddings for image-to-video conditioning.

Module Contents#

Classes#

HunyuanAdapter

Model adapter for HunyuanVideo 1.5 style models.

API#

class nemo_automodel.components.flow_matching.adapters.hunyuan.HunyuanAdapter(
default_image_embed_shape: Tuple[int, int] = (729, 1152),
use_condition_latents: bool = True,
)#

Bases: nemo_automodel.components.flow_matching.adapters.base.ModelAdapter

Model adapter for HunyuanVideo 1.5 style models.

These models use:

  • Condition latents concatenated with noisy latents

  • Dual text encoders with attention masks

  • Image embeddings for i2v

Expected batch keys:

  • text_embeddings: Primary text encoder output [B, seq_len, dim]

  • text_mask: Attention mask for primary encoder [B, seq_len] (optional)

  • text_embeddings_2: Secondary text encoder output [B, seq_len, dim] (optional)

  • text_mask_2: Attention mask for secondary encoder [B, seq_len] (optional)

  • image_embeds: Image embeddings for i2v [B, seq_len, dim] (optional)
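A minimal batch illustrating these keys might look like the following sketch. The embedding dimensions used here (4096 for the primary encoder, 768 for the secondary) are illustrative placeholders, not values mandated by the adapter; only the (729, 1152) image-embedding shape comes from the adapter's documented default:

```python
import torch

B, seq_len = 2, 128
batch = {
    # Required: primary text encoder output [B, seq_len, dim]
    "text_embeddings": torch.randn(B, seq_len, 4096),
    # Optional: attention mask for the primary encoder [B, seq_len]
    "text_mask": torch.ones(B, seq_len, dtype=torch.bool),
    # Optional: secondary text encoder output and mask
    "text_embeddings_2": torch.randn(B, 77, 768),
    "text_mask_2": torch.ones(B, 77, dtype=torch.bool),
    # Optional: image embeddings for i2v conditioning
    "image_embeds": torch.randn(B, 729, 1152),
}
```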

Example:

adapter = HunyuanAdapter()
pipeline = FlowMatchingPipelineV2(model_adapter=adapter)

Initialization

Initialize the HunyuanAdapter.

Parameters:
  • default_image_embed_shape – Default shape for image embeddings (seq_len, dim) when not provided in batch. Defaults to (729, 1152).

  • use_condition_latents – Whether to concatenate condition latents with noisy latents. Defaults to True.

get_condition_latents(
latents: torch.Tensor,
task_type: str,
) -> torch.Tensor#

Generate conditional latents based on task type.

Parameters:
  • latents – Input latents [B, C, F, H, W]

  • task_type – Task type ("t2v" or "i2v")

Returns:

Conditional latents [B, C+1, F, H, W]
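The shape contract above can be sketched as follows. The actual conditioning logic is internal to the adapter; the mask construction below (appending a single binary channel that marks the conditioned first frame for i2v) is an assumption for illustration only:

```python
import torch

def sketch_condition_latents(latents: torch.Tensor, task_type: str) -> torch.Tensor:
    """Illustrative stand-in: append one conditioning-mask channel."""
    B, C, F, H, W = latents.shape
    mask = torch.zeros(B, 1, F, H, W, dtype=latents.dtype)
    if task_type == "i2v":
        mask[:, :, 0] = 1.0  # assumed: first frame carries the image condition
    return torch.cat([latents, mask], dim=1)  # [B, C+1, F, H, W]

cond = sketch_condition_latents(torch.randn(2, 16, 9, 8, 8), "i2v")
print(cond.shape)  # torch.Size([2, 17, 9, 8, 8])
```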

prepare_inputs(
context: nemo_automodel.components.flow_matching.adapters.base.FlowMatchingContext,
) -> Dict[str, Any]#

Prepare inputs for HunyuanVideo model.

Parameters:

context – FlowMatchingContext with batch data

Returns:

  • latents: Noisy latents (optionally concatenated with condition latents)

  • timesteps: Timestep values

  • encoder_hidden_states: Primary text embeddings

  • encoder_attention_mask: Primary attention mask

  • encoder_hidden_states_2: Secondary text embeddings

  • encoder_attention_mask_2: Secondary attention mask

  • image_embeds: Image embeddings

Return type:

Dict[str, Any]
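The batch-to-kwargs mapping can be sketched as below. The zero-filled fallback for a missing image_embeds entry, sized by default_image_embed_shape, is an assumption inferred from the constructor parameter, not confirmed behavior:

```python
import torch

def sketch_prepare_inputs(batch, noisy_latents, timesteps,
                          default_image_embed_shape=(729, 1152)):
    """Illustrative mapping from batch keys to model inputs."""
    image_embeds = batch.get("image_embeds")
    if image_embeds is None:
        # Assumed fallback: zeros of the default (seq_len, dim) shape.
        seq_len, dim = default_image_embed_shape
        image_embeds = torch.zeros(noisy_latents.shape[0], seq_len, dim)
    return {
        "latents": noisy_latents,
        "timesteps": timesteps,
        "encoder_hidden_states": batch["text_embeddings"],
        "encoder_attention_mask": batch.get("text_mask"),
        "encoder_hidden_states_2": batch.get("text_embeddings_2"),
        "encoder_attention_mask_2": batch.get("text_mask_2"),
        "image_embeds": image_embeds,
    }

inputs = sketch_prepare_inputs(
    {"text_embeddings": torch.randn(2, 128, 4096)},
    noisy_latents=torch.randn(2, 17, 9, 8, 8),
    timesteps=torch.tensor([500, 500]),
)
```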

forward(
model: torch.nn.Module,
inputs: Dict[str, Any],
) -> torch.Tensor#

Execute forward pass for HunyuanVideo model.

Parameters:
  • model – HunyuanVideo model

  • inputs – Dictionary from prepare_inputs()

Returns:

Model prediction tensor