nemo_automodel.components.flow_matching.adapters.hunyuan#
HunyuanVideo model adapter for FlowMatching Pipeline.
This adapter supports HunyuanVideo 1.5-style models with dual text encoders and image embeddings for image-to-video (i2v) conditioning.
Module Contents#
Classes#
HunyuanAdapter – Model adapter for HunyuanVideo 1.5 style models.
API#
- class nemo_automodel.components.flow_matching.adapters.hunyuan.HunyuanAdapter(
- default_image_embed_shape: Tuple[int, int] = (729, 1152),
- use_condition_latents: bool = True,
Bases: nemo_automodel.components.flow_matching.adapters.base.ModelAdapter

Model adapter for HunyuanVideo 1.5 style models.
These models use:

- Condition latents concatenated with noisy latents
- Dual text encoders with attention masks
- Image embeddings for image-to-video (i2v) conditioning

Expected batch keys:

- text_embeddings: Primary text encoder output [B, seq_len, dim]
- text_mask: Attention mask for primary encoder [B, seq_len] (optional)
- text_embeddings_2: Secondary text encoder output [B, seq_len, dim] (optional)
- text_mask_2: Attention mask for secondary encoder [B, seq_len] (optional)
- image_embeds: Image embeddings for i2v [B, seq_len, dim] (optional)
Example

    adapter = HunyuanAdapter()
    pipeline = FlowMatchingPipelineV2(model_adapter=adapter)
Initialization
Initialize the HunyuanAdapter.
- Parameters:
default_image_embed_shape – Default shape for image embeddings (seq_len, dim) when not provided in the batch. Defaults to (729, 1152).
use_condition_latents – Whether to concatenate condition latents with noisy latents. Defaults to True.
- get_condition_latents(
- latents: torch.Tensor,
- task_type: str,
Generate conditional latents based on task type.
- Parameters:
latents – Input latents [B, C, F, H, W]
task_type – Task type ("t2v" or "i2v")
- Returns:
Conditional latents [B, C+1, F, H, W]
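The following is an illustrative sketch (not the library implementation) of how a condition-mask channel can turn [B, C, F, H, W] latents into the documented [B, C+1, F, H, W] output. The exact masking scheme, including the assumption that i2v conditions on the first frame, is hypothetical.

```python
import torch


def make_condition_latents(latents: torch.Tensor, task_type: str) -> torch.Tensor:
    """Sketch of get_condition_latents: append a condition-mask channel."""
    b, c, f, h, w = latents.shape
    # One extra channel marks which frames are conditioning frames.
    mask = torch.zeros(b, 1, f, h, w, dtype=latents.dtype, device=latents.device)
    cond = torch.zeros_like(latents)
    if task_type == "i2v":
        # Assumption: the first frame carries the conditioning image for i2v.
        mask[:, :, 0] = 1.0
        cond[:, :, 0] = latents[:, :, 0]
    elif task_type != "t2v":
        raise ValueError(f"unsupported task_type: {task_type!r}")
    # For t2v everything stays zero (pure text-conditioned generation).
    return torch.cat([cond, mask], dim=1)  # [B, C+1, F, H, W]
```

For t2v the result is all zeros, so concatenating it with the noisy latents adds no image information; for i2v only the first frame (and its mask) is populated.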
- prepare_inputs(context) → Dict[str, Any]#
Prepare inputs for HunyuanVideo model.
- Parameters:
context – FlowMatchingContext with batch data
- Returns:
A dictionary containing:
- latents: Noisy latents (optionally concatenated with condition latents)
- timesteps: Timestep values
- encoder_hidden_states: Primary text embeddings
- encoder_attention_mask: Primary attention mask
- encoder_hidden_states_2: Secondary text embeddings
- encoder_attention_mask_2: Secondary attention mask
- image_embeds: Image embeddings
- Return type:
Dict[str, Any]
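As a rough sketch of the shape of this dictionary (the helper name, the zero-padding fallback for missing image embeddings, and the plain-dict batch are assumptions for illustration; the real method consumes a FlowMatchingContext):

```python
import torch


def prepare_inputs_sketch(batch, noisy_latents, timesteps,
                          default_image_embed_shape=(729, 1152)):
    """Assemble model kwargs with the keys documented above."""
    b = noisy_latents.shape[0]
    seq_len, dim = default_image_embed_shape
    image_embeds = batch.get("image_embeds")
    if image_embeds is None:
        # Assumption: absent i2v conditioning is padded with zeros of the
        # default (seq_len, dim) shape.
        image_embeds = noisy_latents.new_zeros(b, seq_len, dim)
    return {
        "latents": noisy_latents,
        "timesteps": timesteps,
        "encoder_hidden_states": batch["text_embeddings"],
        "encoder_attention_mask": batch.get("text_mask"),
        "encoder_hidden_states_2": batch.get("text_embeddings_2"),
        "encoder_attention_mask_2": batch.get("text_mask_2"),
        "image_embeds": image_embeds,
    }
```

The optional keys fall back to None when absent from the batch, mirroring the "(optional)" annotations in the expected batch keys.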
- forward(
- model: torch.nn.Module,
- inputs: Dict[str, Any],
Execute forward pass for HunyuanVideo model.
- Parameters:
model – HunyuanVideo model
inputs – Dictionary from prepare_inputs()
- Returns:
Model prediction tensor
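Conceptually, forward() unpacks the prepared inputs into the model call and returns the prediction. A minimal sketch with a stand-in model (the dummy module below is purely illustrative; a real HunyuanVideo model returns a prediction shaped like the input latents):

```python
import torch


class DummyModel(torch.nn.Module):
    """Stand-in for a HunyuanVideo model; returns a latent-shaped tensor."""

    def forward(self, latents, timesteps, encoder_hidden_states, **_):
        # Real models predict velocity/noise with the latents' shape.
        return torch.zeros_like(latents)


def forward_sketch(model: torch.nn.Module, inputs: dict) -> torch.Tensor:
    # The keys produced by prepare_inputs() map directly onto the model's
    # keyword arguments.
    return model(**inputs)
```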