nemo_automodel.components.flow_matching.adapters.hunyuan

HunyuanVideo model adapter for FlowMatching Pipeline.

This adapter supports HunyuanVideo 1.5 style models with dual text encoders and image embeddings for image-to-video conditioning.

Module Contents

Classes

Name	Description
`HunyuanAdapter`	Model adapter for HunyuanVideo 1.5 style models.

Functions

Name	Description
`_is_flash_varlen_attention_backend`	Return whether `backend` is a flash-varlen attention backend.
`enable_hunyuan_flash_varlen_mask_optimization`	Patch Diffusers Hunyuan attention to avoid dense mask construction for flash-varlen attention.

Data

logger

API

class nemo_automodel.components.flow_matching.adapters.hunyuan.HunyuanAdapter(
    default_image_embed_shape: typing.Tuple[int, int] = (729, 1152),
    use_condition_latents: bool = True
)

Bases: ModelAdapter

Model adapter for HunyuanVideo 1.5 style models.

These models use:

Condition latents concatenated with noisy latents
Dual text encoders with attention masks
Image embeddings for i2v

Expected batch keys:

text_embeddings: Primary text encoder output [B, seq_len, dim]
text_mask: Attention mask for primary encoder [B, seq_len] (optional)
text_embeddings_2: Secondary text encoder output [B, seq_len, dim] (optional)
text_mask_2: Attention mask for secondary encoder [B, seq_len] (optional)
image_embeds: Image embeddings for i2v [B, seq_len, dim] (optional)
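
The batch contract above can be sketched as a small shape check. This is a minimal illustration, not part of the adapter: `check_batch_shapes` and `EXPECTED_RANK` are hypothetical names, and the helper works on plain shape tuples so that it runs without torch. It treats `text_embeddings` as required because it is the only key not marked optional.

```python
from typing import Dict, Optional, Tuple

Shape = Tuple[int, ...]

# Rank of each documented batch key. text_embeddings is listed first on purpose:
# it is required, so it is the key that fixes the batch size B.
EXPECTED_RANK: Dict[str, int] = {
    "text_embeddings": 3,    # [B, seq_len, dim]
    "text_mask": 2,          # [B, seq_len]
    "text_embeddings_2": 3,  # [B, seq_len, dim]
    "text_mask_2": 2,        # [B, seq_len]
    "image_embeds": 3,       # [B, seq_len, dim]
}


def check_batch_shapes(shapes: Dict[str, Shape]) -> int:
    """Validate the documented keys and return the shared batch size B."""
    if "text_embeddings" not in shapes:
        raise KeyError("text_embeddings is the only key not marked optional")
    batch_size: Optional[int] = None
    for key, rank in EXPECTED_RANK.items():
        if key not in shapes:
            continue  # optional key that was not provided
        shape = shapes[key]
        if len(shape) != rank:
            raise ValueError(f"{key}: expected rank {rank}, got shape {shape}")
        if batch_size is None:
            batch_size = shape[0]
        elif shape[0] != batch_size:
            raise ValueError(f"{key}: batch dim {shape[0]} != {batch_size}")
    assert batch_size is not None
    return batch_size


# Illustrative sizes only; (729, 1152) mirrors default_image_embed_shape.
example = {
    "text_embeddings": (2, 77, 4096),
    "text_mask": (2, 77),
    "image_embeds": (2, 729, 1152),
}
print(check_batch_shapes(example))
```

In real use the values would be tensors and the check would read `tensor.shape`; the sequence lengths and the text embedding width shown here are placeholders.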

nemo_automodel.components.flow_matching.adapters.hunyuan.HunyuanAdapter.forward(
    model: torch.nn.Module,
    inputs: typing.Dict[str, typing.Any]
) -> torch.Tensor

Execute forward pass for HunyuanVideo model.

Parameters:

model (nn.Module): HunyuanVideo model
inputs (Dict[str, Any]): Dictionary from prepare_inputs()

Returns: torch.Tensor

Model prediction tensor

nemo_automodel.components.flow_matching.adapters.hunyuan.HunyuanAdapter.get_condition_latents(
    latents: torch.Tensor,
    task_type: str
) -> torch.Tensor

Generate conditional latents based on task type.

Parameters:

latents (torch.Tensor): Input latents [B, C, F, H, W]
task_type (str): Task type ("t2v" or "i2v")

Returns: torch.Tensor

Conditional latents [B, C+1, F, H, W]
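
The shape contract above can be sketched without torch. This shows only the documented shape change from [B, C, F, H, W] to [B, C+1, F, H, W]; how the extra channel is filled for each task type is not described on this page. `condition_latents_shape` is a hypothetical helper, and rejecting unknown task types is an assumption.

```python
from typing import Tuple

Shape5D = Tuple[int, int, int, int, int]


def condition_latents_shape(latents_shape: Shape5D, task_type: str) -> Shape5D:
    """Return the documented output shape [B, C+1, F, H, W] for input [B, C, F, H, W]."""
    if task_type not in ("t2v", "i2v"):
        # Assumption: only the two documented task types are accepted.
        raise ValueError(f"unsupported task_type: {task_type!r}")
    b, c, f, h, w = latents_shape
    return (b, c + 1, f, h, w)


# Illustrative sizes only.
print(condition_latents_shape((1, 32, 8, 30, 40), "i2v"))
```

The returned tensor has one more channel than the input, so a model consuming both along the channel axis must be configured for the combined channel count.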

nemo_automodel.components.flow_matching.adapters.hunyuan.HunyuanAdapter.prepare_inputs(
    context: nemo_automodel.components.flow_matching.adapters.base.FlowMatchingContext
) -> typing.Dict[str, typing.Any]

Prepare inputs for HunyuanVideo model.

Parameters:

context (FlowMatchingContext): FlowMatchingContext with batch data

Returns: Dict[str, Any]

Dictionary of model inputs, passed to forward() as inputs.

nemo_automodel.components.flow_matching.adapters.hunyuan._is_flash_varlen_attention_backend(
    backend: typing.Any
) -> bool

nemo_automodel.components.flow_matching.adapters.hunyuan.enable_hunyuan_flash_varlen_mask_optimization() -> bool

Patch Diffusers Hunyuan attention to avoid dense mask construction for flash-varlen attention.
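
For background on why a dense mask can be avoided: variable-length flash attention kernels generally take cumulative sequence lengths rather than a dense [B, L, L] mask. The sketch below derives those offsets from a [B, seq_len] padding mask. It illustrates the general technique only and is not the patch applied by this function; `cumulative_seqlens` is a hypothetical helper.

```python
from itertools import accumulate
from typing import List, Sequence


def cumulative_seqlens(mask: Sequence[Sequence[int]]) -> List[int]:
    """Turn a [B, seq_len] 0/1 padding mask into B+1 cumulative length offsets."""
    # Count the valid (non-zero) positions in each row.
    lengths = [sum(1 for m in row if m) for row in mask]
    # Offsets start at 0; entry i+1 is the total number of valid tokens in rows 0..i.
    return [0, *accumulate(lengths)]


mask = [
    [1, 1, 0],  # 2 valid tokens
    [1, 1, 1],  # 3 valid tokens
]
print(cumulative_seqlens(mask))  # [0, 2, 5]
```

The offsets take O(B) memory, whereas a dense attention mask grows with the square of the sequence length.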

nemo_automodel.components.flow_matching.adapters.hunyuan.logger = logging.getLogger(__name__)