nemo_automodel.components.flow_matching.adapters.hunyuan


HunyuanVideo model adapter for FlowMatching Pipeline.

This adapter supports HunyuanVideo 1.5 style models with dual text encoders and image embeddings for image-to-video conditioning.

Module Contents

Classes

Name | Description
HunyuanAdapter | Model adapter for HunyuanVideo 1.5 style models.

Functions

Name | Description
_is_flash_varlen_attention_backend | -
enable_hunyuan_flash_varlen_mask_optimization | Patch Diffusers Hunyuan attention to avoid dense mask construction for flash-varlen attention.

Data

logger

API

class nemo_automodel.components.flow_matching.adapters.hunyuan.HunyuanAdapter(
default_image_embed_shape: typing.Tuple[int, int] = (729, 1152),
use_condition_latents: bool = True
)

Bases: ModelAdapter

Model adapter for HunyuanVideo 1.5 style models.

These models use:

  • Condition latents concatenated with noisy latents
  • Dual text encoders with attention masks
  • Image embeddings for i2v

Expected batch keys:

  • text_embeddings: Primary text encoder output [B, seq_len, dim]
  • text_mask: Attention mask for primary encoder [B, seq_len] (optional)
  • text_embeddings_2: Secondary text encoder output [B, seq_len, dim] (optional)
  • text_mask_2: Attention mask for secondary encoder [B, seq_len] (optional)
  • image_embeds: Image embeddings for i2v [B, seq_len, dim] (optional)
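
The key list above can be checked before a batch reaches the adapter. The following is a minimal sketch, not part of the library: `check_batch_shapes` is a hypothetical helper that works on plain shape tuples, and it assumes `text_embeddings` is the only required key because it is the only one not marked optional. The text dimensions in the example are made up; the image embedding shape reuses the documented default `(729, 1152)`.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]

# Expected rank per documented key: embeddings are [B, seq_len, dim],
# masks are [B, seq_len].
EXPECTED_RANK = {
    "text_embeddings": 3,
    "text_mask": 2,
    "text_embeddings_2": 3,
    "text_mask_2": 2,
    "image_embeds": 3,
}

# ASSUMPTION: only the key not marked "(optional)" is treated as required.
REQUIRED_KEYS = ("text_embeddings",)


def check_batch_shapes(shapes: Dict[str, Shape]) -> None:
    """Raise ValueError if documented keys are missing or mis-shaped."""
    for key in REQUIRED_KEYS:
        if key not in shapes:
            raise ValueError(f"missing required batch key: {key}")

    # Every supplied key must have the documented rank.
    for key, rank in EXPECTED_RANK.items():
        if key in shapes and len(shapes[key]) != rank:
            raise ValueError(f"{key}: expected rank {rank}, got shape {shapes[key]}")

    # Every supplied key must share the batch dimension B.
    batch_size = shapes["text_embeddings"][0]
    for key in EXPECTED_RANK:
        if key in shapes and shapes[key][0] != batch_size:
            raise ValueError(f"{key}: batch size {shapes[key][0]} != {batch_size}")

    # A mask must cover the same seq_len as the embeddings it belongs to.
    pairs = (("text_embeddings", "text_mask"), ("text_embeddings_2", "text_mask_2"))
    for emb_key, mask_key in pairs:
        if emb_key in shapes and mask_key in shapes:
            if shapes[emb_key][1] != shapes[mask_key][1]:
                raise ValueError(f"{mask_key}: seq_len does not match {emb_key}")


# Hypothetical shapes for illustration only.
check_batch_shapes(
    {
        "text_embeddings": (2, 16, 64),
        "text_mask": (2, 16),
        "image_embeds": (2, 729, 1152),
    }
)
```

With real tensors, the shape dictionary could be built as `{k: tuple(v.shape) for k, v in batch.items()}`.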

nemo_automodel.components.flow_matching.adapters.hunyuan.HunyuanAdapter.forward(
model: torch.nn.Module,
inputs: typing.Dict[str, typing.Any]
) -> torch.Tensor

Execute forward pass for HunyuanVideo model.

Parameters:

model
nn.Module

HunyuanVideo model

inputs
Dict[str, Any]

Dictionary from prepare_inputs()

Returns: torch.Tensor

Model prediction tensor

nemo_automodel.components.flow_matching.adapters.hunyuan.HunyuanAdapter.get_condition_latents(
latents: torch.Tensor,
task_type: str
) -> torch.Tensor

Generate conditional latents based on task type.

Parameters:

latents
torch.Tensor

Input latents [B, C, F, H, W]

task_type
str

Task type ("t2v" or "i2v")

Returns: torch.Tensor

Conditional latents [B, C+1, F, H, W]
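
The shape contract above amounts to one extra channel on top of the input latents. The following sketch covers the shape arithmetic only; it does not reproduce how the library fills the tensor. Both helper names are hypothetical, and the combined channel count assumes that "concatenated with noisy latents" means concatenation along the channel axis, which this page does not state explicitly.

```python
from typing import Tuple

Shape5D = Tuple[int, int, int, int, int]


def condition_latents_shape(latents_shape: Shape5D) -> Shape5D:
    """Map [B, C, F, H, W] to the documented [B, C+1, F, H, W]."""
    b, c, f, h, w = latents_shape
    return (b, c + 1, f, h, w)


def concatenated_channels(latents_shape: Shape5D) -> int:
    """Channel count after joining noisy and condition latents.

    ASSUMPTION: concatenation happens along the channel axis (dim 1).
    """
    c = latents_shape[1]
    return c + (c + 1)


# Hypothetical latent shape for illustration only.
latents_shape = (1, 32, 4, 8, 8)
cond_shape = condition_latents_shape(latents_shape)   # (1, 33, 4, 8, 8)
model_in_channels = concatenated_channels(latents_shape)  # 65
```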

nemo_automodel.components.flow_matching.adapters.hunyuan.HunyuanAdapter.prepare_inputs(
context: nemo_automodel.components.flow_matching.adapters.base.FlowMatchingContext
) -> typing.Dict[str, typing.Any]

Prepare inputs for HunyuanVideo model.

Parameters:

context
FlowMatchingContext

FlowMatchingContext with batch data

Returns: Dict[str, Any]

Dictionary containing:

nemo_automodel.components.flow_matching.adapters.hunyuan._is_flash_varlen_attention_backend(
backend: typing.Any
) -> bool

nemo_automodel.components.flow_matching.adapters.hunyuan.enable_hunyuan_flash_varlen_mask_optimization() -> bool

Patch Diffusers Hunyuan attention to avoid dense mask construction for flash-varlen attention.
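
Since the function takes no arguments and returns a bool, a caller might wrap it in a guarded helper such as the one sketched below. `try_enable_flash_varlen_patch` is a hypothetical wrapper, and reading the return value as "the patch was applied" is an assumption; this page does not document what the bool means.

```python
import logging

logger = logging.getLogger(__name__)


def try_enable_flash_varlen_patch() -> bool:
    """Call the patch function if the package is importable."""
    try:
        from nemo_automodel.components.flow_matching.adapters.hunyuan import (
            enable_hunyuan_flash_varlen_mask_optimization,
        )
    except ImportError:
        # Package (or one of its dependencies) is not installed.
        logger.warning("nemo_automodel is not importable; skipping the attention patch")
        return False

    # ASSUMPTION: a truthy return value means the patch was applied.
    applied = bool(enable_hunyuan_flash_varlen_mask_optimization())
    logger.info("Hunyuan flash-varlen mask optimization applied: %s", applied)
    return applied


patched = try_enable_flash_varlen_patch()
```

Calling the helper once at startup, before the model is built, keeps the patching decision in one place.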

nemo_automodel.components.flow_matching.adapters.hunyuan.logger = logging.getLogger(__name__)