nemo_automodel.components.flow_matching.adapters.hunyuan#
HunyuanVideo model adapter for FlowMatching Pipeline.
This adapter supports HunyuanVideo 1.5-style models with dual text encoders and image embeddings for image-to-video (i2v) conditioning.
Module Contents#
Classes#
HunyuanAdapter – Model adapter for HunyuanVideo 1.5 style models.
API#
- class nemo_automodel.components.flow_matching.adapters.hunyuan.HunyuanAdapter(
- default_image_embed_shape: Tuple[int, int] = (729, 1152),
- use_condition_latents: bool = True,
Bases: nemo_automodel.components.flow_matching.adapters.base.ModelAdapter

Model adapter for HunyuanVideo 1.5 style models.
These models use:

- Condition latents concatenated with noisy latents
- Dual text encoders with attention masks
- Image embeddings for image-to-video (i2v) conditioning

Expected batch keys:

- text_embeddings: Primary text encoder output [B, seq_len, dim]
- text_mask: Attention mask for primary encoder [B, seq_len] (optional)
- text_embeddings_2: Secondary text encoder output [B, seq_len, dim] (optional)
- text_mask_2: Attention mask for secondary encoder [B, seq_len] (optional)
- image_embeds: Image embeddings for i2v [B, seq_len, dim] (optional)
Example

    adapter = HunyuanAdapter()
    pipeline = FlowMatchingPipelineV2(model_adapter=adapter)
Initialization
Initialize the HunyuanAdapter.
- Parameters:
default_image_embed_shape – Default shape for image embeddings (seq_len, dim) when not provided in the batch. Defaults to (729, 1152).
use_condition_latents – Whether to concatenate condition latents with noisy latents. Defaults to True.
- get_condition_latents(
- latents: torch.Tensor,
- task_type: str,
Generate conditional latents based on task type.
- Parameters:
latents – Input latents [B, C, F, H, W]
task_type – Task type ("t2v" or "i2v")
- Returns:
Conditional latents [B, C+1, F, H, W]
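The following is an illustrative sketch (not the library implementation) of how a condition-mask channel can turn [B, C, F, H, W] latents into the documented [B, C+1, F, H, W] output. The exact masking scheme, including the assumption that i2v conditions on the first frame, is hypothetical.

```python
import torch


def make_condition_latents(latents: torch.Tensor, task_type: str) -> torch.Tensor:
    """Sketch of get_condition_latents: append a condition-mask channel."""
    b, c, f, h, w = latents.shape
    # One extra channel marks which frames are conditioning frames.
    mask = torch.zeros(b, 1, f, h, w, dtype=latents.dtype, device=latents.device)
    cond = torch.zeros_like(latents)
    if task_type == "i2v":
        # Assumption: the first frame carries the conditioning image for i2v.
        mask[:, :, 0] = 1.0
        cond[:, :, 0] = latents[:, :, 0]
    elif task_type != "t2v":
        raise ValueError(f"unsupported task_type: {task_type!r}")
    # For t2v everything stays zero (pure text-conditioned generation).
    return torch.cat([cond, mask], dim=1)  # [B, C+1, F, H, W]
```

For t2v the result is all zeros, so concatenating it with the noisy latents adds no image information; for i2v only the first frame (and its mask) is populated.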
- prepare_inputs(context) → Dict[str, Any]#
Prepare inputs for HunyuanVideo model.
- Parameters:
context – FlowMatchingContext with batch data
- Returns:
A dictionary containing:
- latents: Noisy latents (optionally concatenated with condition latents)
- timesteps: Timestep values
- encoder_hidden_states: Primary text embeddings
- encoder_attention_mask: Primary attention mask
- encoder_hidden_states_2: Secondary text embeddings
- encoder_attention_mask_2: Secondary attention mask
- image_embeds: Image embeddings
- Return type:
Dict[str, Any]
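As a rough sketch of the shape of this dictionary (the helper name, the zero-padding fallback for missing image embeddings, and the plain-dict batch are assumptions for illustration; the real method consumes a FlowMatchingContext):

```python
import torch


def prepare_inputs_sketch(batch, noisy_latents, timesteps,
                          default_image_embed_shape=(729, 1152)):
    """Assemble model kwargs with the keys documented above."""
    b = noisy_latents.shape[0]
    seq_len, dim = default_image_embed_shape
    image_embeds = batch.get("image_embeds")
    if image_embeds is None:
        # Assumption: absent i2v conditioning is padded with zeros of the
        # default (seq_len, dim) shape.
        image_embeds = noisy_latents.new_zeros(b, seq_len, dim)
    return {
        "latents": noisy_latents,
        "timesteps": timesteps,
        "encoder_hidden_states": batch["text_embeddings"],
        "encoder_attention_mask": batch.get("text_mask"),
        "encoder_hidden_states_2": batch.get("text_embeddings_2"),
        "encoder_attention_mask_2": batch.get("text_mask_2"),
        "image_embeds": image_embeds,
    }
```

The optional keys fall back to None when absent from the batch, mirroring the "(optional)" annotations in the expected batch keys.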
- forward(
- model: torch.nn.Module,
- inputs: Dict[str, Any],
Execute forward pass for HunyuanVideo model.
- Parameters:
model – HunyuanVideo model
inputs – Dictionary from prepare_inputs()
- Returns:
Model prediction tensor
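Conceptually, forward() unpacks the prepared inputs into the model call and returns the prediction. A minimal sketch with a stand-in model (the dummy module below is purely illustrative; a real HunyuanVideo model returns a prediction shaped like the input latents):

```python
import torch


class DummyModel(torch.nn.Module):
    """Stand-in for a HunyuanVideo model; returns a latent-shaped tensor."""

    def forward(self, latents, timesteps, encoder_hidden_states, **_):
        # Real models predict velocity/noise with the latents' shape.
        return torch.zeros_like(latents)


def forward_sketch(model: torch.nn.Module, inputs: dict) -> torch.Tensor:
    # The keys produced by prepare_inputs() map directly onto the model's
    # keyword arguments.
    return model(**inputs)
```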