`nemo_automodel.components.flow_matching.adapters.qwen_image`#

Qwen-Image model adapter for FlowMatching Pipeline.

This adapter supports Qwen/Qwen-Image style text-to-image (T2I) models with:

Qwen2 text embeddings (text_embeddings)
2D image latents ([B, C, H, W])
2x2 patch packing similar to Flux

Module Contents#

Classes#

QwenImageAdapter

Model adapter for Qwen-Image text-to-image models.

API#

class nemo_automodel.components.flow_matching.adapters.qwen_image.QwenImageAdapter( guidance_scale: float = 3.5, use_guidance_embeds: bool = False, )#

Bases: nemo_automodel.components.flow_matching.adapters.base.ModelAdapter

Model adapter for Qwen-Image text-to-image models.

Supports the batch format produced by the multiresolution dataloader:

image_latents: [B, C, H, W]
text_embeddings: Qwen2 embeddings [B, seq_len, hidden_dim]

Qwen-Image transformer forward interface:

hidden_states: Packed latents [B, num_patches, C*4]
encoder_hidden_states: Qwen2 text embeddings [B, seq_len, hidden_dim]
encoder_hidden_states_mask: Attention mask (None for flash attention)
timestep: Timesteps normalized to the range [0, 1]
img_shapes: List of image shape tuples [[(1, h//2, w//2)]] per sample
guidance: Optional guidance scale embedding [B]
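As a rough illustration of the interface above, the following sketch computes the shapes the transformer inputs would have for a given latent batch. The helper name `expected_forward_shapes` is hypothetical and not part of the adapter; the `[B]` shape for `timestep` and the example sizes (16 latent channels, 512 text tokens, hidden size 3584) are assumptions chosen only for illustration.

```python
from typing import Any, Dict


def expected_forward_shapes(
    batch: int,
    channels: int,
    height: int,
    width: int,
    seq_len: int,
    hidden_dim: int,
    use_guidance_embeds: bool = False,
) -> Dict[str, Any]:
    """Describe the forward inputs for image latents of shape [B, C, H, W].

    Hypothetical helper: it only does shape arithmetic, no tensors involved.
    """
    if height % 2 or width % 2:
        raise ValueError("latent height and width must be even for 2x2 packing")
    num_patches = (height // 2) * (width // 2)
    return {
        # Each 2x2 patch of C channels becomes one token of size C*4.
        "hidden_states": (batch, num_patches, channels * 4),
        "encoder_hidden_states": (batch, seq_len, hidden_dim),
        # Assumed: one normalized timestep per sample.
        "timestep": (batch,),
        # One single-element list of (1, h//2, w//2) per sample.
        "img_shapes": [[(1, height // 2, width // 2)] for _ in range(batch)],
        "guidance": (batch,) if use_guidance_embeds else None,
    }


shapes = expected_forward_shapes(2, 16, 64, 64, seq_len=512, hidden_dim=3584)
print(shapes["hidden_states"])  # (2, 1024, 64)
print(shapes["img_shapes"][0])  # [(1, 32, 32)]
```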

Initialization

Initialize QwenImageAdapter.

Parameters:

guidance_scale – Guidance scale for classifier-free guidance
use_guidance_embeds – Whether to use guidance embeddings
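One plausible way these two parameters could combine into the `guidance` input of shape `[B]` is sketched below. This is an assumption about the behavior, not the adapter's confirmed implementation, and `make_guidance` is a hypothetical helper name.

```python
from typing import Optional

import torch


def make_guidance(
    batch_size: int,
    guidance_scale: float = 3.5,
    use_guidance_embeds: bool = False,
) -> Optional[torch.Tensor]:
    """Return a [B] tensor filled with guidance_scale, or None when disabled.

    Assumed behavior: guidance is only passed to the transformer when
    guidance embeddings are enabled.
    """
    if not use_guidance_embeds:
        return None
    return torch.full((batch_size,), guidance_scale, dtype=torch.float32)


print(make_guidance(4))  # None
print(make_guidance(4, use_guidance_embeds=True).shape)  # torch.Size([4])
```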

_pack_latents(latents: torch.Tensor) → torch.Tensor#

Pack latents from [B, C, H, W] to [B, (H//2)*(W//2), C*4].

Uses 2x2 patch grouping to match the transformer’s patch embedding.
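The 2x2 grouping can be sketched with a reshape and a permute. This is a minimal sketch of the general technique, assuming even `H` and `W` and a channel-major ordering inside each token; the standalone function `pack_latents` is illustrative and may differ in detail from the adapter's `_pack_latents`.

```python
import torch


def pack_latents(latents: torch.Tensor) -> torch.Tensor:
    """Pack [B, C, H, W] latents into [B, (H//2)*(W//2), C*4] tokens."""
    b, c, h, w = latents.shape
    if h % 2 or w % 2:
        raise ValueError("H and W must be even for 2x2 packing")
    # Split each spatial axis into (patch index, offset within patch).
    x = latents.reshape(b, c, h // 2, 2, w // 2, 2)
    # Move the patch grid forward: [B, H/2, W/2, C, 2, 2].
    x = x.permute(0, 2, 4, 1, 3, 5)
    # Flatten the grid into a sequence and each patch into one token.
    return x.reshape(b, (h // 2) * (w // 2), c * 4)


latents = torch.randn(2, 16, 64, 64)
print(pack_latents(latents).shape)  # torch.Size([2, 1024, 64])
```

With this ordering, the first token of each sample holds the top-left 2x2 patch of every channel, flattened channel by channel.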

static _unpack_latents( latents: torch.Tensor, height: int, width: int, vae_scale_factor: int = 8, ) → torch.Tensor#

Unpack latents from [B, num_patches, channels] back to [B, C, H, W].

Parameters:

latents – Packed latents of shape [B, num_patches, channels]
height – Original image height in pixels
width – Original image width in pixels
vae_scale_factor – VAE compression factor (default: 8)
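The inverse operation can be sketched as follows. The conversion from pixel size to latent size (floor division by `vae_scale_factor`, rounded down to an even number) and the in-token ordering are assumptions modeled on common packing conventions; `unpack_latents` here is an illustrative standalone function, not the adapter's exact code.

```python
import torch


def unpack_latents(
    latents: torch.Tensor,
    height: int,
    width: int,
    vae_scale_factor: int = 8,
) -> torch.Tensor:
    """Unpack [B, num_patches, C*4] tokens back to [B, C, H, W] latents."""
    b, num_patches, channels = latents.shape
    # Assumed: latent size is the pixel size divided by the VAE factor,
    # rounded down to an even number so it splits into 2x2 patches.
    h = 2 * (height // (vae_scale_factor * 2))
    w = 2 * (width // (vae_scale_factor * 2))
    if num_patches != (h // 2) * (w // 2):
        raise ValueError("num_patches does not match the given height/width")
    # Restore the patch grid: [B, H/2, W/2, C, 2, 2].
    x = latents.reshape(b, h // 2, w // 2, channels // 4, 2, 2)
    # Interleave patch offsets with the grid: [B, C, H/2, 2, W/2, 2].
    x = x.permute(0, 3, 1, 4, 2, 5)
    return x.reshape(b, channels // 4, h, w)


packed = torch.randn(1, 24, 64)  # 4 x 6 patch grid, 16 channels * 4
print(unpack_latents(packed, height=64, width=96).shape)  # torch.Size([1, 16, 8, 12])
```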

prepare_inputs( context: nemo_automodel.components.flow_matching.adapters.base.FlowMatchingContext, ) → Dict[str, Any]#

Prepare inputs for Qwen-Image model from FlowMatchingContext.

Expects 4D image latents: [B, C, H, W]

forward( model: torch.nn.Module, inputs: Dict[str, Any], ) → torch.Tensor#

Execute forward pass for Qwen-Image model.

Returns unpacked prediction in [B, C, H, W] format.
