nemo_automodel.components.flow_matching.adapters.qwen_image
Qwen-Image model adapter for FlowMatching Pipeline.
This adapter supports Qwen/Qwen-Image style T2I models with:
- Qwen2 text embeddings (text_embeddings)
- 2D image latents ([B, C, H, W])
- 2x2 patch packing similar to Flux
Module Contents
Classes
API
Bases: ModelAdapter
Model adapter for Qwen-Image text-to-image models.
Supports the batch format produced by the multiresolution dataloader:
- image_latents: [B, C, H, W]
- text_embeddings: Qwen2 embeddings [B, seq_len, hidden_dim]
Qwen-Image transformer forward interface:
- hidden_states: Packed latents [B, num_patches, C*4]
- encoder_hidden_states: Qwen2 text embeddings [B, seq_len, hidden_dim]
- encoder_hidden_states_mask: Attention mask (None for flash attention)
- timestep: Normalized timesteps in the range [0, 1]
- img_shapes: List of image shape tuples [[(1, h//2, w//2)]] per sample
- guidance: Optional guidance scale embedding [B]
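As a rough illustration of how the shapes listed above relate to one another, the sketch below derives `img_shapes` and the packed `hidden_states` shape from a latent batch size and resolution. The helper names `build_img_shapes` and `packed_shape` are hypothetical and not part of this module, and the sketch assumes that `h` and `w` in the `img_shapes` description refer to latent (not pixel) dimensions.

```python
def build_img_shapes(batch_size, latent_h, latent_w):
    """Return one [(1, h//2, w//2)] entry per sample (hypothetical helper)."""
    if latent_h % 2 or latent_w % 2:
        raise ValueError("latent height and width must be even for 2x2 packing")
    return [[(1, latent_h // 2, latent_w // 2)] for _ in range(batch_size)]


def packed_shape(batch_size, channels, latent_h, latent_w):
    """Shape of hidden_states after 2x2 packing: [B, num_patches, C*4]."""
    return (batch_size, (latent_h // 2) * (latent_w // 2), channels * 4)


# Example: a batch of 2 latents with 16 channels at 64x96 latent resolution.
shapes = build_img_shapes(batch_size=2, latent_h=64, latent_w=96)
hidden = packed_shape(batch_size=2, channels=16, latent_h=64, latent_w=96)
print(shapes)  # one [(1, 32, 48)] entry per sample
print(hidden)  # (2, 1536, 64)
```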
Pack latents from [B, C, H, W] to [B, (H//2)*(W//2), C*4].
Uses 2x2 patch grouping to match the transformer’s patch embedding.
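A minimal sketch of 2x2 patch packing, written with NumPy rather than PyTorch so that it runs without a GPU stack. It assumes a Flux-style ordering in which each packed token holds the 2x2 patch of every channel, channel-major; the adapter's actual memory layout may differ. The function name `pack_latents` is illustrative only.

```python
import numpy as np


def pack_latents(latents):
    """Group 2x2 spatial patches: [B, C, H, W] -> [B, (H//2)*(W//2), C*4]."""
    b, c, h, w = latents.shape
    if h % 2 or w % 2:
        raise ValueError("H and W must be even for 2x2 packing")
    # Split each spatial axis into (patch index, offset inside the patch).
    x = latents.reshape(b, c, h // 2, 2, w // 2, 2)
    # Move the patch indices forward: [B, H/2, W/2, C, 2, 2].
    x = x.transpose(0, 2, 4, 1, 3, 5)
    # Flatten the patch grid into a sequence and each patch into a vector.
    return x.reshape(b, (h // 2) * (w // 2), c * 4)


latents = np.arange(2 * 4 * 6 * 8, dtype=np.float32).reshape(2, 4, 6, 8)
packed = pack_latents(latents)
print(packed.shape)  # (2, 12, 16)
```

With this ordering, the first four values of the first token are the top-left 2x2 patch of channel 0.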
Unpack latents from [B, num_patches, channels] back to [B, C, H, W].
Parameters:
Packed latents of shape [B, num_patches, channels]
Original image height in pixels
Original image width in pixels
VAE compression factor (default: 8)
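The inverse operation can be sketched in the same way. This NumPy version assumes that the latent resolution is the pixel resolution divided by the VAE compression factor, rounded down to an even number, and that tokens use the channel-major 2x2 layout typical of Flux-style packing. The name `unpack_latents` and its argument names are illustrative and may not match the adapter's actual signature.

```python
import numpy as np


def unpack_latents(packed, height, width, vae_scale_factor=8):
    """Undo 2x2 packing: [B, num_patches, channels] -> [B, C, H, W]."""
    b, num_patches, channels = packed.shape
    # Latent size implied by the pixel size, forced to be even.
    lat_h = 2 * (height // (vae_scale_factor * 2))
    lat_w = 2 * (width // (vae_scale_factor * 2))
    if num_patches != (lat_h // 2) * (lat_w // 2):
        raise ValueError("num_patches does not match the requested image size")
    if channels % 4:
        raise ValueError("channels must be divisible by 4 (C * 2 * 2)")
    # Restore the patch grid and the 2x2 offsets: [B, H/2, W/2, C, 2, 2].
    x = packed.reshape(b, lat_h // 2, lat_w // 2, channels // 4, 2, 2)
    # Interleave the offsets back into the spatial axes: [B, C, H/2, 2, W/2, 2].
    x = x.transpose(0, 3, 1, 4, 2, 5)
    return x.reshape(b, channels // 4, lat_h, lat_w)


# A 48x64 pixel image with factor 8 gives 6x8 latents, i.e. 3*4 = 12 patches.
packed = np.zeros((1, 12, 16), dtype=np.float32)
latents = unpack_latents(packed, height=48, width=64)
print(latents.shape)  # (1, 4, 6, 8)
```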
Execute forward pass for Qwen-Image model.
Returns unpacked prediction in [B, C, H, W] format.
Prepare inputs for Qwen-Image model from FlowMatchingContext.
Expects 4D image latents: [B, C, H, W]