nemo_automodel.components.flow_matching.adapters.qwen_image

Qwen-Image model adapter for FlowMatching Pipeline.

This adapter supports Qwen/Qwen-Image style T2I models with:

  • Qwen2 text embeddings (text_embeddings)
  • 2D image latents (4D tensors of shape [B, C, H, W])
  • 2x2 patch packing similar to Flux

Module Contents

Classes

Name                Description
QwenImageAdapter    Model adapter for Qwen-Image text-to-image models.

API

class nemo_automodel.components.flow_matching.adapters.qwen_image.QwenImageAdapter(
guidance_scale: float = 3.5,
use_guidance_embeds: bool = False
)

Bases: ModelAdapter

Model adapter for Qwen-Image text-to-image models.

Supports batch format from multiresolution dataloader:

  • image_latents: [B, C, H, W]
  • text_embeddings: Qwen2 embeddings [B, seq_len, hidden_dim]

Qwen-Image transformer forward interface:

  • hidden_states: Packed latents [B, num_patches, C*4]
  • encoder_hidden_states: Qwen2 text embeddings [B, seq_len, hidden_dim]
  • encoder_hidden_states_mask: Attention mask (None for flash attention)
  • timestep: Normalized timesteps in [0, 1]
  • img_shapes: List of image shape tuples [[(1, h//2, w//2)]] per sample
  • guidance: Optional guidance scale embedding [B]
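
The shape bookkeeping implied by the two lists above can be sketched in plain Python. This is only an illustration: `packed_shapes` is a hypothetical helper, not part of the adapter, and it assumes that `img_shapes` holds one `[(1, h//2, w//2)]` entry per sample, with the leading 1 read as a frame count for a still image.

```python
from typing import List, Tuple


def packed_shapes(
    latent_shape: Tuple[int, int, int, int],
) -> Tuple[Tuple[int, int, int], List[List[Tuple[int, int, int]]]]:
    """Return (hidden_states shape, img_shapes) for a [B, C, H, W] latent batch."""
    b, c, h, w = latent_shape
    if h % 2 or w % 2:
        raise ValueError("H and W must be even for 2x2 patch packing")
    # Each 2x2 spatial patch becomes one token with C*4 features.
    hidden_states_shape = (b, (h // 2) * (w // 2), c * 4)
    # Assumption: one single-entry list per sample in the batch.
    img_shapes = [[(1, h // 2, w // 2)] for _ in range(b)]
    return hidden_states_shape, img_shapes


shape, img_shapes = packed_shapes((2, 16, 64, 96))
print(shape)       # (2, 1536, 64)
print(img_shapes)  # [[(1, 32, 48)], [(1, 32, 48)]]
```
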
nemo_automodel.components.flow_matching.adapters.qwen_image.QwenImageAdapter._pack_latents(
latents: torch.Tensor
) -> torch.Tensor

Pack latents from [B, C, H, W] to [B, (H//2)*(W//2), C*4].

Uses 2x2 patch grouping to match the transformer’s patch embedding.
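
A minimal sketch of 2x2 patch packing, assuming PyTorch and even H and W. The function name `pack_latents` and the exact ordering of features inside each token (channel first, then the 2x2 offsets) are assumptions for illustration; the adapter's own implementation may order them differently.

```python
import torch


def pack_latents(latents: torch.Tensor) -> torch.Tensor:
    """Pack [B, C, H, W] latents into [B, (H//2)*(W//2), C*4] tokens."""
    b, c, h, w = latents.shape
    # Split H into (patch rows, 2) and W into (patch cols, 2).
    x = latents.reshape(b, c, h // 2, 2, w // 2, 2)
    # Bring the patch grid forward; keep channel and 2x2 offsets last.
    x = x.permute(0, 2, 4, 1, 3, 5)
    # Flatten the grid into a token axis and each patch into one feature vector.
    return x.reshape(b, (h // 2) * (w // 2), c * 4)


latents = torch.arange(2 * 4 * 4 * 6, dtype=torch.float32).reshape(2, 4, 4, 6)
packed = pack_latents(latents)
print(packed.shape)  # torch.Size([2, 6, 16])
```

In this sketch, token 0 of each sample holds the top-left 2x2 patch across all channels, and tokens advance along the width before moving down a patch row.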

nemo_automodel.components.flow_matching.adapters.qwen_image.QwenImageAdapter._unpack_latents(
latents: torch.Tensor,
height: int,
width: int,
vae_scale_factor: int = 8
) -> torch.Tensor
staticmethod

Unpack latents from [B, num_patches, channels] back to [B, C, H, W].

Parameters:

latents
torch.Tensor

Packed latents of shape [B, num_patches, channels]

height
int

Original image height in pixels

width
int

Original image width in pixels

vae_scale_factor
int, defaults to 8

VAE compression factor (default: 8)
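
The unpacking step can be sketched as the inverse of a 2x2 packing. This is a sketch under stated assumptions rather than the adapter's code: the latent grid size is derived from the pixel size as `2 * (size // (vae_scale_factor * 2))`, a convention borrowed from Flux-style pipelines, and the feature order inside each token is assumed to be channel first, then the 2x2 offsets.

```python
import torch


def unpack_latents(
    latents: torch.Tensor, height: int, width: int, vae_scale_factor: int = 8
) -> torch.Tensor:
    """Unpack [B, num_patches, channels] tokens back to [B, C, H, W]."""
    b, num_patches, channels = latents.shape
    # Assumed rule: latent size is the pixel size divided by the VAE factor,
    # rounded down to a multiple of 2 so it splits into 2x2 patches.
    h = 2 * (height // (vae_scale_factor * 2))
    w = 2 * (width // (vae_scale_factor * 2))
    if num_patches != (h // 2) * (w // 2):
        raise ValueError("num_patches does not match the given height and width")
    # Restore the patch grid and the (channel, 2, 2) layout of each token.
    x = latents.reshape(b, h // 2, w // 2, channels // 4, 2, 2)
    # Interleave patch indices with their 2x2 offsets: [B, C, H//2, 2, W//2, 2].
    x = x.permute(0, 3, 1, 4, 2, 5)
    return x.reshape(b, channels // 4, h, w)


# Round trip: pack a 256x384-pixel image's latents by hand, then unpack them.
original = torch.randn(1, 16, 32, 48)
tokens = (
    original.reshape(1, 16, 16, 2, 24, 2)
    .permute(0, 2, 4, 1, 3, 5)
    .reshape(1, 16 * 24, 64)
)
restored = unpack_latents(tokens, height=256, width=384)
print(torch.equal(restored, original))  # True
```
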

nemo_automodel.components.flow_matching.adapters.qwen_image.QwenImageAdapter.forward(
model: torch.nn.Module,
inputs: typing.Dict[str, typing.Any]
) -> torch.Tensor

Execute forward pass for Qwen-Image model.

Returns unpacked prediction in [B, C, H, W] format.
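
As a rough illustration of this flow, the sketch below calls a stand-in model with the documented keyword arguments and unpacks its output. `ToyTransformer` and `forward_sketch` are hypothetical names; passing the latent height and width explicitly, and the model returning a plain tensor, are assumptions made to keep the example self-contained.

```python
import torch
from torch import nn


class ToyTransformer(nn.Module):
    """Hypothetical stand-in that accepts the documented keyword arguments."""

    def __init__(self, packed_dim: int):
        super().__init__()
        self.proj = nn.Linear(packed_dim, packed_dim)

    def forward(self, hidden_states, encoder_hidden_states,
                encoder_hidden_states_mask, timestep, img_shapes, guidance=None):
        # Returns a prediction with the same packed shape as hidden_states.
        return self.proj(hidden_states)


def forward_sketch(model: nn.Module, inputs: dict, latent_hw: tuple) -> torch.Tensor:
    """Run the model on packed inputs and return a [B, C, H, W] prediction."""
    h, w = latent_hw
    packed_pred = model(**inputs)
    b, _, packed_dim = packed_pred.shape
    c = packed_dim // 4
    # Undo the 2x2 packing (assumed token layout: channel, then 2x2 offsets).
    x = packed_pred.reshape(b, h // 2, w // 2, c, 2, 2).permute(0, 3, 1, 4, 2, 5)
    return x.reshape(b, c, h, w)


b, c, h, w = 2, 16, 8, 12
inputs = {
    "hidden_states": torch.randn(b, (h // 2) * (w // 2), c * 4),
    "encoder_hidden_states": torch.randn(b, 7, 32),
    "encoder_hidden_states_mask": None,
    "timestep": torch.rand(b),
    "img_shapes": [[(1, h // 2, w // 2)] for _ in range(b)],
    "guidance": None,
}
pred = forward_sketch(ToyTransformer(c * 4), inputs, (h, w))
print(pred.shape)  # torch.Size([2, 16, 8, 12])
```
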

nemo_automodel.components.flow_matching.adapters.qwen_image.QwenImageAdapter.prepare_inputs(
context: nemo_automodel.components.flow_matching.adapters.base.FlowMatchingContext
) -> typing.Dict[str, typing.Any]

Prepare inputs for Qwen-Image model from FlowMatchingContext.

Expects 4D image latents: [B, C, H, W]
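
A hedged sketch of what such input preparation might look like. The context attribute names used here (`noisy_latents`, `text_embeddings`, `timesteps`) are hypothetical stand-ins, not the real `FlowMatchingContext` fields, and the timesteps are assumed to be already normalized to [0, 1]. Only the output keys come from the transformer interface documented above.

```python
from types import SimpleNamespace

import torch


def prepare_inputs_sketch(ctx, guidance_scale: float = 3.5,
                          use_guidance_embeds: bool = False) -> dict:
    """Build transformer keyword arguments from a (hypothetical) context object."""
    latents = ctx.noisy_latents  # hypothetical attribute name
    if latents.ndim != 4:
        raise ValueError(f"expected 4D latents [B, C, H, W], got {latents.ndim}D")
    b, c, h, w = latents.shape
    # 2x2 patch packing: [B, C, H, W] -> [B, (H//2)*(W//2), C*4].
    packed = (
        latents.reshape(b, c, h // 2, 2, w // 2, 2)
        .permute(0, 2, 4, 1, 3, 5)
        .reshape(b, (h // 2) * (w // 2), c * 4)
    )
    guidance = None
    if use_guidance_embeds:
        # One guidance value per sample, shape [B].
        guidance = torch.full((b,), guidance_scale,
                              dtype=latents.dtype, device=latents.device)
    return {
        "hidden_states": packed,
        "encoder_hidden_states": ctx.text_embeddings,  # hypothetical attribute
        "encoder_hidden_states_mask": None,
        "timestep": ctx.timesteps,  # assumed already in [0, 1]
        "img_shapes": [[(1, h // 2, w // 2)] for _ in range(b)],
        "guidance": guidance,
    }


ctx = SimpleNamespace(
    noisy_latents=torch.randn(2, 16, 8, 12),
    text_embeddings=torch.randn(2, 7, 32),
    timesteps=torch.rand(2),
)
out = prepare_inputs_sketch(ctx, use_guidance_embeds=True)
print(out["hidden_states"].shape)  # torch.Size([2, 24, 64])
```
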