nemo_automodel.components.models.diffusion_gemma.model

NeMo Automodel support for diffusion_gemma (block diffusion).

Architecture (design v2 item 1) — ONE shared parameter stack run twice:

1. Run the decoder layers once causally over the clean full sequence to build a per-layer read-only KV cache (the “encoder” KV). The text encoder is causal because use_bidirectional_attention == "vision" (not "all"); a single causal pass over the clean full sequence reproduces the per-position KV that block-by-block inference builds.
2. Run the same layers once bidirectionally over the noised canvas (the response region), each layer concatenating [encoder_KV ; canvas_KV] on the key axis and using the block-causal training mask from attention_mask.build_block_diffusion_training_mask.
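
The key-axis layout of the second pass can be illustrated with a toy boolean mask. This is only a sketch and not the library's build_block_diffusion_training_mask: the helper name toy_block_diffusion_mask, its arguments, and the visibility rule (a canvas query sees the clean encoder keys that precede its block, plus every canvas key inside its own block) are assumptions chosen to show how [encoder_KV ; canvas_KV] might be masked.

```python
def toy_block_diffusion_mask(seq_len, prompt_len, block_size):
    """Boolean mask of shape [canvas_len][seq_len + canvas_len]; True = may attend.

    Columns 0..seq_len-1 stand for clean encoder keys, the remaining columns
    for noised canvas keys. Hypothetical helper, not the library function.
    """
    canvas_len = seq_len - prompt_len
    mask = []
    for q in range(canvas_len):
        block = q // block_size
        # Absolute position at which this query's block begins.
        block_start = prompt_len + block * block_size
        # Clean keys: only positions strictly before the query's block.
        encoder_part = [k < block_start for k in range(seq_len)]
        # Canvas keys: bidirectional, but only within the same block.
        canvas_part = [k // block_size == block for k in range(canvas_len)]
        mask.append(encoder_part + canvas_part)
    return mask


mask = toy_block_diffusion_mask(seq_len=6, prompt_len=2, block_size=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions no canvas query can see the clean copy of its own block or of any later block, which is the property a leakage test would need to check.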

A single shared stack (rather than tied-but-separate encoder/decoder modules) keeps the model visible to Automodel’s MoE FSDP grad-sync (MoEFSDPSyncMixin / _iter_fsdp_modules assume a single model.layers stack with block.moe.experts) and avoids FSDP2 double-sharding of tied storage. The lm_head is tied to model.embed_tokens.

Self-conditioning (decoder-only, Analog-Bits two-pass) is encapsulated in the training forward so the recipe still calls model(**batch) once.

Module Contents

Classes

Name	Description
`DiffusionGemmaBackbone`	Single shared Gemma MoE transformer stack run causally then bidirectionally.
`DiffusionGemmaForBlockDiffusion`	Block-diffusion Gemma MoE model for SFT.
`DiffusionGemmaOutput`	Training forward output.

Functions

Name	Description
`_make_causal_additive_mask`	Build an additive causal (optionally sliding-window) mask for the encoder.
`_make_missing`	-
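
As an illustration of what an additive causal mask with an optional sliding window looks like, here is a minimal pure-Python sketch. The name toy_causal_additive_mask and its signature are hypothetical and do not reflect the actual arguments of `_make_causal_additive_mask`; tensor implementations often use the dtype's most negative finite value rather than -inf.

```python
import math


def toy_causal_additive_mask(seq_len, sliding_window=None):
    """Return a [seq_len][seq_len] list: 0.0 where attention is allowed, -inf elsewhere."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            allowed = k <= q  # causal: a query never sees future keys
            if sliding_window is not None:
                # Keep only the most recent `sliding_window` keys (including q itself).
                allowed = allowed and (q - k) < sliding_window
            row.append(0.0 if allowed else -math.inf)
        mask.append(row)
    return mask


windowed = toy_causal_additive_mask(4, sliding_window=2)
full = toy_causal_additive_mask(3)
```

The mask is added to the attention scores before the softmax, so -inf entries receive zero attention weight.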

Data

ModelClass

_TRANSFORMERS_AVAILABLE

API

class nemo_automodel.components.models.diffusion_gemma.model.DiffusionGemmaBackbone(
    config: transformers.models.diffusion_gemma.configuration_diffusion_gemma.DiffusionGemmaTextConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None
)

Bases: Module

Single shared Gemma MoE transformer stack run causally then bidirectionally.

Exposes layers (a ModuleDict keyed by string layer index), embed_tokens, norm, self_conditioning and rotary_emb. The layers / embed_tokens names are what MoEFSDPSyncMixin and the FSDP2 sharding path key on.

embed_tokens

layer_types

= config.layer_types

layers

moe_config

= _build_moe_config(config, moe_config)

norm

padding_idx

= getattr(config, 'pad_token_id', None)

rotary_emb

= DiffusionGemmaTextRotaryEmbedding(config)

self_conditioning

= DiffusionGemmaSelfConditioning(config)

vocab_size

= config.vocab_size

nemo_automodel.components.models.diffusion_gemma.model.DiffusionGemmaBackbone._position_embeddings(
    hidden_states: torch.Tensor,
    position_ids: torch.Tensor
) -> dict

nemo_automodel.components.models.diffusion_gemma.model.DiffusionGemmaBackbone.decode(
    canvas_ids: torch.Tensor,
    encoder_kv: list[tuple[torch.Tensor, torch.Tensor]],
    decoder_position_ids: torch.Tensor,
    decoder_masks: dict,
    decoder_padding_mask: torch.Tensor | None = None,
    self_conditioning_logits: torch.Tensor | None = None,
    self_conditioning_mask: torch.Tensor | None = None
) -> torch.Tensor

Bidirectional pass over the noised canvas with cross-attention to the encoder KV cache. Returns the final (normed) hidden states.

self_conditioning_mask ([B] bool, training only) gates the self-cond branch PER EXAMPLE: examples with False get a zeroed soft-embedding (identical to the no-self-cond path), so a single always-on pass-1 can serve Google’s per-example conditioned / zero-conditioned mix.
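
The per-example gating can be sketched without tensors. The code below assumes the soft embedding is a probability-weighted average of token embeddings, which is an assumption about DiffusionGemmaSelfConditioning rather than something this page states; gated_soft_embedding and its arguments are hypothetical names.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def gated_soft_embedding(logits, embedding, keep):
    """logits: [B][T][V], embedding: [V][H], keep: [B] bools -> [B][T][H]."""
    hidden = len(embedding[0])
    out = []
    for example_logits, use_self_cond in zip(logits, keep):
        rows = []
        for token_logits in example_logits:
            if not use_self_cond:
                # Zeroed soft embedding: equivalent to the no-self-cond path.
                rows.append([0.0] * hidden)
                continue
            probs = softmax(token_logits)
            rows.append(
                [sum(p * embedding[v][h] for v, p in enumerate(probs)) for h in range(hidden)]
            )
        out.append(rows)
    return out


embedding = [[1.0, 0.0], [0.0, 1.0]]           # V = 2, H = 2
logits = [[[0.0, 0.0]], [[0.0, 0.0]]]          # B = 2, T = 1
soft = gated_soft_embedding(logits, embedding, keep=[True, False])
```

In tensor code the same effect would typically come from multiplying the soft embedding by the mask broadcast over the token and hidden axes, so pass 1 can run for the whole batch and only the gated examples consume its result.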

nemo_automodel.components.models.diffusion_gemma.model.DiffusionGemmaBackbone.encode(
    input_ids: torch.Tensor,
    position_ids: torch.Tensor,
    padding_mask: torch.Tensor | None,
    return_hidden: bool = False
)

Causal pass over the clean full sequence that produces a per-layer (K, V) cache.

When return_hidden is True, also returns the final normed hidden states [B, S, H] (so the caller can produce the encoder’s autoregressive logits for the co-trained AR loss). Default False keeps the KV-only contract used by inference and the parity/leakage tests.

nemo_automodel.components.models.diffusion_gemma.model.DiffusionGemmaBackbone.forward(
    mode: str,
    input_ids: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    return_hidden: bool = False,
    canvas_ids: torch.Tensor | None = None,
    encoder_kv: list[tuple[torch.Tensor, torch.Tensor]] | None = None,
    decoder_position_ids: torch.Tensor | None = None,
    decoder_masks: dict | None = None,
    decoder_padding_mask: torch.Tensor | None = None,
    self_conditioning_logits: torch.Tensor | None = None,
    self_conditioning_mask: torch.Tensor | None = None
) -> list[tuple[torch.Tensor, torch.Tensor]] | torch.Tensor

Dispatch encode/decode through nn.Module.__call__ for FSDP hooks.

FSDP2 hooks are installed on module calls, not on arbitrary helper methods. The block-diffusion top-level forward must therefore enter the backbone via self.model(...) so root-owned parameters such as self_conditioning and the final norm are gathered before use.
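
Why the dispatch has to go through the module call can be shown with a toy stand-in. ToyModule below is a deliberately simplified imitation of torch.nn.Module (real hooks receive more arguments), and the gather/reshard functions only mimic what FSDP2 hooks do; all names here are hypothetical.

```python
class ToyModule:
    """Simplified stand-in for torch.nn.Module: hooks fire only inside __call__."""

    def __init__(self):
        self._pre_hooks = []
        self._post_hooks = []

    def register_forward_pre_hook(self, hook):
        self._pre_hooks.append(hook)

    def register_forward_hook(self, hook):
        self._post_hooks.append(hook)

    def __call__(self, *args, **kwargs):
        for hook in self._pre_hooks:
            hook(self)
        try:
            return self.forward(*args, **kwargs)
        finally:
            for hook in self._post_hooks:
                hook(self)


class ToyBackbone(ToyModule):
    def __init__(self):
        super().__init__()
        self.gathered = False  # pretend the root parameters start out sharded

    def _require_params(self):
        if not self.gathered:
            raise RuntimeError("root parameters used while still sharded")

    def encode(self, x):
        self._require_params()
        return [v * 2 for v in x]

    def decode(self, x):
        self._require_params()
        return [v + 1 for v in x]

    def forward(self, mode, x):
        # Single entry point, so both passes run inside the hooks.
        if mode == "encode":
            return self.encode(x)
        if mode == "decode":
            return self.decode(x)
        raise ValueError(f"unknown mode: {mode!r}")


def _gather(module):
    module.gathered = True


def _reshard(module):
    module.gathered = False


backbone = ToyBackbone()
backbone.register_forward_pre_hook(_gather)
backbone.register_forward_hook(_reshard)

print(backbone("encode", [1, 2]))  # goes through __call__, so the hooks run
```

Calling backbone.encode(...) directly in this toy raises, because nothing gathered the parameters first; routing through backbone("encode", ...) avoids that.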

nemo_automodel.components.models.diffusion_gemma.model.DiffusionGemmaBackbone.get_input_embeddings() -> torch.nn.Module

nemo_automodel.components.models.diffusion_gemma.model.DiffusionGemmaBackbone.set_input_embeddings(
    value: torch.nn.Module
) -> None

class nemo_automodel.components.models.diffusion_gemma.model.DiffusionGemmaForBlockDiffusion(
    config: transformers.models.diffusion_gemma.configuration_diffusion_gemma.DiffusionGemmaConfig,
    moe_config: 'MoEConfig | None' = None,
    backend: 'BackendConfig | None' = None,
    canvas_length: int | None = None,
    self_conditioning: bool | None = None,
    freeze_router: bool | None = None,
    kwargs: typing.Any = {}
)

Bases: HFCheckpointingMixin, MoEFSDPSyncMixin, PreTrainedModel

Block-diffusion Gemma MoE model for SFT.

Inherits the Automodel (AM) checkpointing and MoE-FSDP machinery. The MoE backbone is reused from gemma4_moe; the diffusion training forward and the two-pass self-conditioning are new. See the module docstring for the single-shared-stack design.

forward is the SFT training forward. A generation/inference loop is deferred: it would encode the prompt once, then iteratively denoise canvas blocks while reusing the KV cache, with the self-conditioning recycling loop. The model.encode / model.decode building blocks are the reusable pieces for it, and forward already accepts an explicit self_conditioning_logits for the per-step inference contract.

_keep_in_fp32_modules

= ['rotary_emb']

_no_split_modules

= ['DiffusionGemmaMoEDecoderLayer']

_tied_weights_keys

= ['lm_head.weight']

backend

= backend or BackendConfig()

base_model_prefix

= 'model'

canvas_length

= int(getattr(config, 'canvas_length', 256))

final_logit_softcapping

= text_config.final_logit_softcapping

freeze_router

lm_head

model

moe_config

= self.model.moe_config

self_conditioning

state_dict_adapter

vocab_size

= text_config.vocab_size

nemo_automodel.components.models.diffusion_gemma.model.DiffusionGemmaForBlockDiffusion._softcap_logits(
    hidden_states: torch.Tensor
) -> torch.Tensor
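
For orientation, tanh soft-capping of logits as used in other Gemma-family models can be sketched as follows. That `_softcap_logits` applies exactly this formula after the lm_head projection is an assumption; toy_softcap is a hypothetical scalar version operating on a list of logits.

```python
import math


def toy_softcap(logits, cap):
    """Squash logits smoothly into [-cap, cap]; a cap of None disables capping."""
    if cap is None:
        return list(logits)
    return [cap * math.tanh(x / cap) for x in logits]


capped = toy_softcap([0.0, 0.3, 1000.0, -1000.0], cap=30.0)
uncapped = toy_softcap([1000.0], cap=None)
```

Small logits pass through almost unchanged, while large ones saturate near the cap instead of growing without bound.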

nemo_automodel.components.models.diffusion_gemma.model.DiffusionGemmaForBlockDiffusion.forward(
    input_ids: torch.Tensor | None = None,
    canvas_ids: torch.Tensor | None = None,
    self_conditioning_logits: torch.Tensor | None = None,
    encoder_position_ids: torch.Tensor | None = None,
    encoder_padding_mask: torch.Tensor | None = None,
    decoder_position_ids: torch.Tensor | None = None,
    decoder_attention_mask: dict | None = None,
    decoder_padding_mask: torch.Tensor | None = None,
    do_self_conditioning: torch.Tensor | bool | None = None,
    kwargs: typing.Any = {}
) -> 'DiffusionGemmaOutput'

Training forward: runs the single shared stack twice and applies two-pass self-conditioning.

Parameters:

input_ids