bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion#

Generation utilities for Megatron GPTModel with NemotronLabsDiffusionAttention inference support.

Supports:

  • AR (autoregressive) generation with KV cache

  • dLLM (block diffusion) generation with prefix cache + iterative denoising

The KV cache lives inside each layer’s NemotronLabsDiffusionAttention via its _inference_mode / kv_cache* attributes. No Megatron InferenceContext is used.

Module Contents#

Functions#

add_gumbel_noise

Apply Gumbel noise to logits for stochastic sampling.

get_num_transfer_tokens

Compute the number of tokens to unmask at each diffusion step.

get_transfer_index

Select token indices to transfer from masked to unmasked at each diffusion step.

_unwrap

Unwrap Float16Module, DDP, or VLM wrappers to get the raw GPTModel.

_get_core_attentions

Return list of NemotronLabsDiffusionAttention modules from the Megatron GPT model.

_set_inference_mode

_set_inference_params

_clear_kv_cache

set_tp_group

Set the TP process group for _model_forward token broadcasts.

_tp_send_cmd

Broadcast a command to TP followers (no-op if TP=1).

_broadcast_tensor

Broadcast shape then data so all peers have an identically-shaped tensor.

_model_forward

Call GPTModel.forward with minimal args (no inference_context).

generate_ar

Standard left-to-right autoregressive generation with KV cache.

generate_dllm

Block-diffusion generation with prefix KV cache.

tp_send_stop

Tell TP followers to exit their loop.

tp_follower_loop

Blocking loop for TP-non-rank-0 processes.

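Data#

_TP_GROUP

_TP_SRC_GLOBAL_RANK

_CMD_FORWARD

_CMD_SET_INF_MODE_ON

_CMD_SET_INF_MODE_OFF

_CMD_SET_PARAMS

_CMD_CLEAR_CACHE

_CMD_STOP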

API#

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.add_gumbel_noise(logits, temperature)#

Apply Gumbel noise to logits for stochastic sampling.
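As an illustration of the idea, the sketch below applies the standard Gumbel-max trick to a plain Python list of logits. It is not the module's implementation: the name `add_gumbel_noise_sketch`, the `rng` argument, and the exact noise formula are assumptions, and the real function presumably operates on tensors.

```python
import math
import random


def add_gumbel_noise_sketch(logits, temperature, rng=None):
    """Hypothetical stand-in: perturb logits with Gumbel(0, 1) noise.

    Taking the argmax of the result samples from softmax(logits / temperature).
    With temperature == 0 the logits are returned unchanged (greedy decoding).
    """
    if temperature == 0:
        return list(logits)
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append(logit / temperature + gumbel)
    return noisy
```

Calling `add_gumbel_noise_sketch([0.0, math.log(3.0)], 1.0)` repeatedly and taking the argmax should pick index 1 roughly three times out of four, which matches the softmax probabilities of those logits.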

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.get_num_transfer_tokens(mask_index, steps)#

Compute the number of tokens to unmask at each diffusion step.
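One plausible schedule, common in masked-diffusion samplers, spreads the masked tokens as evenly as possible over the steps and gives the remainder to the earliest steps. The sketch below assumes that scheme and works on a single integer count rather than a `mask_index` tensor; `num_transfer_tokens_sketch` is a hypothetical name.

```python
def num_transfer_tokens_sketch(num_masked, steps):
    """Split `num_masked` tokens across `steps` as evenly as possible.

    Assumption: earlier steps absorb the remainder, so the per-step counts
    differ by at most one and always sum to `num_masked`.
    """
    base, remainder = divmod(num_masked, steps)
    return [base + (1 if step < remainder else 0) for step in range(steps)]
```

For example, 10 masked tokens over 4 steps gives `[3, 3, 2, 2]` under this assumption.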

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.get_transfer_index(
logits,
temperature,
remasking,
mask_index,
x,
num_transfer_tokens,
threshold=None,
neg_entropy=False,
)#

Select token indices to transfer from masked to unmasked at each diffusion step.
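The selection step can be pictured as a top-k over the still-masked positions, ranked by a confidence score. The sketch below is a simplified, list-based illustration under that assumption; `neg_entropy_score` and `select_transfer_indices` are hypothetical helpers, and the real function also handles `temperature`, `remasking`, and `threshold`.

```python
import math


def neg_entropy_score(probs):
    """Negative entropy of a distribution: closer to 0 means more confident."""
    return sum(p * math.log(p) for p in probs if p > 0)


def select_transfer_indices(confidence, is_masked, k):
    """Return the k masked positions with the highest confidence, in order."""
    candidates = [i for i, masked in enumerate(is_masked) if masked]
    candidates.sort(key=lambda i: confidence[i], reverse=True)
    return sorted(candidates[:k])
```

Position 0 in the call `select_transfer_indices([0.9, 0.2, 0.8, 0.5], [False, True, True, True], 2)` is ignored because it is already unmasked, so the two most confident masked positions, 2 and 3, are selected.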

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._unwrap(model)#

Unwrap Float16Module, DDP, or VLM wrappers to get the raw GPTModel.

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._get_core_attentions(model)#

Return list of NemotronLabsDiffusionAttention modules from the Megatron GPT model.

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._set_inference_mode(model, enabled: bool)#

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._set_inference_params(model, causal: bool, cache_enabled: bool)#

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._clear_kv_cache(model)#

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._TP_GROUP#

None

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._TP_SRC_GLOBAL_RANK#

0

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_FORWARD#

1

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_SET_INF_MODE_ON#

2

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_SET_INF_MODE_OFF#

3

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_SET_PARAMS#

4

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_CLEAR_CACHE#

5

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_STOP#

0

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.set_tp_group(group, src_global_rank=0)#

Set the TP process group for _model_forward token broadcasts.

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._tp_send_cmd(cmd, extra=None)#

Broadcast a command to TP followers (no-op if TP=1).

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._broadcast_tensor(tensor, src, group)#

Broadcast shape then data so all peers have an identically-shaped tensor.

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._model_forward(model, input_ids)#

Call GPTModel.forward with minimal args (no inference_context).

When a TP group is set, broadcasts input_ids from the TP-rank-0 process so all TP peers call forward with identical inputs.

Parameters:
  • model – Megatron GPTModel (already on CUDA).

  • input_ids – [batch, seq_len] token ids.

Returns:

logits – [batch, seq_len, vocab_size].

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.generate_ar(
model,
prompt: torch.Tensor,
max_new_tokens: int = 128,
temperature: float = 0.0,
eos_token_id: int = None,
)#

Standard left-to-right autoregressive generation with KV cache.

Parameters:
  • model – Megatron GPTModel on CUDA.

  • prompt – [batch, prompt_len] token ids.

  • max_new_tokens – number of tokens to generate.

  • temperature – sampling temperature (0 = greedy).

  • eos_token_id – stop generation when this token is produced.

Returns:

generated – [batch, prompt_len + new_tokens] full sequence.
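The control flow of cached autoregressive decoding can be sketched without any model at all. In the toy below, `toy_forward` is a hypothetical stand-in for a forward pass: it appends the new tokens to a list that plays the role of the KV cache and predicts "last token + 1". Only the loop structure (prefill once, then feed one token per step, stop on EOS) is meant to carry over.

```python
def toy_forward(new_tokens, kv_cache):
    """Hypothetical model step: extend the cache, predict last token + 1."""
    kv_cache.extend(new_tokens)  # stand-in for per-layer key/value tensors
    return (kv_cache[-1] + 1) % 50


def generate_ar_sketch(prompt, max_new_tokens=128, eos_token_id=None):
    """Greedy left-to-right decoding for a single sequence of token ids."""
    kv_cache = []
    generated = list(prompt)
    pending = list(prompt)  # first call is the prefill over the whole prompt
    for _ in range(max_new_tokens):
        next_token = toy_forward(pending, kv_cache)
        generated.append(next_token)
        if eos_token_id is not None and next_token == eos_token_id:
            break
        pending = [next_token]  # later calls only feed the newest token
    return generated
```

With this toy model, `generate_ar_sketch([3, 4], max_new_tokens=3)` returns `[3, 4, 5, 6, 7]`, and passing `eos_token_id=6` stops one token earlier.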

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.generate_dllm(
model,
prompt: torch.Tensor,
gen_length: int = 128,
block_length: int = 32,
steps: int = 128,
temperature: float = 0.0,
remasking: str = 'low_confidence',
mask_id: int = 100,
threshold: float = None,
shift_logits: bool = True,
neg_entropy: bool = True,
)#

Block-diffusion generation with prefix KV cache.

Replicates generate_with_prefix_cache_block_diff_sbd from the original eval.py but uses NemotronLabsDiffusionAttention’s inference mode instead of HF past_key_values.

Parameters:
  • model – Megatron GPTModel on CUDA.

  • prompt – [batch, prompt_len] token ids.

  • gen_length – total number of tokens to generate.

  • block_length – size of each denoising block.

  • steps – total denoising steps across all blocks.

  • temperature – sampling temperature for Gumbel noise.

  • remasking – remasking strategy (“low_confidence” or “random”).

  • mask_id – mask token id.

  • threshold – optional denoising confidence threshold.

  • shift_logits – if True (the default), use Dream-style shifted logits (position i-1 predicts token i, i.e. next-token prediction). If False, each masked position’s logits predict its own token directly. For dLLM this should typically be False; for AR-style models use True.

  • neg_entropy – if True, use negative entropy for confidence scoring.
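To make the shift_logits option concrete, the sketch below shows one common way to realign per-position outputs so that slot i carries the prediction made at position i-1. It uses a plain list instead of a logits tensor, and keeping slot 0 unchanged is an assumption borrowed from Dream-style samplers, not something this page states; `shift_predictions` is a hypothetical name.

```python
def shift_predictions(per_position, shift_logits=True):
    """Align per-position model outputs with the tokens they predict.

    shift_logits=True:  slot i receives the output computed at position i-1.
    shift_logits=False: slot i keeps the output computed at position i.
    """
    if not shift_logits:
        return list(per_position)
    # Slot 0 has no left neighbour, so it keeps its own output (assumption).
    return per_position[:1] + per_position[:-1]
```

For outputs `["p0", "p1", "p2", "p3"]`, the shifted alignment is `["p0", "p0", "p1", "p2"]`.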

Returns:

x_accum – [batch, prompt_len + gen_length] full sequence with generated tokens.

nfe – number of forward evaluations used.
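The block-by-block denoising loop can be sketched with a toy denoiser. Everything model-related here is hypothetical: `toy_denoiser` invents a token and a confidence score, the step budget is assumed to be split evenly across blocks, and the prefix KV cache is omitted. The sketch only illustrates how gen_length, block_length, and steps could interact and how nfe might be counted.

```python
MASK_ID = 100  # matches the documented default for mask_id


def toy_denoiser(tokens, position):
    """Hypothetical model: predicts position % 7 with a made-up confidence."""
    return position % 7, 1.0 / (1 + position)


def block_diffusion_sketch(prompt, gen_length=8, block_length=4, steps=4):
    """Fill gen_length masked slots one block at a time, left to right."""
    assert gen_length % block_length == 0
    num_blocks = gen_length // block_length
    assert steps % num_blocks == 0
    steps_per_block = steps // num_blocks  # assumption: even split of steps

    x = list(prompt) + [MASK_ID] * gen_length
    nfe = 0  # one "forward evaluation" per denoising step
    for block in range(num_blocks):
        start = len(prompt) + block * block_length
        end = start + block_length
        base, extra = divmod(block_length, steps_per_block)
        for step in range(steps_per_block):
            masked = [i for i in range(start, end) if x[i] == MASK_ID]
            if not masked:
                break
            guesses = {i: toy_denoiser(x, i) for i in masked}
            nfe += 1
            # Unmask the k most confident positions in this step.
            k = base + (1 if step < extra else 0)
            ranked = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
            for i in ranked[:k]:
                x[i] = guesses[i][0]
    return x, nfe
```

With the defaults above and `prompt=[1, 2]`, each of the two blocks takes two steps and unmasks two positions per step, so the sketch performs four forward evaluations in total.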

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.tp_send_stop()#

Tell TP followers to exit their loop.

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.tp_follower_loop(model)#

Blocking loop for TP-non-rank-0 processes.

Waits for commands from the TP-rank-0 process and mirrors all model operations (set_inference_mode, set_inference_params, clear_kv_cache, model forward) so Megatron TP communication stays in sync.
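The follower side of this protocol amounts to a blocking dispatch loop. The sketch below reuses the command ids documented above but replaces the TP broadcast with a `queue.Queue` and the model with a recording stub, so `FakeModel`, `follower_loop_sketch`, and the `(cmd, extra)` message shape are all assumptions made for illustration.

```python
import queue

# Command ids as documented for this module.
CMD_STOP = 0
CMD_FORWARD = 1
CMD_SET_INF_MODE_ON = 2
CMD_SET_INF_MODE_OFF = 3
CMD_SET_PARAMS = 4
CMD_CLEAR_CACHE = 5


class FakeModel:
    """Hypothetical stand-in that only records which operations were mirrored."""

    def __init__(self):
        self.calls = []


def follower_loop_sketch(model, channel):
    """Block on `channel` and mirror each command until CMD_STOP arrives."""
    while True:
        cmd, extra = channel.get()  # stand-in for a blocking TP broadcast
        if cmd == CMD_STOP:
            break
        if cmd == CMD_FORWARD:
            model.calls.append(("forward", extra))
        elif cmd == CMD_SET_INF_MODE_ON:
            model.calls.append(("set_inference_mode", True))
        elif cmd == CMD_SET_INF_MODE_OFF:
            model.calls.append(("set_inference_mode", False))
        elif cmd == CMD_SET_PARAMS:
            model.calls.append(("set_inference_params", extra))
        elif cmd == CMD_CLEAR_CACHE:
            model.calls.append(("clear_kv_cache", None))
        else:
            raise ValueError(f"unknown command id: {cmd}")
```

Because every rank must enter the same collective operations in the same order, the leader would send one command per mirrored operation and finish with the stop command, which is the role tp_send_stop plays here.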