`bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion`#

Generation utilities for Megatron GPTModel with NemotronLabsDiffusionAttention inference support.

Supports:

AR (autoregressive) generation with KV cache
dLLM (block diffusion) generation with prefix cache + iterative denoising

The KV cache lives inside each layer’s NemotronLabsDiffusionAttention via its _inference_mode / kv_cache* attributes. No Megatron InferenceContext is used.
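To make the layer-local cache idea concrete, here is a minimal pure-Python sketch, not the module's implementation. `ToyAttention`, `kv_cache_tokens`, `cache_enabled`, and `causal` are hypothetical stand-ins; only `_inference_mode` is a name taken from the description above. The point it illustrates is that cache state is held and toggled on each attention layer, with no external context object threaded through the forward call.

```python
from typing import List


class ToyAttention:
    """Stand-in for an attention layer that owns its own cache state."""

    def __init__(self) -> None:
        self._inference_mode = False            # name mentioned in the docs above
        self.causal = True                      # hypothetical attribute
        self.cache_enabled = False              # hypothetical attribute
        self.kv_cache_tokens: List[int] = []    # hypothetical stand-in for kv_cache*

    def forward(self, tokens: List[int]) -> int:
        """Return how many positions this call can attend over."""
        if not (self._inference_mode and self.cache_enabled):
            return len(tokens)                  # no caching: only the input is visible
        self.kv_cache_tokens.extend(tokens)     # append new positions to the layer cache
        return len(self.kv_cache_tokens)        # cached prefix + new positions


def set_inference_mode(layers: List[ToyAttention], enabled: bool) -> None:
    for layer in layers:
        layer._inference_mode = enabled


def set_inference_params(layers: List[ToyAttention], causal: bool, cache_enabled: bool) -> None:
    for layer in layers:
        layer.causal = causal
        layer.cache_enabled = cache_enabled


def clear_kv_cache(layers: List[ToyAttention]) -> None:
    for layer in layers:
        layer.kv_cache_tokens.clear()


layers = [ToyAttention(), ToyAttention()]
set_inference_mode(layers, True)
set_inference_params(layers, causal=True, cache_enabled=True)
prefill = [layer.forward([11, 12, 13]) for layer in layers]   # prompt pass
decode = [layer.forward([14]) for layer in layers]            # one new token
clear_kv_cache(layers)
```

After the prompt pass each toy layer sees three positions, and after the single-token pass it sees four, because the prefix was kept in the layer itself.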

Module Contents#

Functions#

`_unwrap`	Unwrap Float16Module, DDP, or VLM wrappers to get the raw GPTModel.
`_get_core_attentions`	Return list of NemotronLabsDiffusionAttention modules from the Megatron GPT model.
`_set_inference_mode`
`_set_inference_params`
`_clear_kv_cache`
`set_tp_group`	Set the TP process group for _model_forward token broadcasts.
`_tp_send_cmd`	Broadcast a command to TP followers (no-op if TP=1).
`_broadcast_tensor`	Broadcast shape then data so all peers have an identically-shaped tensor.
`_model_forward`	Call GPTModel.forward with minimal args (no inference_context).
`generate_ar`	Standard left-to-right autoregressive generation with KV cache.
`generate_dllm`	Block-diffusion generation with prefix KV cache.
`tp_send_stop`	Tell TP followers to exit their loop.
`tp_follower_loop`	Blocking loop for TP-non-rank-0 processes.

Data#

`__all__`
`_TP_GROUP`
`_TP_SRC_GLOBAL_RANK`
`_CMD_FORWARD`
`_CMD_SET_INF_MODE_ON`
`_CMD_SET_INF_MODE_OFF`
`_CMD_SET_PARAMS`
`_CMD_CLEAR_CACHE`
`_CMD_STOP`

API#

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.__all__#: ['add_gumbel_noise', 'get_num_transfer_tokens', 'get_transfer_index', 'generate_ar', 'generate_dllm'…

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._unwrap(model)#: Unwrap Float16Module, DDP, or VLM wrappers to get the raw GPTModel.

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._get_core_attentions(model)#: Return list of NemotronLabsDiffusionAttention modules from the Megatron GPT model.

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._set_inference_mode(model, enabled: bool)#

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._set_inference_params(model, causal: bool, cache_enabled: bool)#

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._clear_kv_cache(model)#

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._TP_GROUP#: None

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._TP_SRC_GLOBAL_RANK#: 0

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_FORWARD#: 1

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_SET_INF_MODE_ON#: 2

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_SET_INF_MODE_OFF#: 3

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_SET_PARAMS#: 4

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_CLEAR_CACHE#: 5

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_STOP#: 0

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.set_tp_group(group, src_global_rank=0)#: Set the TP process group for _model_forward token broadcasts.

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._tp_send_cmd(cmd, extra=None)#: Broadcast a command to TP followers (no-op if TP=1).

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._broadcast_tensor(tensor, src, group)#: Broadcast shape then data so all peers have an identically-shaped tensor.
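The "shape then data" order matters because a receiver in a collective broadcast must allocate a buffer of the right size before the payload arrives. The sketch below simulates that two-message protocol with a plain Python list standing in for the communication channel. `send_tensor`, `recv_tensor`, and the list-based channel are hypothetical illustrations and do not reflect how `_broadcast_tensor` is actually written.

```python
from math import prod
from typing import List, Tuple


def send_tensor(channel: List[list], shape: Tuple[int, ...], flat: List[float]) -> None:
    """Sender side: publish the shape first, then the flattened payload."""
    if prod(shape) != len(flat):
        raise ValueError("shape does not match payload length")
    channel.append(list(shape))   # message 1: shape
    channel.append(list(flat))    # message 2: data


def recv_tensor(channel: List[list]) -> Tuple[Tuple[int, ...], List[float]]:
    """Receiver side: read the shape, size a buffer, then fill it."""
    shape = tuple(channel.pop(0))
    buffer = [0.0] * prod(shape)  # buffer can only be sized once the shape is known
    payload = channel.pop(0)
    if len(payload) != len(buffer):
        raise ValueError("payload does not match announced shape")
    buffer[:] = payload
    return shape, buffer


channel: List[list] = []
send_tensor(channel, (2, 3), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
received_shape, received_data = recv_tensor(channel)
```

In a real tensor-parallel setup the two messages would be collective broadcasts over the process group rather than list operations.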

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._model_forward(model, input_ids)#

Call GPTModel.forward with minimal args (no inference_context).

When a TP group is set, broadcasts input_ids from the TP-rank-0 process so all TP peers call forward with identical inputs.

Parameters:

model – Megatron GPTModel (already on CUDA).
input_ids – [batch, seq_len] token ids.

Returns:

logits – [batch, seq_len, vocab_size].

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.generate_ar(model, prompt: torch.Tensor, max_new_tokens: int = 128, temperature: float = 0.0, eos_token_id: int = None)#

Standard left-to-right autoregressive generation with KV cache.

Parameters:

model – Megatron GPTModel on CUDA.
prompt – [batch, prompt_len] token ids.
max_new_tokens – maximum number of new tokens to generate.
temperature – sampling temperature (0 = greedy).
eos_token_id – stop generation when this token is produced.

Returns:

generated – [batch, prompt_len + new_tokens] full sequence.
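The control flow implied by these parameters can be sketched in pure Python for a single sequence. This is an illustration under assumptions, not the function's code: `toy_generate_ar`, `sample_token`, and `counting_model` are hypothetical, the toy model recomputes from the full sequence instead of using a KV cache, and appending the EOS token before stopping is an assumed convention.

```python
import math
import random
from typing import Callable, List, Optional


def sample_token(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Greedy argmax when temperature is 0, otherwise softmax sampling."""
    if temperature == 0.0:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [value / temperature for value in logits]
    peak = max(scaled)                                  # subtract max for stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def toy_generate_ar(
    next_logits: Callable[[List[int]], List[float]],
    prompt: List[int],
    max_new_tokens: int = 128,
    temperature: float = 0.0,
    eos_token_id: Optional[int] = None,
    seed: int = 0,
) -> List[int]:
    rng = random.Random(seed)
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        token = sample_token(next_logits(sequence), temperature, rng)
        sequence.append(token)
        # Assumed convention: the EOS token is kept, then generation stops.
        if eos_token_id is not None and token == eos_token_id:
            break
    return sequence


def counting_model(sequence: List[int]) -> List[float]:
    """Toy 5-token vocabulary model that always prefers (last token + 1) mod 5."""
    target = (sequence[-1] + 1) % 5
    return [1.0 if index == target else 0.0 for index in range(5)]


stopped_early = toy_generate_ar(counting_model, [0], max_new_tokens=10, eos_token_id=3)
ran_to_limit = toy_generate_ar(counting_model, [0], max_new_tokens=6)
```

With `eos_token_id=3` the toy run stops after three new tokens even though the budget is ten, which is why `max_new_tokens` is best read as an upper bound.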

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.generate_dllm(model, prompt: torch.Tensor, gen_length: int = 128, block_length: int = 32, steps: int = 128, temperature: float = 0.0, remasking: str = 'low_confidence', mask_id: int = 100, threshold: float = None, shift_logits: bool = True, neg_entropy: bool = True)#

Block-diffusion generation with prefix KV cache.

Replicates generate_with_prefix_cache_block_diff_sbd from the original eval.py but uses NemotronLabsDiffusionAttention’s inference mode instead of HF past_key_values.

Parameters:

model – Megatron GPTModel on CUDA.
prompt – [batch, prompt_len] token ids.
gen_length – total number of tokens to generate.
block_length – size of each denoising block.
steps – total denoising steps across all blocks.
temperature – sampling temperature for Gumbel noise.
remasking – remasking strategy (“low_confidence” or “random”).
mask_id – mask token id.
threshold – optional denoising confidence threshold.
shift_logits – if True, use dream-style shifted logits (position i-1 predicts token i, i.e. next-token prediction). If False, each masked position’s logits predict its own token directly. For dLLM this should typically be False; for AR-style models use True.
neg_entropy – if True, use negative entropy for confidence scoring.
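The interaction between `shift_logits`, `neg_entropy`, and confidence-based selection is easier to see on a tiny example. The following is a hedged pure-Python sketch for one sequence. `align_logits`, `confidence`, and `pick_positions` are hypothetical helpers, and keeping row 0 in place when shifting is an assumed convention borrowed from Dream-style decoding, not something confirmed for this module.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def align_logits(rows: List[List[float]], shift_logits: bool) -> List[List[float]]:
    """With shift_logits, row i-1 serves as the prediction for position i."""
    if not shift_logits:
        return rows
    # Assumption: row 0 has no predecessor, so it is kept in place.
    return [rows[0]] + rows[:-1]


def confidence(probs: List[float], neg_entropy: bool) -> float:
    if neg_entropy:
        # Negative entropy: closer to 0 means a more peaked distribution.
        return sum(p * math.log(p) for p in probs if p > 0.0)
    return max(probs)  # otherwise use the top-1 probability


def pick_positions(rows, masked, k, shift_logits=True, neg_entropy=True) -> List[int]:
    """Return the k masked positions with the highest confidence."""
    aligned = align_logits(rows, shift_logits)
    scored = [(confidence(softmax(aligned[i]), neg_entropy), i) for i in masked]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return sorted(index for _, index in scored[:k])


rows = [
    [5.0, 0.0, 0.0],  # row 0: sharply peaked
    [0.0, 0.0, 0.0],  # row 1: flat
    [0.0, 4.0, 0.0],  # row 2: peaked
    [0.0, 0.0, 0.0],  # row 3: flat
]
masked = [1, 2, 3]
shifted_pick = pick_positions(rows, masked, k=1, shift_logits=True)
direct_pick = pick_positions(rows, masked, k=1, shift_logits=False)
```

With shifting, position 1 is scored using the peaked row 0 and is selected first. Without shifting, position 2 is scored with its own peaked row and wins instead, so the flag changes which tokens are committed at each step.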

Returns:

x_accum – [batch, prompt_len + gen_length] full sequence with generated tokens.
nfe – number of forward evaluations used.
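Because `steps` is a total across all blocks, the per-block budget follows from `gen_length` and `block_length`. The sketch below works through that arithmetic under the assumption that the function follows the LLaDA-style convention, in which the step budget is divided evenly across blocks and masked tokens are spread as evenly as possible over a block's steps. `split_steps` and `transfer_schedule` are hypothetical names, and the divisibility checks are assumptions.

```python
from typing import List, Tuple


def split_steps(gen_length: int, block_length: int, steps: int) -> Tuple[int, int]:
    """Return (num_blocks, steps_per_block) for an even split."""
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    if steps % num_blocks != 0:
        raise ValueError("steps must be a multiple of the number of blocks")
    return num_blocks, steps // num_blocks


def transfer_schedule(num_masked: int, steps_per_block: int) -> List[int]:
    """Tokens to unmask at each step; earlier steps absorb any remainder."""
    base, remainder = divmod(num_masked, steps_per_block)
    return [base + 1 if step < remainder else base for step in range(steps_per_block)]


# Signature defaults: gen_length=128, block_length=32, steps=128.
default_blocks, default_steps_per_block = split_steps(128, 32, 128)
default_schedule = transfer_schedule(32, default_steps_per_block)

# A smaller budget: 32 total steps over the same 4 blocks.
_, fast_steps_per_block = split_steps(128, 32, 32)
fast_schedule = transfer_schedule(32, fast_steps_per_block)
```

Under these assumptions the defaults give 4 blocks of 32 steps with one token unmasked per step, while `steps=32` gives 8 steps per block with four tokens unmasked per step.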

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.tp_send_stop()#: Tell TP followers to exit their loop.

bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.tp_follower_loop(model)#

Blocking loop for TP-non-rank-0 processes.

Waits for commands from the TP-rank-0 process and mirrors all model operations (set_inference_mode, set_inference_params, clear_kv_cache, model forward) so Megatron TP communication stays in sync.
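The command-mirroring pattern can be sketched as a dispatch loop over the `_CMD_*` values listed in the Data section. This is a single-process simulation, not the module's code: a `deque` stands in for the blocking command broadcast, and `ToyModel` and `toy_follower_loop` are hypothetical.

```python
from collections import deque
from typing import Any, Deque, List, Optional, Tuple

# Values mirror the _CMD_* constants documented above.
CMD_STOP = 0
CMD_FORWARD = 1
CMD_SET_INF_MODE_ON = 2
CMD_SET_INF_MODE_OFF = 3
CMD_SET_PARAMS = 4
CMD_CLEAR_CACHE = 5


class ToyModel:
    """Records the operations a follower rank would mirror."""

    def __init__(self) -> None:
        self.inference_mode = False
        self.params: Optional[Tuple[bool, bool]] = None
        self.cache: List[int] = []
        self.forward_calls = 0


def toy_follower_loop(model: ToyModel, inbox: Deque[Tuple[int, Any]]) -> None:
    """Apply commands in order until a stop command arrives."""
    while True:
        cmd, extra = inbox.popleft()       # stands in for a blocking broadcast receive
        if cmd == CMD_STOP:
            return
        if cmd == CMD_SET_INF_MODE_ON:
            model.inference_mode = True
        elif cmd == CMD_SET_INF_MODE_OFF:
            model.inference_mode = False
        elif cmd == CMD_SET_PARAMS:
            model.params = extra           # (causal, cache_enabled)
        elif cmd == CMD_CLEAR_CACHE:
            model.cache.clear()
        elif cmd == CMD_FORWARD:
            model.cache.extend(extra)      # toy forward: remember the token ids
            model.forward_calls += 1
        else:
            raise ValueError(f"unknown command: {cmd}")


inbox: Deque[Tuple[int, Any]] = deque([
    (CMD_SET_INF_MODE_ON, None),
    (CMD_SET_PARAMS, (True, True)),
    (CMD_FORWARD, [1, 2, 3]),
    (CMD_FORWARD, [4]),
    (CMD_CLEAR_CACHE, None),
    (CMD_SET_INF_MODE_OFF, None),
    (CMD_STOP, None),
    (CMD_FORWARD, [99]),                   # queued after stop, never processed
])
follower = ToyModel()
toy_follower_loop(follower, inbox)
```

Every state change on the leader has a matching command, so a follower that replays the stream reaches the same state and issues the same number of forward calls, which is what keeps collective communication aligned.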
