bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion#
Generation utilities for Megatron GPTModel with NemotronLabsDiffusionAttention inference support.
Supports:
- AR (autoregressive) generation with KV cache
- dLLM (block diffusion) generation with prefix cache + iterative denoising
The KV cache lives inside each layer’s NemotronLabsDiffusionAttention via its _inference_mode / kv_cache* attributes. No Megatron InferenceContext is used.
Module Contents#
Functions#
add_gumbel_noise | Apply Gumbel noise to logits for stochastic sampling.
get_num_transfer_tokens | Compute the number of tokens to unmask at each diffusion step.
get_transfer_index | Select token indices to transfer from masked to unmasked at each diffusion step.
_unwrap | Unwrap Float16Module, DDP, or VLM wrappers to get the raw GPTModel.
_get_core_attentions | Return list of NemotronLabsDiffusionAttention modules from the Megatron GPT model.
set_tp_group | Set the TP process group for _model_forward token broadcasts.
_tp_send_cmd | Broadcast a command to TP followers (no-op if TP=1).
_broadcast_tensor | Broadcast shape then data so all peers have an identically-shaped tensor.
_model_forward | Call GPTModel.forward with minimal args (no inference_context).
generate_ar | Standard left-to-right autoregressive generation with KV cache.
generate_dllm | Block-diffusion generation with prefix KV cache.
tp_send_stop | Tell TP followers to exit their loop.
tp_follower_loop | Blocking loop for TP-non-rank-0 processes.
Data#
API#
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.add_gumbel_noise(logits, temperature)#
Apply Gumbel noise to logits for stochastic sampling.
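The idea behind Gumbel-perturbed sampling can be sketched in plain Python, with lists standing in for tensors. This is the textbook Gumbel-max trick, not necessarily the exact formula this module uses; `gumbel_perturb` and `sample_token` are hypothetical names introduced only for illustration.

```python
import math
import random


def gumbel_perturb(logits, temperature, rng=None):
    """Add temperature-scaled Gumbel(0, 1) noise to each logit (illustrative helper)."""
    if temperature == 0:
        return list(logits)  # no noise: argmax stays greedy
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # clamp to avoid log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append(logit + temperature * gumbel)
    return noisy


def sample_token(logits, temperature, rng=None):
    """Pick the argmax of the perturbed logits."""
    noisy = gumbel_perturb(logits, temperature, rng)
    return max(range(len(noisy)), key=noisy.__getitem__)
```

For a positive temperature, taking the argmax of `logit + temperature * gumbel` is equivalent to sampling from `softmax(logits / temperature)`; at temperature 0 the sketch reduces to greedy decoding.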
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.get_num_transfer_tokens(mask_index, steps)#
Compute the number of tokens to unmask at each diffusion step.
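One common way to build such a schedule, assumed here rather than taken from this module, is to split the masked-token count evenly across the steps and hand the remainder to the earliest steps. The helper name `even_transfer_schedule` is hypothetical, and it works on a single integer count instead of a boolean `mask_index` tensor.

```python
def even_transfer_schedule(num_masked, steps):
    """Split num_masked tokens across steps as evenly as possible (illustrative)."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    base, remainder = divmod(num_masked, steps)
    # The first `remainder` steps unmask one extra token each.
    return [base + 1 if i < remainder else base for i in range(steps)]
```

With this scheme the per-step counts always sum to the number of masked tokens, so every masked position is filled by the final step.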
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.get_transfer_index(logits, temperature, remasking, mask_index, x, num_transfer_tokens, threshold=None, neg_entropy=False)#
Select token indices to transfer from masked to unmasked at each diffusion step.
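A minimal sketch of low-confidence remasking for a single sequence follows: among the still-masked positions, commit the ones the model is most confident about. The helper name `select_transfer_indices`, the per-position `confidence` list, and the threshold behaviour (commit everything at or above the threshold, but at least one token) are assumptions for illustration, not the module's confirmed logic.

```python
def select_transfer_indices(confidence, is_masked, num_transfer, threshold=None):
    """Return sorted indices of masked positions to unmask this step (illustrative)."""
    candidates = [i for i, masked in enumerate(is_masked) if masked]
    # Most confident masked positions first.
    ranked = sorted(candidates, key=lambda i: confidence[i], reverse=True)
    if threshold is not None:
        above = [i for i in ranked if confidence[i] >= threshold]
        # Assumed fallback: always commit at least one token so decoding progresses.
        chosen = above if above else ranked[:1]
    else:
        chosen = ranked[:num_transfer]
    return sorted(chosen)
```

Positions that are already unmasked are never selected, so earlier decisions are not overwritten in this sketch.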
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._unwrap(model)#
Unwrap Float16Module, DDP, or VLM wrappers to get the raw GPTModel.
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._get_core_attentions(model)#
Return list of NemotronLabsDiffusionAttention modules from the Megatron GPT model.
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._set_inference_mode(model, enabled: bool)#
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._set_inference_params(model, causal: bool, cache_enabled: bool)#
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._clear_kv_cache(model)#
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._TP_GROUP#
None
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._TP_SRC_GLOBAL_RANK#
0
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_FORWARD#
1
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_SET_INF_MODE_ON#
2
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_SET_INF_MODE_OFF#
3
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_SET_PARAMS#
4
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_CLEAR_CACHE#
5
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._CMD_STOP#
0
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.set_tp_group(group, src_global_rank=0)#
Set the TP process group for _model_forward token broadcasts.
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._tp_send_cmd(cmd, extra=None)#
Broadcast a command to TP followers (no-op if TP=1).
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._broadcast_tensor(tensor, src, group)#
Broadcast shape then data so all peers have an identically-shaped tensor.
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion._model_forward(model, input_ids)#
Call GPTModel.forward with minimal args (no inference_context).
When a TP group is set, broadcasts input_ids from the TP-rank-0 process so all TP peers call forward with identical inputs.
- Parameters:
model – Megatron GPTModel (already on CUDA).
input_ids – [batch, seq_len] token ids.
- Returns:
Logits of shape [batch, seq_len, vocab_size].
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.generate_ar(model, prompt: torch.Tensor, max_new_tokens: int = 128, temperature: float = 0.0, eos_token_id: int = None)#
Standard left-to-right autoregressive generation with KV cache.
- Parameters:
model – Megatron GPTModel on CUDA.
prompt – [batch, prompt_len] token ids.
max_new_tokens – maximum number of new tokens to generate.
temperature – sampling temperature (0 = greedy).
eos_token_id – stop generation when this token is produced.
- Returns:
generated – [batch, prompt_len + new_tokens] full sequence (the prompt followed by the generated tokens).
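The control flow of greedy (temperature 0) autoregressive decoding with an optional EOS stop can be sketched as below. The sketch handles a single sequence as a Python list and re-feeds the full sequence on every step, so it does not model the KV cache or batching; `generate_greedy` and `toy_model` are hypothetical stand-ins, not part of this module.

```python
def generate_greedy(next_token_logits, prompt, max_new_tokens=128, eos_token_id=None):
    """Append the argmax token until the budget is spent or EOS appears (illustrative)."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)  # logits for the next position only
        next_token = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_token)
        if eos_token_id is not None and next_token == eos_token_id:
            break
    return tokens


def toy_model(tokens, vocab_size=5):
    """Toy stand-in for a model: always prefers (last token + 1) mod vocab_size."""
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if t == target else 0.0 for t in range(vocab_size)]
```

The returned list has the same layout as the documented return value: the prompt followed by the newly generated tokens.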
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.generate_dllm(model, prompt: torch.Tensor, gen_length: int = 128, block_length: int = 32, steps: int = 128, temperature: float = 0.0, remasking: str = 'low_confidence', mask_id: int = 100, threshold: float = None, shift_logits: bool = True, neg_entropy: bool = True)#
Block-diffusion generation with prefix KV cache.
Replicates generate_with_prefix_cache_block_diff_sbd from the original eval.py but uses NemotronLabsDiffusionAttention’s inference mode instead of HF past_key_values.
- Parameters:
model – Megatron GPTModel on CUDA.
prompt – [batch, prompt_len] token ids.
gen_length – total number of tokens to generate.
block_length – size of each denoising block.
steps – total denoising steps across all blocks.
temperature – sampling temperature for Gumbel noise.
remasking – remasking strategy (“low_confidence” or “random”).
mask_id – mask token id.
threshold – optional denoising confidence threshold.
shift_logits – if True, use Dream-style shifted logits: the logits at position i-1 predict token i (next-token prediction). If False, the logits at each masked position predict that position’s own token directly. For dLLM this should typically be False; for AR-style models use True. Note that the default is True.
neg_entropy – if True, use negative entropy for confidence scoring.
- Returns:
(x_accum, nfe), where x_accum is the [batch, prompt_len + gen_length] full sequence with generated tokens and nfe is the number of forward evaluations used.
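How gen_length, block_length, and steps might relate can be sketched as a block plan. The sketch assumes that gen_length is a multiple of block_length and that the total step budget is divided evenly among blocks, as is common in block-diffusion reference code; the source only states that steps counts denoising steps across all blocks, so treat the even split as an assumption. `plan_blocks` is a hypothetical helper.

```python
def plan_blocks(prompt_len, gen_length=128, block_length=32, steps=128):
    """Return (start, end, steps_for_block) for each denoising block (illustrative)."""
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    if steps % num_blocks != 0:
        raise ValueError("steps must be a multiple of the number of blocks")
    steps_per_block = steps // num_blocks  # assumed even split of the step budget
    return [
        (prompt_len + b * block_length, prompt_len + (b + 1) * block_length, steps_per_block)
        for b in range(num_blocks)
    ]
```

With the documented defaults and a 10-token prompt, this yields four blocks of 32 positions each, ending at position 138 (prompt_len + gen_length).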
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.tp_send_stop()#
Tell TP followers to exit their loop.
- bridge.diffusion.models.nemotron_labs_diffusion.inference_nemotron_labs_diffusion.tp_follower_loop(model)#
Blocking loop for TP-non-rank-0 processes.
Waits for commands from the TP-rank-0 process and mirrors all model operations (set_inference_mode, set_inference_params, clear_kv_cache, model forward) so Megatron TP communication stays in sync.
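The command-mirroring pattern can be sketched as a dispatch loop. The command values match the `_CMD_*` constants listed above; everything else (`follower_loop`, the `receive` callable, the `handlers` table, and the in-memory queue that replaces the TP broadcast) is a hypothetical stand-in so the sketch runs without a distributed setup.

```python
from collections import deque

# Values mirror the module's _CMD_* constants.
CMD_STOP, CMD_FORWARD, CMD_SET_INF_MODE_ON, CMD_SET_INF_MODE_OFF, CMD_SET_PARAMS, CMD_CLEAR_CACHE = range(6)


def follower_loop(receive, handlers):
    """Mirror leader commands until CMD_STOP arrives; return how many were handled."""
    handled = 0
    while True:
        cmd, extra = receive()  # blocks until the leader sends a command
        if cmd == CMD_STOP:
            return handled
        handlers[cmd](extra)
        handled += 1


# Demo: an in-memory queue stands in for the leader's broadcasts.
log = []
handlers = {
    CMD_FORWARD: lambda extra: log.append("forward"),
    CMD_SET_INF_MODE_ON: lambda extra: log.append("inference_mode=True"),
    CMD_SET_INF_MODE_OFF: lambda extra: log.append("inference_mode=False"),
    CMD_SET_PARAMS: lambda extra: log.append(f"params={extra}"),
    CMD_CLEAR_CACHE: lambda extra: log.append("clear_cache"),
}
inbox = deque([
    (CMD_SET_INF_MODE_ON, None),
    (CMD_SET_PARAMS, (True, True)),
    (CMD_FORWARD, None),
    (CMD_CLEAR_CACHE, None),
    (CMD_SET_INF_MODE_OFF, None),
    (CMD_STOP, None),
])
count = follower_loop(inbox.popleft, handlers)
```

Because every follower executes the same sequence of operations as the leader, collective calls inside the model forward pass line up across TP ranks, which is the property the docstring describes.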