`bridge.diffusion.common.dllm`#

Diffusion language model utilities: masking, block attention masks, and sampling.

The sampling primitives (add_gumbel_noise, get_num_transfer_tokens, get_transfer_index) implement the iterative-denoising step shared by every block-diffusion / masked-dLLM generation loop in this repo (NemotronLabsDiffusion, LLaDA1.5, …). They are model-agnostic: each model keeps its own generation loop with its own attention semantics (causal-with-KV-cache vs fully bidirectional) but calls these helpers to score confidence and choose which masked positions to unmask at each step.

Module Contents#

Functions#

`forward_process_simple_masking`	Uniform random masking for diffusion LM training.
`add_gumbel_noise`	Apply Gumbel noise to logits for stochastic sampling.
`get_num_transfer_tokens`	Compute how many masked tokens to unmask at each diffusion step.
`get_transfer_index`	Select which masked positions to unmask at one diffusion step.
`compute_block_mask`	Compute the sbd_block_diff attention mask.

API#

bridge.diffusion.common.dllm.forward_process_simple_masking( input_ids, mask_token_id, eps=0.001, loss_mask=None, generator=None, )#

Uniform random masking for diffusion LM training.

For each sequence in the batch, sample a masking ratio t ~ U(eps, 1) and independently mask each token with probability t.

Returns:

noisy_batch – input_ids with masked positions replaced by mask_token_id.
masked_indices – Boolean mask of shape (b, l).
p_mask – Per-token masking probability of shape (b, l).
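
The masking rule described above can be illustrated with a minimal plain-Python sketch. It is not the module's implementation: nested lists stand in for the (b, l) tensors, the name `simple_masking_sketch` and the `rng` argument are hypothetical, and the `loss_mask` and `generator` arguments are left out because their exact handling is not documented here.

```python
import random


def simple_masking_sketch(input_ids, mask_token_id, eps=0.001, rng=None):
    """Illustrative stand-in: nested lists play the role of (b, l) tensors."""
    rng = rng or random.Random()
    noisy_batch, masked_indices, p_mask = [], [], []
    for seq in input_ids:
        # One masking ratio per sequence, t ~ U(eps, 1).
        t = eps + (1.0 - eps) * rng.random()
        # Each token is masked independently with probability t.
        mask_row = [rng.random() < t for _ in seq]
        noisy_batch.append(
            [mask_token_id if masked else tok for tok, masked in zip(seq, mask_row)]
        )
        masked_indices.append(mask_row)
        # Every position in a sequence shares that sequence's ratio t.
        p_mask.append([t] * len(seq))
    return noisy_batch, masked_indices, p_mask


noisy, masked, p = simple_masking_sketch([[11, 12, 13], [21, 22, 23]], mask_token_id=99)
```

Because t is drawn once per sequence, p_mask is constant along each row in this sketch.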

bridge.diffusion.common.dllm.add_gumbel_noise( logits: torch.Tensor, temperature: float, ) → torch.Tensor#

Apply Gumbel noise to logits for stochastic sampling.

At temperature == 0 this is a no-op (returns logits unchanged), so an argmax over the result is plain greedy decoding.

Parameters:

logits – Unnormalized scores of shape [..., vocab_size].
temperature – Sampling temperature. 0 disables noise (greedy).

Returns:

Noised scores (float64 when noise is applied) whose argmax samples from the temperature-scaled distribution.
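
To show why an argmax over noised scores samples from the temperature-scaled distribution, here is a sketch of the standard Gumbel-max trick in plain Python. It is an assumption that the module uses an equivalent formulation; the exact expression inside `add_gumbel_noise` may differ, and the name `gumbel_noise_sketch` and the `rng` argument are hypothetical.

```python
import math
import random


def gumbel_noise_sketch(logits, temperature, rng=None):
    """Standard Gumbel-max trick on a plain list of scores (illustration only)."""
    if temperature == 0:
        # No noise: argmax over the result is plain greedy decoding.
        return list(logits)
    rng = rng or random.Random()
    noised = []
    for score in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        # argmax of (score / T + Gumbel noise) is a draw from softmax(logits / T).
        noised.append(score / temperature + gumbel)
    return noised


noised = gumbel_noise_sketch([0.5, 1.5, -0.2], temperature=1.0)
sampled_token = max(range(len(noised)), key=noised.__getitem__)
```

Python floats are double precision, which mirrors the float64 note above.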

bridge.diffusion.common.dllm.get_num_transfer_tokens( mask_index: torch.Tensor, steps: int, ) → torch.Tensor#

Compute how many masked tokens to unmask at each diffusion step.

Distributes the number of masked positions as evenly as possible across steps, giving the earlier steps the remainder.

Parameters:

mask_index – Boolean tensor [batch, seq_len] (True where masked).
steps – Number of denoising steps to spread the unmasking over.

Returns:

Int64 tensor [batch, steps] whose rows sum to each sequence’s mask count.
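
The even split with the remainder going to the earlier steps can be sketched as below. This is a plain-Python illustration under the assumption that each of the first (count mod steps) steps receives one extra token; nested lists stand in for tensors and `num_transfer_tokens_sketch` is a hypothetical name.

```python
def num_transfer_tokens_sketch(mask_index, steps):
    """mask_index: list of rows of booleans (True where masked)."""
    schedule = []
    for row in mask_index:
        count = sum(row)  # number of masked positions in this sequence
        base, remainder = divmod(count, steps)
        # The first `remainder` steps each unmask one extra token.
        schedule.append([base + 1 if step < remainder else base for step in range(steps)])
    return schedule


rows = [[True] * 10, [True] * 7 + [False] * 3]
print(num_transfer_tokens_sketch(rows, steps=4))
# [[3, 3, 2, 2], [2, 2, 2, 1]]
```

Each row sums to that sequence's mask count (10 and 7 here).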

bridge.diffusion.common.dllm.get_transfer_index( logits: torch.Tensor, temperature: float, remasking: str, mask_index: torch.Tensor, x: torch.Tensor, num_transfer_tokens: torch.Tensor, threshold: Optional[float] = None, neg_entropy: bool = False, ) → tuple[torch.Tensor, torch.Tensor]#

Select which masked positions to unmask at one diffusion step.

Samples candidate tokens (x0) from logits and, among currently masked positions, transfers the highest-confidence ones from mask to real token. Used identically by every block-diffusion generation loop in the repo regardless of attention semantics.

Parameters:

logits – Per-position scores [batch, seq_len, vocab_size].
temperature – Sampling temperature for Gumbel noise (0 = greedy).
remasking – Confidence source for ranking: "low_confidence" uses the softmax probability of the chosen token; "random" uses uniform noise.
mask_index – Boolean [batch, seq_len] marking still-masked positions.
x – Current token ids [batch, seq_len]; non-masked positions are kept.
num_transfer_tokens – Per-sequence count of tokens to unmask this step (a [batch] slice of the get_num_transfer_tokens output). Ignored when threshold is set.
threshold – If set, transfer every masked position whose confidence exceeds this value instead of a fixed count.
neg_entropy – If True, rank by negative entropy of the distribution instead of the chosen token’s probability.

Returns:

Tuple (x0, transfer_index) where x0 is the candidate token ids (non-masked positions unchanged) and transfer_index is a boolean mask of positions to commit this step.
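
The selection step can be illustrated for a single sequence with a reduced plain-Python sketch. It covers only greedy decoding (temperature 0) with "low_confidence" ranking, plus the optional threshold; the "random" and neg_entropy modes are omitted. The names `softmax` and `transfer_index_sketch` are hypothetical, and lists stand in for tensors.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def transfer_index_sketch(logits, mask_index, x, num_transfer, threshold=None):
    """One sequence, greedy candidates, confidence = probability of the chosen token."""
    x0, confidence = [], []
    for scores, masked, token in zip(logits, mask_index, x):
        probs = softmax(scores)
        best = max(range(len(probs)), key=probs.__getitem__)
        # Non-masked positions keep their token and can never be selected.
        x0.append(best if masked else token)
        confidence.append(probs[best] if masked else float("-inf"))
    if threshold is not None:
        # Threshold mode: commit every masked position above the threshold.
        transfer = [c > threshold for c in confidence]
    else:
        # Fixed-count mode: commit the num_transfer most confident masked positions.
        ranked = sorted(range(len(x)), key=lambda i: confidence[i], reverse=True)
        chosen = {i for i in ranked[:num_transfer] if mask_index[i]}
        transfer = [i in chosen for i in range(len(x))]
    return x0, transfer


logits = [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
x0, transfer = transfer_index_sketch(
    logits, mask_index=[False, True, True, True], x=[5, 9, 9, 9], num_transfer=2
)
print(x0, transfer)
# [5, 0, 1, 2] [False, True, False, True]
```

A generation loop would then write x0 into x wherever transfer is True and leave the remaining masked positions for later steps.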

bridge.diffusion.common.dllm.compute_block_mask(block_size, max_seq_length)#

Compute the sbd_block_diff attention mask.

The semi-block-diffusion mask is composed of three sub-masks over a doubled sequence [xt | x0] of length 2*max_seq_length:

Block Diagonal (M_BD): self-attention within noised blocks (xt only)
Offset Block-Causal (M_OBC): cross-attention from xt to past x0 blocks
Fully Causal (M_FC): fully causal attention within x0

Parameters:

block_size – Block size for block-based attention.
max_seq_length – Length of one half (xt or x0) of the sequence.

Returns:

BlockMask for use with flex_attention.
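
One way to read the three sub-masks is as a dense boolean grid over the doubled sequence, sketched below in plain Python. This is an interpretation of the descriptions above, not the module's code: the real function returns a BlockMask rather than nested lists, `block_mask_sketch` is a hypothetical name, and treating "past x0 blocks" as strictly earlier blocks is an assumption.

```python
def block_mask_sketch(block_size, max_seq_length):
    """Return mask[q][kv] == True where query q may attend to key kv.

    Indices 0..n-1 are the xt half, n..2n-1 are the x0 half (n = max_seq_length).
    """
    n = max_seq_length

    def allowed(q, kv):
        q_is_x0, kv_is_x0 = q >= n, kv >= n
        q_pos, kv_pos = q % n, kv % n  # position within its own half
        q_block, kv_block = q_pos // block_size, kv_pos // block_size
        # M_BD: xt attends to xt inside the same block.
        block_diagonal = (not q_is_x0) and (not kv_is_x0) and q_block == kv_block
        # M_OBC: xt attends to x0 in strictly earlier blocks (assumed).
        offset_block_causal = (not q_is_x0) and kv_is_x0 and q_block > kv_block
        # M_FC: x0 attends causally to x0, token by token.
        fully_causal = q_is_x0 and kv_is_x0 and q_pos >= kv_pos
        return block_diagonal or offset_block_causal or fully_causal

    return [[allowed(q, kv) for kv in range(2 * n)] for q in range(2 * n)]


mask = block_mask_sketch(block_size=2, max_seq_length=4)
for row in mask:
    print("".join("x" if cell else "." for cell in row))
```

In this reading, no x0 query ever attends to xt, so the top-right and bottom-left quadrants are the only places where the two halves interact.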
