nemo_automodel.components.datasets.dllm.collate


Collate function for dLLM training.

Expects datasets that produce the unshifted format (input_ids + loss_mask, via _package_tokenized_example(unshifted=True)). The collator converts variable-length sample lists to block-aligned tensors in a single pass.

Two-stage block-aligned padding layout::

[real tokens][EOS block-pad, loss=1][PAD global-pad, loss=0]
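The layout above can be illustrated for a single sample with plain Python lists. This is a minimal sketch, not the library's implementation: the helper name `two_stage_pad`, the token IDs, and the assumption that stage 1 rounds the content length up to the next `block_size` multiple (the absolute mode described later on this page) are all chosen for illustration.

```python
def two_stage_pad(tokens, loss_mask, block_size, target_len, eos_id, pad_id):
    """Hypothetical helper: pad one sample using the two-stage layout."""
    content_len = len(tokens)
    # Stage 1: round the content length up to the next block boundary.
    fill_end = -(-content_len // block_size) * block_size
    if fill_end > target_len:
        raise ValueError("target_len must cover the block-aligned content")
    n_eos = fill_end - content_len   # block padding, supervised (loss=1)
    n_pad = target_len - fill_end    # global padding, unsupervised (loss=0)
    ids = list(tokens) + [eos_id] * n_eos + [pad_id] * n_pad
    mask = list(loss_mask) + [1] * n_eos + [0] * n_pad
    return ids, mask


# Five real tokens, block_size=4, batch target length 12 (all values assumed).
ids, mask = two_stage_pad(
    [5, 6, 7, 8, 9], [0, 0, 1, 1, 1],
    block_size=4, target_len=12, eos_id=2, pad_id=0,
)
print(ids)   # [5, 6, 7, 8, 9, 2, 2, 2, 0, 0, 0, 0]
print(mask)  # [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

The three EOS tokens complete the second block and stay supervised, while the last four positions are plain padding that contributes no loss.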

Module Contents

Classes

| Name | Description |
| --- | --- |
| DLLMCollator | Collator for dLLM (diffusion LLM) training. |

API

class nemo_automodel.components.datasets.dllm.collate.DLLMCollator(
pad_token_id: int = 0,
eos_token_id: typing.Optional[int] = None,
block_size: typing.Optional[int] = None,
pad_seq_len_divisible: typing.Optional[int] = None,
max_seq_len: typing.Optional[int] = None,
supervise_padding: bool = False,
response_window: bool = False
)

Collator for dLLM (diffusion LLM) training.

Goes directly from variable-length sample dicts to block-aligned tensors in a single pass — no intermediate pad-to-max step.

Expects each sample to have input_ids, loss_mask, and attention_mask (as produced by _package_tokenized_example(unshifted=True)).

Parameters:

pad_token_id
int. Defaults to 0.

Token ID for global (stage-2) padding.

eos_token_id
Optional[int]. Defaults to None.

Token ID for block (stage-1) padding. Only used when block_size is set.

block_size
Optional[int]. Defaults to None.

If set, apply two-stage block-aligned padding.

pad_seq_len_divisible
Optional[int]. Defaults to None.

Round final length to lcm(block_size, pad_seq_len_divisible).

response_window
bool. Defaults to False.

gemma4 response-window mode. When True, the EOS block-fill is response-relative: it is aligned on the first supervised position (matching Google's ChunkResponseIntoCanvases), the fill is marked as attended (attention_mask=1), and a one-time single-turn guard rejects a multi-turn loss_mask. When False (the default, used for llada / nemotron full-sequence denoising), the fill is absolute: it is block-aligned on the content length and not attended, and no single-turn guard runs. This is the behavior that predates response-window mode.
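The rounding described for pad_seq_len_divisible can be sketched with the standard library. This is an assumption-laden illustration: the helper name `round_up_to_multiple` is hypothetical, and treating an unset (None) value as "no constraint" is a guess about how the collator combines the two settings.

```python
import math


def round_up_to_multiple(length, block_size=None, pad_seq_len_divisible=None):
    """Hypothetical helper: round length up to lcm of the values that are set."""
    multiple = 1
    for m in (block_size, pad_seq_len_divisible):
        if m:  # assumption: None means the constraint is ignored
            multiple = multiple * m // math.gcd(multiple, m)
    # Ceiling division, then scale back up to a whole multiple.
    return -(-length // multiple) * multiple


# lcm(32, 24) == 96, so a length of 37 is rounded up to 96.
rounded_both = round_up_to_multiple(37, block_size=32, pad_seq_len_divisible=24)
# With only block_size set, the same length rounds up to 64.
rounded_block_only = round_up_to_multiple(37, block_size=32)
print(rounded_both, rounded_block_only)  # 96 64
```

Using the least common multiple means the final length satisfies both divisibility requirements at once, at the cost of extra padding when the two values share few factors.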

block_pad_token_id
nemo_automodel.components.datasets.dllm.collate.DLLMCollator.__call__(
batch: typing.List[typing.Dict[str, list]]
) -> typing.Dict[str, torch.Tensor]
nemo_automodel.components.datasets.dllm.collate.DLLMCollator._block_fill_ends(
content_lengths: torch.Tensor,
prefix_lengths: torch.Tensor
) -> torch.Tensor

Per-sample end of the EOS block-fill, RESPONSE-RELATIVE.

The fill rounds the response length (measured from prefix, the response start) up to a block_size multiple, so fill_end - prefix is a whole number of canvas blocks. With prefix == 0 (no prompt, e.g. plain MDLM) this reduces to the old absolute rounding, so non-SFT paths are unchanged.
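The response-relative rounding described above can be sketched for a single sample with plain integers. The real method takes torch.Tensor arguments. The function name `block_fill_end` and the example lengths here are assumptions for illustration only.

```python
def block_fill_end(content_length, prefix_length, block_size):
    """Hypothetical scalar version of the response-relative fill end."""
    response_len = content_length - prefix_length
    # Round the response length up to a whole number of blocks.
    n_blocks = -(-response_len // block_size)
    return prefix_length + n_blocks * block_size


# Prompt of 5 tokens, response of 9 tokens, block_size 4:
# the response needs 3 blocks, so the fill ends at 5 + 12 = 17.
relative_end = block_fill_end(content_length=14, prefix_length=5, block_size=4)
# With no prompt (prefix == 0) this is plain absolute rounding: 14 -> 16.
absolute_end = block_fill_end(content_length=14, prefix_length=0, block_size=4)
print(relative_end, absolute_end)  # 17 16
```

In the first case `fill_end - prefix` is 12, a whole number of blocks, even though 17 itself is not a multiple of the block size.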

nemo_automodel.components.datasets.dllm.collate.DLLMCollator._compute_target_length(
fill_ends: torch.Tensor
) -> int
nemo_automodel.components.datasets.dllm.collate.DLLMCollator._first_supervised_index(
loss_mask: list,
default: int
) -> int
staticmethod

First index where loss_mask is truthy (the response start), else default (no supervised token -> treat the whole sample as prefix).
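A minimal sketch of this lookup, written as a standalone function rather than the library's static method (the name `first_supervised_index` without the leading underscore is used here only for illustration):

```python
def first_supervised_index(loss_mask, default):
    """Return the first index where loss_mask is truthy, else default."""
    for i, flag in enumerate(loss_mask):
        if flag:
            return i
    return default


# The response starts at index 2.
start = first_supervised_index([0, 0, 1, 1], default=4)
# No supervised token: fall back to the default (here, the sample length),
# which treats the whole sample as prefix.
fallback = first_supervised_index([0, 0, 0], default=3)
print(start, fallback)  # 2 3
```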

nemo_automodel.components.datasets.dllm.collate.DLLMCollator._pad_and_fill(
samples: typing.List[list],
content_lengths: torch.Tensor,
fill_ends: torch.Tensor,
target_len: int,
pad_value: int,
block_pad_value: int,
dtype: torch.dtype = torch.long
) -> torch.Tensor

Pad variable-length lists to target_len with two-stage fill.
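The two-stage fill can be sketched over a small batch using nested lists in place of tensors. This is an illustrative approximation under stated assumptions: the real method returns a torch.Tensor and accepts a dtype, and the function name `pad_and_fill` and the example values are hypothetical.

```python
def pad_and_fill(samples, content_lengths, fill_ends, target_len,
                 pad_value, block_pad_value):
    """Hypothetical list-based version: pad each sample to target_len."""
    out = []
    for sample, content_len, fill_end in zip(samples, content_lengths, fill_ends):
        row = list(sample[:content_len])
        # Stage 1: block padding from the end of content up to fill_end.
        row += [block_pad_value] * (fill_end - content_len)
        # Stage 2: global padding from fill_end up to target_len.
        row += [pad_value] * (target_len - fill_end)
        out.append(row)
    return out


input_ids = [[11, 12, 13], [21, 22, 23, 24, 25]]
content_lengths = [3, 5]
fill_ends = [4, 8]  # assumed block-aligned ends for block_size=4

padded_ids = pad_and_fill(input_ids, content_lengths, fill_ends, 8,
                          pad_value=0, block_pad_value=2)
# The same helper can pad the loss mask: 1 in the block fill, 0 elsewhere.
padded_mask = pad_and_fill([[0, 1, 1], [0, 0, 1, 1, 1]], content_lengths,
                           fill_ends, 8, pad_value=0, block_pad_value=1)
print(padded_ids)   # [[11, 12, 13, 2, 0, 0, 0, 0], [21, 22, 23, 24, 25, 2, 2, 2]]
print(padded_mask)  # [[0, 1, 1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 1, 1, 1, 1]]
```

Passing the pair of fill values as arguments lets one routine serve both the token IDs and the loss mask, which is presumably why the signature separates pad_value from block_pad_value.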