nemo_automodel.components.datasets.dllm.collate
Collate function for dLLM training.
Expects datasets that produce the unshifted format (input_ids +
loss_mask, via _package_tokenized_example(unshifted=True)).
Goes directly from variable-length sample lists to block-aligned tensors
in a single pass.
Two-stage block-aligned padding layout::
[real tokens][EOS block-pad, loss=1][PAD global-pad, loss=0]
Module Contents
Classes
API
Collator for dLLM (diffusion LLM) training.
Goes directly from variable-length sample dicts to block-aligned tensors in a single pass, with no intermediate pad-to-max step.
Expects each sample to have input_ids, loss_mask, and
attention_mask (as produced by
_package_tokenized_example(unshifted=True)).
Parameters:
Token ID for global (stage-2) padding.
Token ID for block (stage-1) padding. Only used when block_size is set.
If set, apply two-stage block-aligned padding.
Round the final length up to a multiple of
lcm(block_size, pad_seq_len_divisible).
gemma4 response-window mode. When True, the EOS
block-fill is RESPONSE-RELATIVE (aligned on the first supervised
position, matching Google's ChunkResponseIntoCanvases), the fill
is marked attended (attention_mask=1), and a one-time
single-turn guard rejects a multi-turn loss_mask. When False
(default; llada / nemotron full-sequence denoising), the fill is
ABSOLUTE (block-aligned on the content length) and not attended,
and no single-turn guard runs. This is the pre-response-window behavior.
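The length rounding described for block_size and pad_seq_len_divisible can be sketched as follows. This is a minimal illustration, not the collator's actual code: the helper name target_length is hypothetical, and it assumes the input is the longest block-filled sample length in the batch.

```python
import math


def target_length(max_len, block_size=None, pad_seq_len_divisible=None):
    """Round max_len up to a multiple of lcm(block_size, pad_seq_len_divisible).

    Hypothetical helper for illustration; unset (None) arguments are
    treated as "no constraint".
    """
    multiple = 1
    if block_size:
        multiple = math.lcm(multiple, block_size)
    if pad_seq_len_divisible:
        multiple = math.lcm(multiple, pad_seq_len_divisible)
    # Ceiling division, then scale back up to the next multiple.
    return -(-max_len // multiple) * multiple


# lcm(32, 24) == 96, so a batch whose longest sample is 70 tokens pads to 96.
print(target_length(70, block_size=32, pad_seq_len_divisible=24))
```

Using the least common multiple keeps the final length aligned to both constraints at once, so the global padding never ends in a partial block.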
Per-sample end of the EOS block-fill, RESPONSE-RELATIVE.
The fill rounds the response length (measured from prefix, the response
start) up to a block_size multiple, so fill_end - prefix is a whole
number of canvas blocks. With prefix == 0 (no prompt, e.g. plain MDLM)
this reduces to the old absolute rounding, so non-SFT paths are unchanged.
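The response-relative rounding above can be sketched in a few lines. The function name and signature here are assumptions made for illustration; only the arithmetic follows the description.

```python
def response_relative_fill_end(content_len, prefix, block_size):
    """End index of the EOS block-fill for one sample.

    Hypothetical helper: rounds the response length (content_len - prefix)
    up to a multiple of block_size, measured from the response start.
    """
    response_len = content_len - prefix
    n_blocks = -(-response_len // block_size)  # ceiling division
    return prefix + n_blocks * block_size


# 3 prompt tokens + 7 response tokens, block_size 4:
# the response rounds up to 8, so the fill ends at 3 + 8 = 11.
print(response_relative_fill_end(10, prefix=3, block_size=4))
# With prefix == 0 the same call is plain absolute rounding: 10 -> 12.
print(response_relative_fill_end(10, prefix=0, block_size=4))
```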
First index where loss_mask is truthy (the response start), else
default (no supervised token -> treat the whole sample as prefix).
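A minimal sketch of this lookup, assuming loss_mask is a plain Python list of 0/1 values; the function name is hypothetical, and passing the sample length as default is one way to treat the whole sample as prefix.

```python
def first_supervised_index(loss_mask, default):
    """Return the first index where loss_mask is truthy, else default."""
    for i, m in enumerate(loss_mask):
        if m:
            return i
    return default


mask = [0, 0, 1, 1, 1]
print(first_supervised_index(mask, default=len(mask)))  # response starts at 2

no_response = [0, 0, 0]
# No supervised token: the default makes the prefix cover the whole sample.
print(first_supervised_index(no_response, default=len(no_response)))
```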
Pad variable-length lists to target_len with two-stage fill.
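The two-stage fill can be illustrated with plain lists. This is a sketch under stated assumptions, not the method's real signature: the function name, the fill_ends argument (per-sample end of the stage-1 fill), and the example token IDs are all hypothetical.

```python
def pad_two_stage(rows, fill_ends, target_len, block_fill, global_fill):
    """Pad each row as [row][block_fill up to fill_end][global_fill up to target_len].

    Hypothetical helper; requires len(row) <= fill_end <= target_len.
    """
    padded = []
    for row, end in zip(rows, fill_ends):
        if not len(row) <= end <= target_len:
            raise ValueError("need len(row) <= fill_end <= target_len")
        stage1 = [block_fill] * (end - len(row))   # block (stage-1) padding
        stage2 = [global_fill] * (target_len - end)  # global (stage-2) padding
        padded.append(list(row) + stage1 + stage2)
    return padded


# Assumed IDs for the example: EOS = 2, PAD = 0; block_size 4, target_len 8.
input_ids = pad_two_stage([[5, 6, 7], [8]], [4, 4], 8, block_fill=2, global_fill=0)
# The loss mask uses the same layout: block fill is supervised (1), global is not (0).
loss_mask = pad_two_stage([[1, 1, 1], [1]], [4, 4], 8, block_fill=1, global_fill=0)
print(input_ids)
print(loss_mask)
```

The same helper serves both input_ids and loss_mask because only the fill values differ between the two, which matches the layout [real tokens][EOS block-pad, loss=1][PAD global-pad, loss=0].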