nemo_automodel.components.datasets.dllm.collate#

Collate function for dLLM training.

Expects datasets that produce the unshifted format (input_ids + loss_mask, as produced by _package_tokenized_example(unshifted=True)). Goes directly from variable-length sample lists to block-aligned tensors in a single pass.

Two-stage block-aligned padding layout:

```text
[real tokens][EOS block-pad, loss=1][PAD global-pad, loss=0]
```
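For illustration, assuming eos_token_id=2, pad_token_id=0, and block_size=4 (values invented for this sketch), a 6-token sample padded to a final batch length of 12 would be laid out as:

```python
EOS, PAD = 2, 0  # assumed token IDs, for illustration only

# Stage 1 pads the 6 real tokens to the next block boundary (8) with EOS,
# which stays supervised (loss_mask=1); stage 2 pads to the final batch
# length (12) with PAD, which is not supervised (loss_mask=0).
input_ids = [11, 12, 13, 14, 15, 16, EOS, EOS, PAD, PAD, PAD, PAD]
loss_mask = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```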

Module Contents#

Classes#

DLLMCollator

Collator for dLLM (diffusion LLM) training.

API#

class nemo_automodel.components.datasets.dllm.collate.DLLMCollator(
    pad_token_id: int = 0,
    eos_token_id: Optional[int] = None,
    block_size: Optional[int] = None,
    pad_seq_len_divisible: Optional[int] = None,
    max_seq_len: Optional[int] = None,
    supervise_padding: bool = False,
)#

Collator for dLLM (diffusion LLM) training.

Goes directly from variable-length sample dicts to block-aligned tensors in a single pass — no intermediate pad-to-max step.

Expects each sample to have input_ids, loss_mask, and attention_mask (as produced by _package_tokenized_example(unshifted=True)).

Parameters:
  • pad_token_id – Token ID for global (stage-2) padding.

  • eos_token_id – Token ID for block (stage-1) padding. Only used when block_size is set.

  • block_size – If set, apply two-stage block-aligned padding.

  • pad_seq_len_divisible – If set, round the final length up to a multiple of lcm(block_size, pad_seq_len_divisible).

  • max_seq_len – If set, cap on the final padded sequence length.

  • supervise_padding – If True, padding tokens are included in the loss (loss_mask=1).

Initialization
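A minimal usage sketch; the sample dicts and token IDs below are invented, but follow the unshifted format the collator expects:

```python
from nemo_automodel.components.datasets.dllm.collate import DLLMCollator

# Two hypothetical unshifted samples of different lengths.
batch = [
    {"input_ids": [5, 6, 7, 8, 9],
     "loss_mask": [0, 0, 1, 1, 1],
     "attention_mask": [1, 1, 1, 1, 1]},
    {"input_ids": [5, 6, 7],
     "loss_mask": [0, 1, 1],
     "attention_mask": [1, 1, 1]},
]

collator = DLLMCollator(pad_token_id=0, eos_token_id=2, block_size=4)
out = collator(batch)
# Expect a dict of torch.Tensor; with block_size=4 the longest sample
# (5 tokens) rounds up to 8, so the tensors should be shaped [2, 8].
```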

__call__(
    batch: List[Dict[str, list]],
) → Dict[str, torch.Tensor]#

Collate a list of sample dicts into a single dict of block-aligned tensors.
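Because instances are callable on a list of samples, a collator can be passed straight to a DataLoader as collate_fn. A sketch, with an invented in-memory dataset:

```python
from torch.utils.data import DataLoader

from nemo_automodel.components.datasets.dllm.collate import DLLMCollator

# Hypothetical dataset: any map-style sequence of unshifted sample dicts.
train_dataset = [
    {"input_ids": [5, 6, 7], "loss_mask": [0, 1, 1], "attention_mask": [1, 1, 1]},
    {"input_ids": [5, 6, 7, 8], "loss_mask": [0, 1, 1, 1], "attention_mask": [1, 1, 1, 1]},
]

loader = DataLoader(
    train_dataset,
    batch_size=2,
    collate_fn=DLLMCollator(pad_token_id=0, eos_token_id=2, block_size=4),
)
batch = next(iter(loader))  # dict of block-aligned tensors
```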
_compute_target_length(content_lengths: torch.Tensor) → int#

Compute the final padded sequence length for the batch from the per-sample content lengths.
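A plausible sketch of this computation based on the parameter descriptions above (round the longest content length up to a multiple of lcm(block_size, pad_seq_len_divisible)); the actual method may differ, and max_seq_len handling is omitted:

```python
import math

import torch

def compute_target_length_sketch(
    content_lengths: torch.Tensor,
    block_size: int,
    pad_seq_len_divisible: int,
) -> int:
    # Round the longest sample up to a multiple of
    # lcm(block_size, pad_seq_len_divisible).
    multiple = math.lcm(block_size, pad_seq_len_divisible)
    longest = int(content_lengths.max())
    return ((longest + multiple - 1) // multiple) * multiple

compute_target_length_sketch(torch.tensor([5, 3]), 4, 8)  # -> 8
```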
_pad_and_fill(
    samples: List[list],
    content_lengths: torch.Tensor,
    target_len: int,
    pad_value: int,
    block_pad_value: int,
    apply_block_fill: bool = True,
    dtype: torch.dtype = torch.long,
) → torch.Tensor#

Pad variable-length lists to target_len with a two-stage fill (see the sketch after the list below).

For each sample:

  • [0, content_length) → original content

  • [content_length, block_aligned) → block_pad_value, where block_aligned is content_length rounded up to the next block_size boundary

  • [block_aligned, target_len) → pad_value
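A minimal sketch of that three-region fill, with the block boundary passed in explicitly for illustration (the real method takes it from the collator's configuration):

```python
from typing import List

import torch

def pad_and_fill_sketch(
    samples: List[list],
    content_lengths: torch.Tensor,
    target_len: int,
    pad_value: int,
    block_pad_value: int,
    block_size: int,  # assumption: the real method reads this from self
    apply_block_fill: bool = True,
) -> torch.Tensor:
    # Region 3 first: start with everything set to global padding.
    out = torch.full((len(samples), target_len), pad_value, dtype=torch.long)
    for i, sample in enumerate(samples):
        n = int(content_lengths[i])
        out[i, :n] = torch.tensor(sample[:n], dtype=torch.long)  # region 1: content
        if apply_block_fill:
            # Region 2: fill up to the next block boundary (capped at target_len).
            block_aligned = min(-(-n // block_size) * block_size, target_len)
            out[i, n:block_aligned] = block_pad_value
    return out

pad_and_fill_sketch([[11, 12, 13, 14, 15]], torch.tensor([5]), 8, 0, 2, 4)
# -> tensor([[11, 12, 13, 14, 15,  2,  2,  2]])
```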