nemo_automodel.components.datasets.dllm.collate#
Collate function for dLLM training.
Expects datasets that produce the unshifted format (`input_ids` +
`loss_mask`, via `_package_tokenized_example(unshifted=True)`).
Goes directly from variable-length sample lists to block-aligned tensors
in a single pass.
Two-stage block-aligned padding layout:

```
[real tokens][EOS block-pad, loss=1][PAD global-pad, loss=0]
```
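The layout above can be illustrated with a minimal, self-contained sketch (not the library code; the token IDs, `block_size`, and `target_len` here are made-up values, and the real tokens are assumed fully supervised):

```python
# Illustrate the two-stage layout for one sample: real tokens, then EOS
# padding up to the block boundary (loss_mask=1), then PAD padding up to
# the global target length (loss_mask=0).
EOS, PAD = 2, 0
block_size = 4
real = [11, 12, 13, 14, 15, 16]   # 6 real tokens
target_len = 12                   # global padded length

# Stage 1: pad to the next multiple of block_size with EOS (supervised).
block_aligned = -(-len(real) // block_size) * block_size   # ceil-div -> 8
ids = real + [EOS] * (block_aligned - len(real))
mask = [1] * block_aligned

# Stage 2: pad to target_len with PAD (not supervised).
ids += [PAD] * (target_len - block_aligned)
mask += [0] * (target_len - block_aligned)

print(ids)   # [11, 12, 13, 14, 15, 16, 2, 2, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

Supervising the EOS block-pads (loss=1) teaches the model where sequences end, while the global pads are masked out entirely.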
Module Contents#
Classes#
| `DLLMCollator` | Collator for dLLM (diffusion LLM) training. |
API#
- class nemo_automodel.components.datasets.dllm.collate.DLLMCollator(
- pad_token_id: int = 0,
- eos_token_id: Optional[int] = None,
- block_size: Optional[int] = None,
- pad_seq_len_divisible: Optional[int] = None,
- max_seq_len: Optional[int] = None,
- supervise_padding: bool = False,
- )#
Collator for dLLM (diffusion LLM) training.
Goes directly from variable-length sample dicts to block-aligned tensors in a single pass — no intermediate pad-to-max step.
Expects each sample to have `input_ids`, `loss_mask`, and `attention_mask` (as produced by `_package_tokenized_example(unshifted=True)`).

- Parameters:
pad_token_id – Token ID for global (stage-2) padding.
eos_token_id – Token ID for block (stage-1) padding. Only used when block_size is set.
block_size – If set, apply two-stage block-aligned padding.
pad_seq_len_divisible – Round the final length up to a multiple of `lcm(block_size, pad_seq_len_divisible)`.
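The interplay of `block_size` and `pad_seq_len_divisible` can be sketched as follows (an illustrative re-implementation, not the library's `_compute_target_length`; the function name and `None` handling here are assumptions, and `math.lcm` requires Python 3.9+):

```python
import math
from typing import Optional


def target_length(max_content_len: int,
                  block_size: Optional[int],
                  pad_seq_len_divisible: Optional[int]) -> int:
    """Round the longest content length in the batch up to a multiple of
    lcm(block_size, pad_seq_len_divisible) (hypothetical sketch)."""
    divisor = 1
    if block_size:
        divisor = block_size
    if pad_seq_len_divisible:
        divisor = math.lcm(divisor, pad_seq_len_divisible)
    # Ceiling division, then scale back up to the nearest multiple.
    return -(-max_content_len // divisor) * divisor


print(target_length(10, block_size=4, pad_seq_len_divisible=8))  # 16
```

With `block_size=4` and `pad_seq_len_divisible=8`, `lcm(4, 8) == 8`, so a batch whose longest sample has 10 content tokens is padded out to 16.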
Initialization
- __call__(
- batch: List[Dict[str, list]],
- )#
- _compute_target_length(content_lengths: torch.Tensor) -> int#
- _pad_and_fill(
- samples: List[list],
- content_lengths: torch.Tensor,
- target_len: int,
- pad_value: int,
- block_pad_value: int,
- apply_block_fill: bool = True,
- dtype: torch.dtype = torch.long,
- )#
Pad variable-length lists to `target_len` with a two-stage fill.

For each sample:

- `[0, content_length)` → original content
- `[content_length, block_aligned)` → `block_pad_value`
- `[block_aligned, target_len)` → `pad_value`
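The fill pattern above can be sketched in plain Python (the actual method fills `torch` tensors; the function name and positional argument order here are illustrative assumptions):

```python
from typing import List


def pad_and_fill(sample: List[int], content_length: int, target_len: int,
                 pad_value: int, block_pad_value: int,
                 block_size: int) -> List[int]:
    # [0, content_length)             -> original content
    # [content_length, block_aligned) -> block_pad_value
    # [block_aligned, target_len)     -> pad_value
    block_aligned = -(-content_length // block_size) * block_size
    out = sample[:content_length]
    out += [block_pad_value] * (block_aligned - content_length)
    out += [pad_value] * (target_len - block_aligned)
    return out


print(pad_and_fill([5, 6, 7], 3, 8, pad_value=0, block_pad_value=2,
                   block_size=4))
# [5, 6, 7, 2, 0, 0, 0, 0]
```

The same routine serves both `input_ids` (with `block_pad_value=eos_token_id`) and `loss_mask` (where the block-pad region gets 1 and the global-pad region gets 0).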