nemo_automodel.components.datasets.llm.formatting_utils
nemo_automodel.components.datasets.llm.formatting_utils
Module Contents
Functions
Data
API
Add pad token to tokenizer if not present.
Build a fallback loss mask that supervises every assistant turn.
Each assistant span is located by tokenizing the conversation prefixes
before and after the turn, which is O(turns) apply_chat_template calls.
Two reductions keep that from re-doing work:
full_lengthis the caller’s already-known unpadded token count for the whole conversation (sum(attention_mask)). When the dialogue ends on an assistant turn its closing boundary is the full conversation, so passingfull_lengthskips re-tokenizing the entire prefix — the single most expensive call in the loop.- Prefix lengths are memoized so a boundary shared by adjacent turns (a turn’s end and the next turn’s start) is tokenized at most once.
Both are exact: full_length and the memoized values equal what
:func:_tokenized_chat_length would return, so the mask is unchanged.
Build a token mask for reasoning_content spans inside assistant turns.
Locate the contiguous token span attributable to reasoning content.
Boolean mask identifying right-trailing padding positions.
When pad_token_id != eos_token_id, it is simply sequence == pad_token_id.
When the two IDs collide, a plain equality check would also match real EOS tokens inside the content. In that case the function locates the trailing contiguous run of the shared token and treats all positions after the first one in that run as padding. The first token in the trailing run is the real EOS and is kept unmasked so the model still learns to predict end-of-sequence.
Parameters:
1-D token id tensor.
The token id used for padding.
The token id used for end-of-sequence. When equal to pad_token_id the positional trailing-run logic is used.
Returns: torch.Tensor
Boolean tensor (same shape as sequence) where True = padding.
Check if the tokenizer supports a chat template.
Parameters:
The tokenizer to check.
Returns: bool
True if the tokenizer supports a chat template, False otherwise.
Restrict supervision to the final assistant turn (mask_history).
Operates on any per-token sequence where ignore_index marks
unsupervised positions: a label list (ignore_index=-100) or a 0/1
assistant mask (ignore_index=0). Each assistant turn renders as a
single contiguous supervised span, so this keeps only the last such run
and rewrites every earlier supervised position to ignore_index.
Apply this to the assistant mask before any reasoning_content holes are punched into it; running it on already-holed labels would treat the reasoning gap as a turn boundary and drop in-turn content before the hole.
Parameters:
per-token labels or 0/1 mask (ignore_index marks unsupervised).
the value marking unsupervised positions.
Returns: List[int]
The same list, mutated so only the final supervised run is kept.
Return a copy of a message with reasoning_content removed.
Shift a token-level mask right when the tokenizer uses left padding.
_build_multiturn_assistant_mask and _build_reasoning_mask compute
span indices from unpadded (left-aligned) tokenizations. When the
tokenizer pads on the left, actual content is right-aligned in
input_ids, so the mask must be shifted right by the padding offset to
keep positions aligned.
For right-padding tokenizers (the majority) this is a no-op.
Package a tokenized example with proper masking and padding.
Returns:
A dictionary with input_ids, labels, and attention_mask.
When unshifted is True, labels is replaced by loss_mask.
Parameters:
The tokenizer to use.
The tokenized input ids.
Boolean mask indicating which tokens are assistant/answer tokens (1) vs prompt tokens (0).
The end-of-sequence token id.
The padding token id.
Optional sequence length for padding.
Optional truncation strategy.
Optional padding strategy.
If True, return unshifted format for dLLM training
(input_ids at full length with loss_mask instead of
shifted input_ids/labels).
Pad a sample to a specific sequence length.
Resolve a chat template string that may be a file path.
If chat_template points to an existing file, its contents are returned.
If opening it as a file fails and the string contains Jinja-like characters
({, }, or newlines) it is treated as a literal template. Otherwise
a :class:ValueError is raised so the caller knows the path was invalid.
Parameters:
A Jinja template string or path to a template file.
Returns: Optional[str]
The resolved template string, or None when the input is None.
Tokenize chat messages without padding and return input ids.
Return the tokenized chat length for a message prefix without padding.
Format a chat template style example.
Parameters:
The tokenizer to use.
The formatted text, with role tags embedded in the content.
The end-of-sequence token id.
The padding token id.
Optional sequence length for padding.
Optional list of tool definitions for function calling.
Whether to compute the loss mask only on the answer tokens.
Whether to exclude rendered reasoning_content tokens from loss.
Whether to supervise only the final assistant turn,
masking every earlier assistant turn (mask_history). Applied to the
assistant mask before reasoning_content is masked out.
Returns: Dict[str, List[int]]
A dictionary with the formatted example.
Format a prompt-completion style example (without chat template).
Parameters:
The tokenizer to use.
The prompt string (e.g. context + question).
The answer string.
The end-of-sequence token id.
The padding token id.
Optional sequence length for padding.
Returns: Dict[str, List[int]]
A dictionary with the formatted example.