nemo_automodel.components.datasets.llm.formatting_utils

View as Markdown

Module Contents

Functions

NameDescription
_add_pad_tokenAdd pad token to tokenizer if not present.
_build_multiturn_assistant_maskBuild a fallback loss mask that supervises every assistant turn.
_build_reasoning_maskBuild a token mask for reasoning_content spans inside assistant turns.
_find_reasoning_spanLocate the contiguous token span attributable to reasoning content.
_get_right_trailing_pad_maskBoolean mask identifying right-trailing padding positions.
_has_chat_templateCheck if the tokenizer supports a chat template.
_mask_labels_to_last_turnRestrict supervision to the final assistant turn (mask_history).
_masked_reasoning_messageReturn a copy of a message with reasoning_content removed.
_maybe_shift_mask_for_left_paddingShift a token-level mask right when the tokenizer uses left padding.
_package_tokenized_examplePackage a tokenized example with proper masking and padding.
_pad_to_seq_lengthPad a sample to a specific sequence length.
_resolve_chat_templateResolve a chat template string that may be a file path.
_tokenize_chatTokenize chat messages without padding and return input ids.
_tokenized_chat_lengthReturn the tokenized chat length for a message prefix without padding.
format_chat_templateFormat a chat template style example.
format_prompt_completionFormat a prompt-completion style example (without chat template).

Data

GENERATION_REGEX

_warned_add_pad_token

logger

API

nemo_automodel.components.datasets.llm.formatting_utils._add_pad_token(
tokenizer
)

Add pad token to tokenizer if not present.

nemo_automodel.components.datasets.llm.formatting_utils._build_multiturn_assistant_mask(
tokenizer: transformers.PreTrainedTokenizer,
formatted_text: typing.List[typing.Dict[str, typing.Any]],
input_ids: typing.List[int],
tools: typing.Optional[typing.List[typing.Dict]] = None,
truncation: typing.Union[str, bool] = 'do_not_truncate',
seq_length: typing.Optional[int] = None,
full_length: typing.Optional[int] = None
) -> typing.List[int]

Build a fallback loss mask that supervises every assistant turn.

Each assistant span is located by tokenizing the conversation prefixes before and after the turn, which is O(turns) apply_chat_template calls. Two reductions keep that from re-doing work:

  • full_length is the caller’s already-known unpadded token count for the whole conversation (sum(attention_mask)). When the dialogue ends on an assistant turn its closing boundary is the full conversation, so passing full_length skips re-tokenizing the entire prefix — the single most expensive call in the loop.
  • Prefix lengths are memoized so a boundary shared by adjacent turns (a turn’s end and the next turn’s start) is tokenized at most once.

Both are exact: full_length and the memoized values equal what :func:_tokenized_chat_length would return, so the mask is unchanged.

nemo_automodel.components.datasets.llm.formatting_utils._build_reasoning_mask(
tokenizer: transformers.PreTrainedTokenizer,
formatted_text: typing.List[typing.Dict[str, typing.Any]],
input_ids: typing.List[int],
tools: typing.Optional[typing.List[typing.Dict]] = None,
truncation: typing.Union[str, bool] = 'do_not_truncate',
seq_length: typing.Optional[int] = None
) -> typing.List[int]

Build a token mask for reasoning_content spans inside assistant turns.

nemo_automodel.components.datasets.llm.formatting_utils._find_reasoning_span(
full_segment: typing.List[int],
masked_segment: typing.List[int]
) -> typing.Optional[tuple[int, int]]

Locate the contiguous token span attributable to reasoning content.

nemo_automodel.components.datasets.llm.formatting_utils._get_right_trailing_pad_mask(
sequence: torch.Tensor,
pad_token_id: int,
eos_token_id: int
) -> torch.Tensor

Boolean mask identifying right-trailing padding positions.

When pad_token_id != eos_token_id, it is simply sequence == pad_token_id.

When the two IDs collide, a plain equality check would also match real EOS tokens inside the content. In that case the function locates the trailing contiguous run of the shared token and treats all positions after the first one in that run as padding. The first token in the trailing run is the real EOS and is kept unmasked so the model still learns to predict end-of-sequence.

Parameters:

sequence
torch.Tensor

1-D token id tensor.

pad_token_id
int

The token id used for padding.

eos_token_id
int

The token id used for end-of-sequence. When equal to pad_token_id the positional trailing-run logic is used.

Returns: torch.Tensor

Boolean tensor (same shape as sequence) where True = padding.

nemo_automodel.components.datasets.llm.formatting_utils._has_chat_template(
tokenizer: transformers.PreTrainedTokenizer
) -> bool

Check if the tokenizer supports a chat template.

Parameters:

tokenizer
PreTrainedTokenizer

The tokenizer to check.

Returns: bool

True if the tokenizer supports a chat template, False otherwise.

nemo_automodel.components.datasets.llm.formatting_utils._mask_labels_to_last_turn(
mask: typing.List[int],
ignore_index: int = -100
) -> typing.List[int]

Restrict supervision to the final assistant turn (mask_history).

Operates on any per-token sequence where ignore_index marks unsupervised positions: a label list (ignore_index=-100) or a 0/1 assistant mask (ignore_index=0). Each assistant turn renders as a single contiguous supervised span, so this keeps only the last such run and rewrites every earlier supervised position to ignore_index.

Apply this to the assistant mask before any reasoning_content holes are punched into it; running it on already-holed labels would treat the reasoning gap as a turn boundary and drop in-turn content before the hole.

Parameters:

mask
List[int]

per-token labels or 0/1 mask (ignore_index marks unsupervised).

ignore_index
intDefaults to -100

the value marking unsupervised positions.

Returns: List[int]

The same list, mutated so only the final supervised run is kept.

nemo_automodel.components.datasets.llm.formatting_utils._masked_reasoning_message(
message: typing.Dict[str, typing.Any]
) -> typing.Dict[str, typing.Any]

Return a copy of a message with reasoning_content removed.

nemo_automodel.components.datasets.llm.formatting_utils._maybe_shift_mask_for_left_padding(
mask: typing.List[int],
tokenizer: transformers.PreTrainedTokenizer,
attention_mask: typing.Optional[typing.List[int]]
) -> typing.List[int]

Shift a token-level mask right when the tokenizer uses left padding.

_build_multiturn_assistant_mask and _build_reasoning_mask compute span indices from unpadded (left-aligned) tokenizations. When the tokenizer pads on the left, actual content is right-aligned in input_ids, so the mask must be shifted right by the padding offset to keep positions aligned.

For right-padding tokenizers (the majority) this is a no-op.

nemo_automodel.components.datasets.llm.formatting_utils._package_tokenized_example(
tokenizer,
input_ids,
assistant_masks,
eos_token_id,
pad_token_id,
seq_length,
truncation = 'do_not_truncate',
padding = 'do_not_pad',
unshifted = False
)

Package a tokenized example with proper masking and padding.

Returns: A dictionary with input_ids, labels, and attention_mask. When unshifted is True, labels is replaced by loss_mask.

Parameters:

tokenizer

The tokenizer to use.

input_ids

The tokenized input ids.

assistant_masks

Boolean mask indicating which tokens are assistant/answer tokens (1) vs prompt tokens (0).

eos_token_id

The end-of-sequence token id.

pad_token_id

The padding token id.

seq_length

Optional sequence length for padding.

truncation
Defaults to 'do_not_truncate'

Optional truncation strategy.

padding
Defaults to 'do_not_pad'

Optional padding strategy.

unshifted
Defaults to False

If True, return unshifted format for dLLM training (input_ids at full length with loss_mask instead of shifted input_ids/labels).

nemo_automodel.components.datasets.llm.formatting_utils._pad_to_seq_length(
sample,
pad_token_id,
seq_length
)

Pad a sample to a specific sequence length.

nemo_automodel.components.datasets.llm.formatting_utils._resolve_chat_template(
chat_template: typing.Optional[str]
) -> typing.Optional[str]

Resolve a chat template string that may be a file path.

If chat_template points to an existing file, its contents are returned. If opening it as a file fails and the string contains Jinja-like characters ({, }, or newlines) it is treated as a literal template. Otherwise a :class:ValueError is raised so the caller knows the path was invalid.

Parameters:

chat_template
Optional[str]

A Jinja template string or path to a template file.

Returns: Optional[str]

The resolved template string, or None when the input is None.

nemo_automodel.components.datasets.llm.formatting_utils._tokenize_chat(
tokenizer: transformers.PreTrainedTokenizer,
messages: typing.List[typing.Dict[str, typing.Any]],
tools: typing.Optional[typing.List[typing.Dict]] = None,
truncation: typing.Union[str, bool] = 'do_not_truncate',
seq_length: typing.Optional[int] = None
) -> typing.List[int]

Tokenize chat messages without padding and return input ids.

nemo_automodel.components.datasets.llm.formatting_utils._tokenized_chat_length(
tokenizer: transformers.PreTrainedTokenizer,
messages: typing.List[typing.Dict[str, str]],
tools: typing.Optional[typing.List[typing.Dict]] = None,
truncation: typing.Union[str, bool] = 'do_not_truncate',
seq_length: typing.Optional[int] = None
) -> int

Return the tokenized chat length for a message prefix without padding.

nemo_automodel.components.datasets.llm.formatting_utils.format_chat_template(
tokenizer: transformers.PreTrainedTokenizer,
formatted_text: typing.List[typing.Dict[str, typing.Any]],
eos_token_id: int,
pad_token_id: int,
seq_length: typing.Optional[int] = None,
padding: typing.Union[str, bool] = 'do_not_pad',
truncation: typing.Union[str, bool] = 'do_not_truncate',
tools: typing.Optional[typing.List[typing.Dict]] = None,
answer_only_loss_mask: bool = True,
mask_reasoning_content: bool = False,
train_on_last_turn_only: bool = False,
unshifted: bool = False
) -> typing.Dict[str, typing.List[int]]

Format a chat template style example.

Parameters:

tokenizer
PreTrainedTokenizer

The tokenizer to use.

formatted_text
List[Dict[str, Any]]

The formatted text, with role tags embedded in the content.

eos_token_id
int

The end-of-sequence token id.

pad_token_id
int

The padding token id.

seq_length
Optional[int]Defaults to None

Optional sequence length for padding.

tools
Optional[List[Dict]]Defaults to None

Optional list of tool definitions for function calling.

answer_only_loss_mask
boolDefaults to True

Whether to compute the loss mask only on the answer tokens.

mask_reasoning_content
boolDefaults to False

Whether to exclude rendered reasoning_content tokens from loss.

train_on_last_turn_only
boolDefaults to False

Whether to supervise only the final assistant turn, masking every earlier assistant turn (mask_history). Applied to the assistant mask before reasoning_content is masked out.

Returns: Dict[str, List[int]]

A dictionary with the formatted example.

nemo_automodel.components.datasets.llm.formatting_utils.format_prompt_completion(
tokenizer: transformers.PreTrainedTokenizer,
prompt: str,
answer: str,
eos_token_id: int,
pad_token_id: int,
seq_length: typing.Optional[int] = None,
padding: typing.Union[str, bool] = 'do_not_pad',
truncation: typing.Union[str, bool] = 'do_not_truncate',
answer_only_loss_mask: bool = True,
unshifted: bool = False
) -> typing.Dict[str, typing.List[int]]

Format a prompt-completion style example (without chat template).

Parameters:

tokenizer
PreTrainedTokenizer

The tokenizer to use.

prompt
str

The prompt string (e.g. context + question).

answer
str

The answer string.

eos_token_id
int

The end-of-sequence token id.

pad_token_id
int

The padding token id.

seq_length
Optional[int]Defaults to None

Optional sequence length for padding.

Returns: Dict[str, List[int]]

A dictionary with the formatted example.

nemo_automodel.components.datasets.llm.formatting_utils.GENERATION_REGEX = re.compile('\\{%-?\\s+generation\\s+-?%\\}')
nemo_automodel.components.datasets.llm.formatting_utils._warned_add_pad_token = set()
nemo_automodel.components.datasets.llm.formatting_utils.logger = logging.getLogger(__name__)