nemo_automodel.components.datasets.llm.formatting_utils#

Module Contents#

Functions#

_get_right_trailing_pad_mask

Boolean mask identifying right-trailing padding positions.

_pad_to_seq_length

Pad a sample to a specific sequence length.

_add_pad_token

Add pad token to tokenizer if not present.

_has_chat_template

Check if the tokenizer supports a chat template.

_package_tokenized_example

Package a tokenized example with proper masking and padding.

format_prompt_completion

Format a prompt-completion style example (without chat template).

format_chat_template

Format a chat template style example.

Data#

API#

nemo_automodel.components.datasets.llm.formatting_utils.logger#

'getLogger(…)'

nemo_automodel.components.datasets.llm.formatting_utils.GENERATION_REGEX#

'compile(…)'

nemo_automodel.components.datasets.llm.formatting_utils._get_right_trailing_pad_mask(
sequence: torch.Tensor,
pad_token_id: int,
eos_token_id: int,
) → torch.Tensor#

Boolean mask identifying right-trailing padding positions.

When pad_token_id != eos_token_id, it is simply sequence == pad_token_id.

When the two IDs collide, a plain equality check would also match real EOS tokens inside the content. In that case the function locates the trailing contiguous run of the shared token and treats all positions after the first one in that run as padding. The first token in the trailing run is the real EOS and is kept unmasked so the model still learns to predict end-of-sequence.

Parameters:
  • sequence – 1-D token id tensor.

  • pad_token_id – The token id used for padding.

  • eos_token_id – The token id used for end-of-sequence. When equal to pad_token_id the positional trailing-run logic is used.

Returns:

Boolean tensor (same shape as sequence) where True = padding.
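The collision handling described above amounts to a short backward scan for the trailing run of the shared token. A minimal sketch of that behavior (an illustrative re-implementation of the documented logic, not the library source):

```python
import torch

def right_trailing_pad_mask(sequence: torch.Tensor, pad_token_id: int, eos_token_id: int) -> torch.Tensor:
    # Easy case: distinct IDs, so equality alone identifies padding.
    if pad_token_id != eos_token_id:
        return sequence == pad_token_id
    # Collision case: walk back from the end to find the trailing
    # contiguous run of the shared pad/EOS token.
    mask = torch.zeros_like(sequence, dtype=torch.bool)
    start = sequence.numel()
    while start > 0 and sequence[start - 1].item() == pad_token_id:
        start -= 1
    # Keep the first token of the run (the real EOS) unmasked; everything
    # after it is treated as padding.
    if start < sequence.numel():
        mask[start + 1:] = True
    return mask
```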

nemo_automodel.components.datasets.llm.formatting_utils._pad_to_seq_length(sample, pad_token_id, seq_length)#

Pad a sample to a specific sequence length.
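No parameter details are given here; a plausible sketch, assuming the sample is a dict with input_ids, labels, and attention_mask lists (the actual field set is not documented on this page):

```python
def pad_to_seq_length(sample: dict, pad_token_id: int, seq_length: int) -> dict:
    # Hypothetical sketch: right-pad every field out to seq_length.
    pad_len = seq_length - len(sample["input_ids"])
    if pad_len > 0:
        sample["input_ids"] += [pad_token_id] * pad_len
        sample["labels"] += [-100] * pad_len       # -100 is ignored by the loss
        sample["attention_mask"] += [0] * pad_len  # no attention on padding
    return sample
```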

nemo_automodel.components.datasets.llm.formatting_utils._warned_add_pad_token#

'set(…)'

nemo_automodel.components.datasets.llm.formatting_utils._add_pad_token(tokenizer)#

Add pad token to tokenizer if not present.
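A common implementation of this pattern reuses the EOS token as the pad token; a hedged sketch of what such a helper might do (the _warned_add_pad_token set above suggests the real helper also warns once per tokenizer, but this is a guess, not the verified source):

```python
import logging

logger = logging.getLogger(__name__)
_warned = set()

def add_pad_token(tokenizer) -> None:
    # Hypothetical sketch: fall back to EOS as the pad token when none is set.
    if tokenizer.pad_token is None and tokenizer.eos_token is not None:
        tokenizer.pad_token = tokenizer.eos_token
        if tokenizer.name_or_path not in _warned:
            logger.warning("Tokenizer had no pad token; using EOS as padding.")
            _warned.add(tokenizer.name_or_path)
```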

nemo_automodel.components.datasets.llm.formatting_utils._has_chat_template(
tokenizer: transformers.PreTrainedTokenizer,
) → bool#

Check if the tokenizer supports a chat template.

Parameters:

tokenizer – The tokenizer to check.

Returns:

True if the tokenizer supports a chat template, False otherwise.
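For Hugging Face tokenizers the template lives on the chat_template attribute, so the check can be as simple as the sketch below (an assumption about the implementation, not the verified source):

```python
def has_chat_template(tokenizer) -> bool:
    # A tokenizer supports chat formatting when a template string is attached.
    return getattr(tokenizer, "chat_template", None) is not None
```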

nemo_automodel.components.datasets.llm.formatting_utils._package_tokenized_example(
tokenizer,
input_ids,
assistant_masks,
eos_token_id,
pad_token_id,
seq_length,
truncation='do_not_truncate',
padding='do_not_pad',
)#

Package a tokenized example with proper masking and padding.

Parameters:
  • tokenizer – The tokenizer to use.

  • input_ids – The tokenized input ids.

  • assistant_masks – Boolean mask indicating which tokens are assistant/answer tokens (1) vs prompt tokens (0).

  • eos_token_id – The end-of-sequence token id.

  • pad_token_id – The padding token id.

  • seq_length – Optional sequence length for padding.

  • truncation – Optional truncation strategy.

  • padding – Optional padding strategy.

Returns:

A dictionary with input_ids, labels, and attention_mask.
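The masking step can be pictured with a simplified sketch (it ignores the truncation/padding strategies and the shared pad/EOS subtlety handled by _get_right_trailing_pad_mask):

```python
import torch

def package_example(input_ids, assistant_masks, pad_token_id):
    ids = torch.tensor(input_ids)
    labels = ids.clone()
    # Loss is computed only where assistant_masks == 1; prompt tokens get -100.
    labels[torch.tensor(assistant_masks) == 0] = -100
    # Attention covers every non-padding position.
    attention_mask = (ids != pad_token_id).long()
    return {
        "input_ids": ids.tolist(),
        "labels": labels.tolist(),
        "attention_mask": attention_mask.tolist(),
    }
```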

nemo_automodel.components.datasets.llm.formatting_utils.format_prompt_completion(
tokenizer: transformers.PreTrainedTokenizer,
prompt: str,
answer: str,
eos_token_id: int,
pad_token_id: int,
seq_length: Optional[int] = None,
padding: Union[str, bool] = 'do_not_pad',
truncation: Union[str, bool] = 'do_not_truncate',
answer_only_loss_mask: bool = True,
) → Dict[str, List[int]]#

Format a prompt-completion style example (without chat template).

Parameters:
  • tokenizer – The tokenizer to use.

  • prompt – The prompt string (e.g. context + question).

  • answer – The answer string.

  • eos_token_id – The end-of-sequence token id.

  • pad_token_id – The padding token id.

  • seq_length – Optional sequence length for padding.

  • padding – Optional padding strategy.

  • truncation – Optional truncation strategy.

  • answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.

Returns:

A dictionary with the formatted example.
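A hedged usage sketch (the checkpoint name is only an example; GPT-2 ships no pad token, hence the EOS fallback):

```python
from transformers import AutoTokenizer

from nemo_automodel.components.datasets.llm.formatting_utils import format_prompt_completion

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint only
example = format_prompt_completion(
    tokenizer,
    prompt="Q: What is the capital of France?\nA: ",
    answer="Paris",
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    seq_length=32,
)
print(sorted(example))  # ['attention_mask', 'input_ids', 'labels']
```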

nemo_automodel.components.datasets.llm.formatting_utils.format_chat_template(
tokenizer: transformers.PreTrainedTokenizer,
formatted_text: List[Dict[str, str]],
eos_token_id: int,
pad_token_id: int,
seq_length: Optional[int] = None,
padding: Union[str, bool] = 'do_not_pad',
truncation: Union[str, bool] = 'do_not_truncate',
tools: Optional[List[Dict]] = None,
answer_only_loss_mask: bool = True,
) → Dict[str, List[int]]#

Format a chat template style example.

Parameters:
  • tokenizer – The tokenizer to use.

  • formatted_text – The conversation as a list of message dicts (role/content pairs), rendered with the tokenizer's chat template.

  • eos_token_id – The end-of-sequence token id.

  • pad_token_id – The padding token id.

  • seq_length – Optional sequence length for padding.

  • padding – Optional padding strategy.

  • truncation – Optional truncation strategy.

  • tools – Optional list of tool definitions for function calling.

  • answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.

Returns:

A dictionary with the formatted example.
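And a matching sketch for the chat-template path (the checkpoint is illustrative; any tokenizer that passes the _has_chat_template check would do):

```python
from transformers import AutoTokenizer

from nemo_automodel.components.datasets.llm.formatting_utils import format_chat_template

# Example checkpoint only; it must ship a chat template.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]
example = format_chat_template(
    tokenizer,
    messages,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    seq_length=64,
)
print(sorted(example))  # ['attention_mask', 'input_ids', 'labels']
```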