nemo_automodel.components.datasets.llm.formatting_utils#

Module Contents#

Functions#

_get_right_trailing_pad_mask

Boolean mask identifying right-trailing padding positions.

_pad_to_seq_length

Pad a sample to a specific sequence length.

_add_pad_token

Add pad token to tokenizer if not present.

_has_chat_template

Check if the tokenizer supports a chat template.

_package_tokenized_example

Package a tokenized example with proper masking and padding.

format_prompt_completion

Format a prompt-completion style example (without chat template).

format_chat_template

Format a chat template style example.

Data#

API#

nemo_automodel.components.datasets.llm.formatting_utils.logger#

'getLogger(…)'

nemo_automodel.components.datasets.llm.formatting_utils.GENERATION_REGEX#

'compile(…)'

nemo_automodel.components.datasets.llm.formatting_utils._get_right_trailing_pad_mask(
sequence: torch.Tensor,
pad_token_id: int,
eos_token_id: int,
) → torch.Tensor#

Boolean mask identifying right-trailing padding positions.

When pad_token_id != eos_token_id, it is simply sequence == pad_token_id.

When the two IDs collide, a plain equality check would also match real EOS tokens inside the content. In that case the function locates the trailing contiguous run of the shared token and treats all positions after the first one in that run as padding. The first token in the trailing run is the real EOS and is kept unmasked so the model still learns to predict end-of-sequence.

Parameters:
  • sequence – 1-D token id tensor.

  • pad_token_id – The token id used for padding.

  • eos_token_id – The token id used for end-of-sequence. When equal to pad_token_id the positional trailing-run logic is used.

Returns:

Boolean tensor (same shape as sequence) where True = padding.
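The collision handling described above amounts to a short backward scan for the trailing run of the shared token. A minimal sketch of that behavior (an illustrative re-implementation of the documented logic, not the library source):

```python
import torch

def right_trailing_pad_mask(sequence: torch.Tensor, pad_token_id: int, eos_token_id: int) -> torch.Tensor:
    # Easy case: distinct IDs, so equality alone identifies padding.
    if pad_token_id != eos_token_id:
        return sequence == pad_token_id
    # Collision case: walk back from the end to find the trailing
    # contiguous run of the shared pad/EOS token.
    mask = torch.zeros_like(sequence, dtype=torch.bool)
    start = sequence.numel()
    while start > 0 and sequence[start - 1].item() == pad_token_id:
        start -= 1
    # Keep the first token of the run (the real EOS) unmasked; everything
    # after it is treated as padding.
    if start < sequence.numel():
        mask[start + 1:] = True
    return mask
```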

nemo_automodel.components.datasets.llm.formatting_utils._pad_to_seq_length(sample, pad_token_id, seq_length)#

Pad a sample to a specific sequence length.
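No parameter details are given here; a plausible sketch, assuming the sample is a dict with input_ids, labels, and attention_mask lists (the actual field set is not documented on this page):

```python
def pad_to_seq_length(sample: dict, pad_token_id: int, seq_length: int) -> dict:
    # Hypothetical sketch: right-pad every field out to seq_length.
    pad_len = seq_length - len(sample["input_ids"])
    if pad_len > 0:
        sample["input_ids"] += [pad_token_id] * pad_len
        sample["labels"] += [-100] * pad_len       # -100 is ignored by the loss
        sample["attention_mask"] += [0] * pad_len  # no attention on padding
    return sample
```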

nemo_automodel.components.datasets.llm.formatting_utils._warned_add_pad_token#

'set(…)'

nemo_automodel.components.datasets.llm.formatting_utils._add_pad_token(tokenizer)#

Add pad token to tokenizer if not present.
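A common implementation of this pattern reuses the EOS token as the pad token; a hedged sketch of what such a helper might do (the _warned_add_pad_token set above suggests the real helper also warns once per tokenizer, but this is a guess, not the verified source):

```python
import logging

logger = logging.getLogger(__name__)
_warned = set()

def add_pad_token(tokenizer) -> None:
    # Hypothetical sketch: fall back to EOS as the pad token when none is set.
    if tokenizer.pad_token is None and tokenizer.eos_token is not None:
        tokenizer.pad_token = tokenizer.eos_token
        if tokenizer.name_or_path not in _warned:
            logger.warning("Tokenizer had no pad token; using EOS as padding.")
            _warned.add(tokenizer.name_or_path)
```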

nemo_automodel.components.datasets.llm.formatting_utils._has_chat_template(
tokenizer: transformers.PreTrainedTokenizer,
) → bool#

Check if the tokenizer supports a chat template.

Parameters:

tokenizer – The tokenizer to check.

Returns:

True if the tokenizer supports a chat template, False otherwise.
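For Hugging Face tokenizers the template lives on the chat_template attribute, so the check can be as simple as the sketch below (an assumption about the implementation, not the verified source):

```python
def has_chat_template(tokenizer) -> bool:
    # A tokenizer supports chat formatting when a template string is attached.
    return getattr(tokenizer, "chat_template", None) is not None
```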

nemo_automodel.components.datasets.llm.formatting_utils._package_tokenized_example(
tokenizer,
input_ids,
assistant_masks,
eos_token_id,
pad_token_id,
seq_length,
truncation='do_not_truncate',
padding='do_not_pad',
)#

Package a tokenized example with proper masking and padding.

Parameters:
  • tokenizer – The tokenizer to use.

  • input_ids – The tokenized input ids.

  • assistant_masks – Boolean mask indicating which tokens are assistant/answer tokens (1) vs prompt tokens (0).

  • eos_token_id – The end-of-sequence token id.

  • pad_token_id – The padding token id.

  • seq_length – Optional sequence length for padding.

  • truncation – Optional truncation strategy.

  • padding – Optional padding strategy.

Returns:

A dictionary with input_ids, labels, and attention_mask.
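The masking step can be pictured with a simplified sketch (it ignores the truncation/padding strategies and the shared pad/EOS subtlety handled by _get_right_trailing_pad_mask):

```python
import torch

def package_example(input_ids, assistant_masks, pad_token_id):
    ids = torch.tensor(input_ids)
    labels = ids.clone()
    # Loss is computed only where assistant_masks == 1; prompt tokens get -100.
    labels[torch.tensor(assistant_masks) == 0] = -100
    # Attention covers every non-padding position.
    attention_mask = (ids != pad_token_id).long()
    return {
        "input_ids": ids.tolist(),
        "labels": labels.tolist(),
        "attention_mask": attention_mask.tolist(),
    }
```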

nemo_automodel.components.datasets.llm.formatting_utils.format_prompt_completion(
tokenizer: transformers.PreTrainedTokenizer,
prompt: str,
answer: str,
eos_token_id: int,
pad_token_id: int,
seq_length: Optional[int] = None,
padding: Union[str, bool] = 'do_not_pad',
truncation: Union[str, bool] = 'do_not_truncate',
answer_only_loss_mask: bool = True,
) → Dict[str, List[int]]#

Format a prompt-completion style example (without chat template).

Parameters:
  • tokenizer – The tokenizer to use.

  • prompt – The prompt string (e.g. context + question).

  • answer – The answer string.

  • eos_token_id – The end-of-sequence token id.

  • pad_token_id – The padding token id.

  • seq_length – Optional sequence length for padding.

  • padding – Optional padding strategy.

  • truncation – Optional truncation strategy.

  • answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.

Returns:

A dictionary with the formatted example.
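A hedged usage sketch (the checkpoint name is only an example; GPT-2 ships no pad token, hence the EOS fallback):

```python
from transformers import AutoTokenizer

from nemo_automodel.components.datasets.llm.formatting_utils import format_prompt_completion

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint only
example = format_prompt_completion(
    tokenizer,
    prompt="Q: What is the capital of France?\nA: ",
    answer="Paris",
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    seq_length=32,
)
print(sorted(example))  # ['attention_mask', 'input_ids', 'labels']
```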

nemo_automodel.components.datasets.llm.formatting_utils.format_chat_template(
tokenizer: transformers.PreTrainedTokenizer,
formatted_text: List[Dict[str, str]],
eos_token_id: int,
pad_token_id: int,
seq_length: Optional[int] = None,
padding: Union[str, bool] = 'do_not_pad',
truncation: Union[str, bool] = 'do_not_truncate',
tools: Optional[List[Dict]] = None,
answer_only_loss_mask: bool = True,
) → Dict[str, List[int]]#

Format a chat template style example.

Parameters:
  • tokenizer – The tokenizer to use.

  • formatted_text – The conversation as a list of message dicts (role/content pairs), rendered with the tokenizer's chat template.

  • eos_token_id – The end-of-sequence token id.

  • pad_token_id – The padding token id.

  • seq_length – Optional sequence length for padding.

  • padding – Optional padding strategy.

  • truncation – Optional truncation strategy.

  • tools – Optional list of tool definitions for function calling.

  • answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.

Returns:

A dictionary with the formatted example.
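And a matching sketch for the chat-template path (the checkpoint is illustrative; any tokenizer that passes the _has_chat_template check would do):

```python
from transformers import AutoTokenizer

from nemo_automodel.components.datasets.llm.formatting_utils import format_chat_template

# Example checkpoint only; it must ship a chat template.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]
example = format_chat_template(
    tokenizer,
    messages,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    seq_length=64,
)
print(sorted(example))  # ['attention_mask', 'input_ids', 'labels']
```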