nemo_automodel.components.datasets.llm.formatting_utils#

Module Contents#

Functions#

_pad_to_seq_length

Pad a sample to a specific sequence length.

_add_pad_token

Add pad token to tokenizer if not present.

_has_chat_template

Check if the tokenizer supports a chat template.

_package_tokenized_example

Package a tokenized example with proper masking and padding.

format_prompt_completion

Format a prompt-completion style example (without chat template).

format_chat_template

Format a chat template style example.

Data#

API#

nemo_automodel.components.datasets.llm.formatting_utils.logger#

'getLogger(…)'

nemo_automodel.components.datasets.llm.formatting_utils.GENERATION_REGEX#

'compile(…)'

nemo_automodel.components.datasets.llm.formatting_utils._pad_to_seq_length(sample, pad_token_id, seq_length)#

Pad a sample to a specific sequence length.
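A minimal sketch of the padding behaviour described here, not the library's actual implementation: `input_ids` are extended with `pad_token_id`, `labels` with the ignore index `-100`, and `attention_mask` with `0` until the sample reaches `seq_length`. The dict keys follow the return shape documented for `_package_tokenized_example` below; the helper name is illustrative.

```python
def pad_to_seq_length(sample, pad_token_id, seq_length):
    """Right-pad a tokenized sample dict to seq_length (sketch)."""
    pad_len = seq_length - len(sample["input_ids"])
    if pad_len <= 0:
        return sample  # already at or beyond the target length
    sample["input_ids"] = sample["input_ids"] + [pad_token_id] * pad_len
    # -100 is the conventional ignore index, so padding is excluded from the loss.
    sample["labels"] = sample["labels"] + [-100] * pad_len
    sample["attention_mask"] = sample["attention_mask"] + [0] * pad_len
    return sample


example = {"input_ids": [5, 6, 7], "labels": [-100, 6, 7], "attention_mask": [1, 1, 1]}
padded = pad_to_seq_length(example, pad_token_id=0, seq_length=6)
```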

nemo_automodel.components.datasets.llm.formatting_utils._add_pad_token(tokenizer)#

Add pad token to tokenizer if not present.

nemo_automodel.components.datasets.llm.formatting_utils._has_chat_template(
tokenizer: transformers.PreTrainedTokenizer,
) bool#

Check if the tokenizer supports a chat template.

Parameters:

tokenizer – The tokenizer to check.

Returns:

True if the tokenizer supports a chat template, False otherwise.
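A plausible sketch of this check, assuming it relies on the `chat_template` attribute that Hugging Face tokenizers expose; the dummy classes below are illustrative stand-ins for real tokenizers.

```python
def has_chat_template(tokenizer) -> bool:
    """Return True if the tokenizer carries a non-empty chat template (sketch)."""
    return bool(getattr(tokenizer, "chat_template", None))


class WithTemplate:
    # Jinja-style template string, as stored on HF tokenizers.
    chat_template = "{% for message in messages %}{{ message['content'] }}{% endfor %}"


class WithoutTemplate:
    pass


supports = has_chat_template(WithTemplate())
lacks = has_chat_template(WithoutTemplate())
```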

nemo_automodel.components.datasets.llm.formatting_utils._package_tokenized_example(
tokenizer,
input_ids,
assistant_masks,
eos_token_id,
pad_token_id,
seq_length,
truncation='do_not_truncate',
padding='do_not_pad',
)#

Package a tokenized example with proper masking and padding.

Parameters:
  • tokenizer – The tokenizer to use.

  • input_ids – The tokenized input ids.

  • assistant_masks – Boolean mask indicating which tokens are assistant/answer tokens (1) vs prompt tokens (0).

  • eos_token_id – The end-of-sequence token id.

  • pad_token_id – The padding token id.

  • seq_length – Optional sequence length for padding.

  • truncation – Optional truncation strategy.

  • padding – Optional padding strategy.

Returns:

A dictionary with input_ids, labels, and attention_mask.
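The packaging step can be sketched as follows. This is an assumption-laden illustration, not the library's code: it keeps assistant tokens (mask `1`) as labels, masks prompt tokens (mask `0`) with `-100`, appends `eos_token_id` if missing, and optionally right-pads to `seq_length`, matching the return shape documented above.

```python
def package_tokenized_example(input_ids, assistant_masks, eos_token_id,
                              pad_token_id, seq_length=None):
    """Build input_ids/labels/attention_mask from a mask over tokens (sketch)."""
    if input_ids[-1] != eos_token_id:
        # Ensure the sequence ends with EOS and include it in the loss.
        input_ids = input_ids + [eos_token_id]
        assistant_masks = assistant_masks + [1]
    # Prompt tokens get -100 so only assistant/answer tokens contribute to the loss.
    labels = [tok if m == 1 else -100 for tok, m in zip(input_ids, assistant_masks)]
    attention_mask = [1] * len(input_ids)
    if seq_length is not None:
        pad_len = seq_length - len(input_ids)
        input_ids = input_ids + [pad_token_id] * pad_len
        labels = labels + [-100] * pad_len
        attention_mask = attention_mask + [0] * pad_len
    return {"input_ids": input_ids, "labels": labels, "attention_mask": attention_mask}


out = package_tokenized_example(
    input_ids=[1, 2, 3], assistant_masks=[0, 0, 1],
    eos_token_id=9, pad_token_id=0, seq_length=6,
)
```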

nemo_automodel.components.datasets.llm.formatting_utils.format_prompt_completion(
tokenizer: transformers.PreTrainedTokenizer,
prompt: str,
answer: str,
eos_token_id: int,
pad_token_id: int,
seq_length: Optional[int] = None,
padding: Union[str, bool] = 'do_not_pad',
truncation: Union[str, bool] = 'do_not_truncate',
answer_only_loss_mask: bool = True,
) Dict[str, List[int]]#

Format a prompt-completion style example (without chat template).

Parameters:
  • tokenizer – The tokenizer to use.

  • prompt – The prompt string (e.g. context + question).

  • answer – The answer string.

  • eos_token_id – The end-of-sequence token id.

  • pad_token_id – The padding token id.

  • seq_length – Optional sequence length for padding.

  • padding – Optional padding strategy.

  • truncation – Optional truncation strategy.

  • answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.

Returns:

A dictionary with input_ids, labels, and attention_mask.
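A conceptual sketch of prompt-completion formatting with an answer-only loss mask, using a toy character-level tokenizer. The tokenizer class and helper name are illustrative, not part of this module; the real function operates on a `transformers.PreTrainedTokenizer`.

```python
class ToyTokenizer:
    """Illustrative character-level tokenizer standing in for a real one."""
    def __call__(self, text):
        return {"input_ids": [ord(c) % 100 for c in text]}


def format_prompt_completion_sketch(tokenizer, prompt, answer, eos_token_id,
                                    answer_only_loss_mask=True):
    prompt_ids = tokenizer(prompt)["input_ids"]
    answer_ids = tokenizer(answer)["input_ids"] + [eos_token_id]
    input_ids = prompt_ids + answer_ids
    if answer_only_loss_mask:
        # Loss is computed only over answer tokens; prompt tokens are masked.
        labels = [-100] * len(prompt_ids) + answer_ids
    else:
        labels = list(input_ids)
    return {"input_ids": input_ids, "labels": labels,
            "attention_mask": [1] * len(input_ids)}


sample = format_prompt_completion_sketch(ToyTokenizer(), "hi", "ok", eos_token_id=99)
```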

nemo_automodel.components.datasets.llm.formatting_utils.format_chat_template(
tokenizer: transformers.PreTrainedTokenizer,
formatted_text: List[Dict[str, str]],
eos_token_id: int,
pad_token_id: int,
seq_length: Optional[int] = None,
padding: Union[str, bool] = 'do_not_pad',
truncation: Union[str, bool] = 'do_not_truncate',
tools: Optional[List[Dict]] = None,
answer_only_loss_mask: bool = True,
) Dict[str, List[int]]#

Format a chat template style example.

Parameters:
  • tokenizer – The tokenizer to use.

  • formatted_text – The list of chat messages, each a dict with role and content keys.

  • eos_token_id – The end-of-sequence token id.

  • pad_token_id – The padding token id.

  • seq_length – Optional sequence length for padding.

  • padding – Optional padding strategy.

  • truncation – Optional truncation strategy.

  • tools – Optional list of tool definitions for function calling.

  • answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.

Returns:

A dictionary with input_ids, labels, and attention_mask.
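An illustrative example of the `formatted_text` argument: a list of role/content message dicts, the shape consumed by Hugging Face chat templates. The message contents are made up, and the commented-out call assumes a real tokenizer is available.

```python
# Hypothetical conversation; the roles shown are the ones HF chat
# templates conventionally recognize.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]

# With a real tokenizer, this would be passed as:
# format_chat_template(tokenizer, messages, eos_token_id=..., pad_token_id=...)
roles = [m["role"] for m in messages]
```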