nemo_automodel.components.datasets.llm.formatting_utils#

Module Contents#

Functions#

_pad_to_seq_length

Pad a sample to a specific sequence length.

_add_pad_token

Add pad token to tokenizer if not present.

_has_chat_template

Check if the tokenizer supports a chat template.

_package_tokenized_example

Package a tokenized example with proper masking and padding.

format_prompt_completion

Format a prompt-completion style example (without chat template).

format_chat_template

Format a chat template style example.

API#

nemo_automodel.components.datasets.llm.formatting_utils._pad_to_seq_length(sample, pad_token_id, seq_length)#

Pad a sample to a specific sequence length.
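The padding step can be sketched as follows. This is a minimal sketch, not the actual implementation: the dictionary keys and the `-100` label value are assumptions, chosen because `-100` is the index ignored by standard cross-entropy loss in Hugging Face training code.

```python
def pad_to_seq_length(sample, pad_token_id, seq_length):
    """Sketch: right-pad input_ids/labels/attention_mask out to seq_length."""
    pad_len = max(0, seq_length - len(sample["input_ids"]))
    sample["input_ids"] = sample["input_ids"] + [pad_token_id] * pad_len
    # -100 is the label value ignored by cross-entropy loss
    sample["labels"] = sample["labels"] + [-100] * pad_len
    sample["attention_mask"] = sample["attention_mask"] + [0] * pad_len
    return sample

sample = {"input_ids": [5, 6, 7], "labels": [5, 6, 7], "attention_mask": [1, 1, 1]}
padded = pad_to_seq_length(sample, pad_token_id=0, seq_length=6)
```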

nemo_automodel.components.datasets.llm.formatting_utils._add_pad_token(tokenizer)#

Add pad token to tokenizer if not present.
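A common convention for this is to reuse the EOS token as padding when no pad token is configured; whether this helper does exactly that is an assumption. A sketch, using a stand-in object rather than a real `transformers` tokenizer:

```python
from types import SimpleNamespace

def add_pad_token(tokenizer):
    """Sketch: fall back to the EOS token when no pad token is configured."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

# stand-in for a transformers tokenizer that lacks a pad token
tok = SimpleNamespace(pad_token=None, eos_token="</s>")
add_pad_token(tok)
```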

nemo_automodel.components.datasets.llm.formatting_utils._has_chat_template(
tokenizer: transformers.PreTrainedTokenizer,
) → bool#

Check if the tokenizer supports a chat template.

Parameters:

tokenizer – The tokenizer to check.

Returns:

True if the tokenizer supports a chat template, False otherwise.
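Hugging Face tokenizers expose a `chat_template` attribute when one is defined, so the check plausibly reduces to an attribute test. A sketch under that assumption, again with stand-in objects:

```python
from types import SimpleNamespace

def has_chat_template(tokenizer) -> bool:
    """Sketch: a tokenizer supports chat formatting if chat_template is set."""
    return getattr(tokenizer, "chat_template", None) is not None

chat_tok = SimpleNamespace(chat_template="{% for m in messages %}...{% endfor %}")
plain_tok = SimpleNamespace(chat_template=None)
```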

nemo_automodel.components.datasets.llm.formatting_utils._package_tokenized_example(
has_chat_template,
input_ids,
eos_token_id,
pad_token_id,
seq_length,
context_len,
)#

Package a tokenized example with proper masking and padding.

Parameters:
  • has_chat_template – Whether the tokenizer has a chat template.

  • input_ids – The tokenized input ids.

  • eos_token_id – The end-of-sequence token id.

  • pad_token_id – The padding token id.

  • seq_length – Optional sequence length for padding.

  • context_len – Length of the context/prompt (to mask in labels).

Returns:

A dictionary with input_ids, labels, and attention_mask.
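The masking and padding described above can be sketched as below. The `-100` ignore index follows the usual Hugging Face convention, and the `has_chat_template`-specific handling of the real helper is omitted; both are assumptions, not the actual implementation.

```python
def package_tokenized_example(input_ids, eos_token_id, pad_token_id,
                              seq_length, context_len):
    """Sketch: mask the context in labels, then optionally pad to seq_length."""
    ids = list(input_ids) + [eos_token_id]
    # mask prompt tokens so loss is computed on the completion only
    labels = [-100] * context_len + ids[context_len:]
    attention_mask = [1] * len(ids)
    if seq_length is not None:
        pad_len = max(0, seq_length - len(ids))
        ids += [pad_token_id] * pad_len
        labels += [-100] * pad_len
        attention_mask += [0] * pad_len
    return {"input_ids": ids, "labels": labels, "attention_mask": attention_mask}

ex = package_tokenized_example([10, 11, 12, 13], eos_token_id=2,
                               pad_token_id=0, seq_length=8, context_len=2)
```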

nemo_automodel.components.datasets.llm.formatting_utils.format_prompt_completion(
tokenizer: transformers.PreTrainedTokenizer,
prompt: str,
answer: str,
eos_token_id: int,
pad_token_id: int,
seq_length: Optional[int] = None,
answer_only_loss_mask: bool = True,
) → Dict[str, List[int]]#

Format a prompt-completion style example (without chat template).

Parameters:
  • tokenizer – The tokenizer to use.

  • prompt – The prompt string (e.g. context + question).

  • answer – The answer string.

  • eos_token_id – The end-of-sequence token id.

  • pad_token_id – The padding token id.

  • seq_length – Optional sequence length for padding.

  • answer_only_loss_mask – Whether to compute the loss only on the answer tokens, masking the prompt in the labels.

Returns:

A dictionary with input_ids, labels, and attention_mask.
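The key step in the prompt-completion path is deriving the context length: the prompt and the full prompt+answer string are tokenized, and the prompt's token count gives the span to mask in the labels. A sketch using a toy character-level "tokenizer" in place of a real `transformers.PreTrainedTokenizer`:

```python
def toy_encode(text):
    # toy stand-in: one token id per character
    return [ord(c) for c in text]

prompt, answer = "Q: 2+2?\nA: ", "4"
prompt_ids = toy_encode(prompt)
full_ids = toy_encode(prompt + answer)
context_len = len(prompt_ids)          # number of tokens to mask in labels
labels = [-100] * context_len + full_ids[context_len:]
```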

nemo_automodel.components.datasets.llm.formatting_utils.format_chat_template(
tokenizer: transformers.PreTrainedTokenizer,
formatted_text: List[Dict[str, str]],
eos_token_id: int,
pad_token_id: int,
seq_length: Optional[int] = None,
start_of_turn_token: Optional[str] = None,
tools: Optional[List[Dict]] = None,
) → Dict[str, List[int]]#

Format a chat template style example.

Parameters:
  • tokenizer – The tokenizer to use.

  • formatted_text – The list of chat messages (role/content dicts) to format, with role tags embedded in the content.

  • eos_token_id – The end-of-sequence token id.

  • pad_token_id – The padding token id.

  • seq_length – Optional sequence length for padding.

  • start_of_turn_token – The start of turn token string.

  • tools – Optional list of tool definitions passed to the tokenizer's chat template.

Returns:

A dictionary with input_ids, labels, and attention_mask.
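For illustration, the `List[Dict[str, str]]` annotation suggests `formatted_text` follows the standard Hugging Face messages schema; treating it that way is an assumption based on the type hint, not a confirmed contract.

```python
# Hypothetical input for format_chat_template, assuming the standard
# role/content messages schema used by tokenizer chat templates.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]
```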