nemo_automodel.components.datasets.llm.formatting_utils#
Module Contents#
Functions#
| _resolve_chat_template | Resolve a chat template string that may be a file path. |
| _get_right_trailing_pad_mask | Boolean mask identifying right-trailing padding positions. |
| _pad_to_seq_length | Pad a sample to a specific sequence length. |
| _add_pad_token | Add pad token to tokenizer if not present. |
| _has_chat_template | Check if the tokenizer supports a chat template. |
| _package_tokenized_example | Package a tokenized example with proper masking and padding. |
| format_prompt_completion | Format a prompt-completion style example (without chat template). |
| format_chat_template | Format a chat template style example. |
Data#
API#
- nemo_automodel.components.datasets.llm.formatting_utils.logger#
'getLogger(...)'
- nemo_automodel.components.datasets.llm.formatting_utils._resolve_chat_template(chat_template: Optional[str])#
Resolve a chat template string that may be a file path.
If chat_template points to an existing file, its contents are returned. If opening it as a file fails and the string contains Jinja-like characters ({, }, or newlines), it is treated as a literal template. Otherwise a ValueError is raised so the caller knows the path was invalid.
- Parameters:
chat_template – A Jinja template string or a path to a template file.
- Returns:
The resolved template string, or None when the input is None.
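The resolution rules above can be sketched in plain Python. This is a hypothetical reimplementation for illustration, not the module's actual code:

```python
import os

def resolve_chat_template(chat_template):
    """Illustrative sketch of the resolution rules (hypothetical reimplementation)."""
    if chat_template is None:
        return None
    # Rule 1: an existing file path resolves to the file's contents.
    if os.path.isfile(chat_template):
        with open(chat_template) as f:
            return f.read()
    # Rule 2: strings with Jinja-like characters are treated as literal templates.
    if any(ch in chat_template for ch in ("{", "}", "\n")):
        return chat_template
    # Rule 3: anything else looks like an invalid path, so fail loudly.
    raise ValueError(f"chat_template is neither a readable file nor a template: {chat_template!r}")
```

Note that a bare string such as `"my_template.jinja"` that does not exist on disk raises rather than being silently treated as a template.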
- nemo_automodel.components.datasets.llm.formatting_utils.GENERATION_REGEX#
'compile(...)'
- nemo_automodel.components.datasets.llm.formatting_utils._get_right_trailing_pad_mask(sequence: torch.Tensor, pad_token_id: int, eos_token_id: int)#
Boolean mask identifying right-trailing padding positions.
When pad_token_id != eos_token_id, the mask is simply sequence == pad_token_id. When the two IDs collide, a plain equality check would also match real EOS tokens inside the content. In that case the function locates the trailing contiguous run of the shared token and treats all positions after the first one in that run as padding. The first token in the trailing run is the real EOS and is kept unmasked so the model still learns to predict end-of-sequence.
- Parameters:
sequence – 1-D token id tensor.
pad_token_id – The token id used for padding.
eos_token_id – The token id used for end-of-sequence. When equal to pad_token_id, the positional trailing-run logic is used.
- Returns:
Boolean tensor (same shape as sequence) where True = padding.
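The trailing-run logic can be sketched with plain lists (the real function operates on torch tensors; this is an illustrative reimplementation):

```python
def right_trailing_pad_mask(sequence, pad_token_id, eos_token_id):
    """Sketch of the trailing-run masking logic on a plain list of token ids."""
    if pad_token_id != eos_token_id:
        # Distinct ids: a direct equality check is safe.
        return [tok == pad_token_id for tok in sequence]
    # Shared id: walk backwards to find the trailing contiguous run of the token.
    start = len(sequence)
    while start > 0 and sequence[start - 1] == pad_token_id:
        start -= 1
    mask = [False] * len(sequence)
    # Keep the first token of the run unmasked (the real EOS); mask the rest.
    for i in range(start + 1, len(sequence)):
        mask[i] = True
    return mask
```

For example, with pad == eos == 2, the sequence `[7, 8, 2, 2, 2]` keeps the first trailing `2` as the real EOS and marks only the last two positions as padding.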
- nemo_automodel.components.datasets.llm.formatting_utils._pad_to_seq_length(sample, pad_token_id, seq_length)#
Pad a sample to a specific sequence length.
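The padding step amounts to right-padding the token ids up to the target length. A minimal sketch, assuming the sample is a plain list of ids (the real helper works on the tokenized sample dict produced by this module):

```python
def pad_to_seq_length(sample, pad_token_id, seq_length):
    """Illustrative sketch: right-pad a list of token ids to seq_length."""
    pad_amount = seq_length - len(sample)
    if pad_amount <= 0:
        # Already at or past the target length; nothing to add.
        return sample
    return sample + [pad_token_id] * pad_amount
```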
- nemo_automodel.components.datasets.llm.formatting_utils._warned_add_pad_token#
'set(...)'
- nemo_automodel.components.datasets.llm.formatting_utils._add_pad_token(tokenizer)#
Add a pad token to the tokenizer if one is not already present.
- nemo_automodel.components.datasets.llm.formatting_utils._has_chat_template(tokenizer: transformers.PreTrainedTokenizer)#
Check if the tokenizer supports a chat template.
- Parameters:
tokenizer – The tokenizer to check.
- Returns:
True if the tokenizer supports a chat template, False otherwise.
- nemo_automodel.components.datasets.llm.formatting_utils._package_tokenized_example(
- tokenizer,
- input_ids,
- assistant_masks,
- eos_token_id,
- pad_token_id,
- seq_length,
- truncation='do_not_truncate',
- padding='do_not_pad',
- )#
Package a tokenized example with proper masking and padding.
- Parameters:
tokenizer – The tokenizer to use.
input_ids – The tokenized input ids.
assistant_masks – Boolean mask indicating which tokens are assistant/answer tokens (1) vs prompt tokens (0).
eos_token_id – The end-of-sequence token id.
pad_token_id – The padding token id.
seq_length – Optional sequence length for padding.
truncation – Optional truncation strategy.
padding – Optional padding strategy.
- Returns:
A dictionary with input_ids, labels, and attention_mask.
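A common way to build such a dictionary is to copy input_ids into labels and replace non-assistant and padding positions with -100, the ignore index used by PyTorch's cross-entropy loss. A sketch under that assumption (the real helper additionally handles truncation, padding to seq_length, and EOS bookkeeping):

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss

def package_tokenized_example(input_ids, assistant_masks, pad_token_id):
    """Illustrative sketch of packaging a tokenized example."""
    # Loss is computed only where the assistant mask is set.
    labels = [
        tok if is_assistant else IGNORE_INDEX
        for tok, is_assistant in zip(input_ids, assistant_masks)
    ]
    # Attention covers every real token; padding positions are zeroed out.
    attention_mask = [0 if tok == pad_token_id else 1 for tok in input_ids]
    return {"input_ids": input_ids, "labels": labels, "attention_mask": attention_mask}
```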
- nemo_automodel.components.datasets.llm.formatting_utils.format_prompt_completion(
- tokenizer: transformers.PreTrainedTokenizer,
- prompt: str,
- answer: str,
- eos_token_id: int,
- pad_token_id: int,
- seq_length: Optional[int] = None,
- padding: Union[str, bool] = 'do_not_pad',
- truncation: Union[str, bool] = 'do_not_truncate',
- answer_only_loss_mask: bool = True,
- )#
Format a prompt-completion style example (without chat template).
- Parameters:
tokenizer – The tokenizer to use.
prompt – The prompt string (e.g. context + question).
answer – The answer string.
eos_token_id – The end-of-sequence token id.
pad_token_id – The padding token id.
seq_length – Optional sequence length for padding.
- Returns:
A dictionary with the formatted example.
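Conceptually, prompt-completion formatting tokenizes the prompt and answer separately, concatenates them, and marks only the answer tokens (plus EOS) for the loss. A toy sketch using a hypothetical word-to-id vocab in place of a real HF tokenizer:

```python
def format_prompt_completion_sketch(prompt, answer, eos_token_id, vocab):
    """Toy sketch of prompt-completion formatting; `vocab` stands in for a tokenizer."""
    prompt_ids = [vocab[w] for w in prompt.split()]
    # Append EOS after the answer so the model learns to stop.
    answer_ids = [vocab[w] for w in answer.split()] + [eos_token_id]
    input_ids = prompt_ids + answer_ids
    # With answer_only_loss_mask=True, only answer tokens carry loss.
    assistant_masks = [0] * len(prompt_ids) + [1] * len(answer_ids)
    return input_ids, assistant_masks
```

The resulting ids and mask would then be packaged into input_ids/labels/attention_mask as described for _package_tokenized_example above.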
- nemo_automodel.components.datasets.llm.formatting_utils.format_chat_template(
- tokenizer: transformers.PreTrainedTokenizer,
- formatted_text: List[Dict[str, str]],
- eos_token_id: int,
- pad_token_id: int,
- seq_length: Optional[int] = None,
- padding: Union[str, bool] = 'do_not_pad',
- truncation: Union[str, bool] = 'do_not_truncate',
- tools: Optional[List[Dict]] = None,
- answer_only_loss_mask: bool = True,
- )#
Format a chat template style example.
- Parameters:
tokenizer – The tokenizer to use.
formatted_text – The formatted text, with role tags embedded in the content.
eos_token_id – The end-of-sequence token id.
pad_token_id – The padding token id.
seq_length – Optional sequence length for padding.
tools – Optional list of tool definitions for function calling.
answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.
- Returns:
A dictionary with the formatted example.
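Given the List[Dict[str, str]] annotation, formatted_text is presumably the standard messages structure consumed by Hugging Face chat templates: a list of dicts with "role" and "content" keys. An illustrative input (the exact roles accepted depend on the tokenizer's template):

```python
# Example of the List[Dict[str, str]] messages structure for a chat-template example.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]

# Every message carries at least a role and content field.
assert all({"role", "content"} <= set(m) for m in messages)
```

With answer_only_loss_mask=True, only the assistant turn(s) would contribute to the training loss.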