nemo_automodel.components.datasets.llm.formatting_utils#
Module Contents#
Functions#
| Function | Description |
|---|---|
| `_get_right_trailing_pad_mask` | Boolean mask identifying right-trailing padding positions. |
| `_pad_to_seq_length` | Pad a sample to a specific sequence length. |
| `_add_pad_token` | Add pad token to tokenizer if not present. |
| `_has_chat_template` | Check if the tokenizer supports a chat template. |
| `_package_tokenized_example` | Package a tokenized example with proper masking and padding. |
| `format_prompt_completion` | Format a prompt-completion style example (without chat template). |
| `format_chat_template` | Format a chat template style example. |
Data#
API#
- nemo_automodel.components.datasets.llm.formatting_utils.logger#
getLogger(…)
- nemo_automodel.components.datasets.llm.formatting_utils.GENERATION_REGEX#
compile(…)
- nemo_automodel.components.datasets.llm.formatting_utils._get_right_trailing_pad_mask(
- sequence: torch.Tensor,
- pad_token_id: int,
- eos_token_id: int,
Boolean mask identifying right-trailing padding positions.
When pad_token_id != eos_token_id, the mask is simply sequence == pad_token_id. When the two IDs collide, a plain equality check would also match real EOS tokens inside the content. In that case the function locates the trailing contiguous run of the shared token and treats all positions after the first one in that run as padding. The first token in the trailing run is the real EOS and is kept unmasked so the model still learns to predict end-of-sequence.
- Parameters:
sequence – 1-D token id tensor.
pad_token_id – The token id used for padding.
eos_token_id – The token id used for end-of-sequence. When equal to pad_token_id the positional trailing-run logic is used.
- Returns:
Boolean tensor (same shape as sequence) where True = padding.
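The two-branch logic described above can be sketched in pure Python (plain lists stand in for torch tensors; the function name is illustrative, not the module's actual signature):

```python
def right_trailing_pad_mask(sequence, pad_token_id, eos_token_id):
    """Boolean mask of right-trailing padding positions (True = padding)."""
    if pad_token_id != eos_token_id:
        # Distinct IDs: a plain equality check is safe.
        return [tok == pad_token_id for tok in sequence]
    # Shared ID: walk back over the trailing contiguous run of the token.
    start = len(sequence)
    while start > 0 and sequence[start - 1] == pad_token_id:
        start -= 1
    # Keep the first token of the run (the real EOS) unmasked;
    # everything after it in the run counts as padding.
    return [i > start for i in range(len(sequence))]
```

Note that the EOS token at position 2 in the second example below is inside the content and is correctly left unmasked.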
- nemo_automodel.components.datasets.llm.formatting_utils._pad_to_seq_length(sample, pad_token_id, seq_length)#
Pad a sample to a specific sequence length.
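Right-padding to a fixed length can be sketched as follows (illustrative name; the real helper operates on the sample's tensor fields):

```python
def pad_to_seq_length(ids, pad_token_id, seq_length):
    # Append pad tokens on the right; samples already at or over
    # seq_length are returned unchanged.
    return ids + [pad_token_id] * max(0, seq_length - len(ids))
```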
- nemo_automodel.components.datasets.llm.formatting_utils._warned_add_pad_token#
set(…)
- nemo_automodel.components.datasets.llm.formatting_utils._add_pad_token(tokenizer)#
Add pad token to tokenizer if not present.
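A common fallback for this situation (a sketch of the general pattern, not necessarily what this module does) is to reuse the EOS token as padding when no pad token is defined:

```python
def add_pad_token(tokenizer):
    # If the tokenizer defines no pad token, fall back to its EOS token.
    if getattr(tokenizer, "pad_token", None) is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer
```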
- nemo_automodel.components.datasets.llm.formatting_utils._has_chat_template(
- tokenizer: transformers.PreTrainedTokenizer,
Check if the tokenizer supports a chat template.
- Parameters:
tokenizer – The tokenizer to check.
- Returns:
True if the tokenizer supports a chat template, False otherwise.
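In transformers, chat support is signalled by a non-empty `chat_template` attribute on the tokenizer, so a minimal check (an illustrative sketch) might look like:

```python
def has_chat_template(tokenizer):
    # True when the tokenizer carries a (Jinja) chat template string.
    return bool(getattr(tokenizer, "chat_template", None))
```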
- nemo_automodel.components.datasets.llm.formatting_utils._package_tokenized_example(
- tokenizer,
- input_ids,
- assistant_masks,
- eos_token_id,
- pad_token_id,
- seq_length,
- truncation='do_not_truncate',
- padding='do_not_pad',
Package a tokenized example with proper masking and padding.
- Parameters:
tokenizer – The tokenizer to use.
input_ids – The tokenized input ids.
assistant_masks – Boolean mask indicating which tokens are assistant/answer tokens (1) vs prompt tokens (0).
eos_token_id – The end-of-sequence token id.
pad_token_id – The padding token id.
seq_length – Optional sequence length for padding.
truncation – Optional truncation strategy.
padding – Optional padding strategy.
- Returns:
A dictionary with input_ids, labels, and attention_mask.
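The packaging step can be sketched in pure Python (illustrative names; -100 is the label value that PyTorch's cross-entropy loss ignores by default, and pad and EOS are assumed distinct here):

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def package_tokenized_example(input_ids, assistant_masks, pad_token_id):
    # Labels mirror input_ids, but prompt (non-assistant) positions are
    # replaced with IGNORE_INDEX so the loss covers answer tokens only.
    labels = [tok if keep else IGNORE_INDEX
              for tok, keep in zip(input_ids, assistant_masks)]
    # Attend to every real token, not to padding.
    attention_mask = [int(tok != pad_token_id) for tok in input_ids]
    return {"input_ids": input_ids,
            "labels": labels,
            "attention_mask": attention_mask}
```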
- nemo_automodel.components.datasets.llm.formatting_utils.format_prompt_completion(
- tokenizer: transformers.PreTrainedTokenizer,
- prompt: str,
- answer: str,
- eos_token_id: int,
- pad_token_id: int,
- seq_length: Optional[int] = None,
- padding: Union[str, bool] = 'do_not_pad',
- truncation: Union[str, bool] = 'do_not_truncate',
- answer_only_loss_mask: bool = True,
Format a prompt-completion style example (without chat template).
- Parameters:
tokenizer – The tokenizer to use.
prompt – The prompt string (e.g. context + question).
answer – The answer string.
eos_token_id – The end-of-sequence token id.
pad_token_id – The padding token id.
seq_length – Optional sequence length for padding.
padding – Optional padding strategy.
truncation – Optional truncation strategy.
answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.
- Returns:
A dictionary with the formatted example.
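The core idea is that the prompt and answer are tokenized separately so the loss mask can cover only the answer plus the trailing EOS. This can be sketched with a toy character-level encoder (all names below are illustrative, not the real signature):

```python
IGNORE_INDEX = -100  # label value ignored by the loss

def format_prompt_completion(encode, prompt, answer, eos_token_id):
    prompt_ids = encode(prompt)
    answer_ids = encode(answer) + [eos_token_id]
    input_ids = prompt_ids + answer_ids
    # Loss only on answer tokens (and EOS); prompt positions are ignored.
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return {"input_ids": input_ids,
            "labels": labels,
            "attention_mask": [1] * len(input_ids)}

encode = lambda s: [ord(c) for c in s]  # toy character-level "tokenizer"
example = format_prompt_completion(encode, "Q: 2+2? ", "4", eos_token_id=0)
```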
- nemo_automodel.components.datasets.llm.formatting_utils.format_chat_template(
- tokenizer: transformers.PreTrainedTokenizer,
- formatted_text: List[Dict[str, str]],
- eos_token_id: int,
- pad_token_id: int,
- seq_length: Optional[int] = None,
- padding: Union[str, bool] = 'do_not_pad',
- truncation: Union[str, bool] = 'do_not_truncate',
- tools: Optional[List[Dict]] = None,
- answer_only_loss_mask: bool = True,
Format a chat template style example.
- Parameters:
tokenizer – The tokenizer to use.
formatted_text – The conversation as a list of role/content message dicts.
eos_token_id – The end-of-sequence token id.
pad_token_id – The padding token id.
seq_length – Optional sequence length for padding.
padding – Optional padding strategy.
truncation – Optional truncation strategy.
tools – Optional list of tool definitions for function calling.
answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.
- Returns:
A dictionary with the formatted example.
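The real function renders the conversation through the tokenizer's own chat template; the toy sketch below (illustrative names, a made-up `role: content` template) only shows how an assistant-only mask is derived from a message list:

```python
def format_chat_template(messages, encode, eos_token_id):
    # Toy template: "<role>: <content>\n"; mask is 1 on assistant turns.
    input_ids, assistant_mask = [], []
    for msg in messages:
        ids = encode(f"{msg['role']}: {msg['content']}\n")
        input_ids.extend(ids)
        assistant_mask.extend([int(msg["role"] == "assistant")] * len(ids))
    input_ids.append(eos_token_id)
    assistant_mask.append(1)  # learn to emit EOS after the last answer
    return input_ids, assistant_mask

encode = lambda s: [ord(c) for c in s]  # toy character-level "tokenizer"
ids, mask = format_chat_template(
    [{"role": "user", "content": "hi"},
     {"role": "assistant", "content": "hello"}],
    encode, eos_token_id=0)
```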