nemo_automodel.components.datasets.llm.formatting_utils#
Module Contents#
Functions#
| Function | Description |
|---|---|
| `_pad_to_seq_length` | Pad a sample to a specific sequence length. |
| `_add_pad_token` | Add pad token to tokenizer if not present. |
| `_has_chat_template` | Check if the tokenizer supports a chat template. |
| `_package_tokenized_example` | Package a tokenized example with proper masking and padding. |
| `format_prompt_completion` | Format a prompt-completion style example (without chat template). |
| `format_chat_template` | Format a chat template style example. |
Data#
API#
- nemo_automodel.components.datasets.llm.formatting_utils.logger#
getLogger(…)
- nemo_automodel.components.datasets.llm.formatting_utils.GENERATION_REGEX#
compile(…)
- nemo_automodel.components.datasets.llm.formatting_utils._pad_to_seq_length(sample, pad_token_id, seq_length)#
Pad a sample to a specific sequence length.
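A minimal sketch of what padding to a fixed sequence length typically looks like. The field names (`input_ids`, `labels`, `attention_mask`), right-padding, and the use of `-100` as the ignored-label value are assumptions based on common fine-tuning conventions, not taken from the source:

```python
def pad_to_seq_length(sample, pad_token_id, seq_length):
    """Right-pad each field of a tokenized sample up to seq_length (sketch)."""
    pad_len = seq_length - len(sample["input_ids"])
    if pad_len <= 0:
        return sample  # already at or beyond the target length
    return {
        "input_ids": sample["input_ids"] + [pad_token_id] * pad_len,
        # -100 is the conventional "ignore" index for cross-entropy loss
        "labels": sample["labels"] + [-100] * pad_len,
        # padded positions are excluded from attention
        "attention_mask": sample["attention_mask"] + [0] * pad_len,
    }

example = {"input_ids": [5, 6, 7], "labels": [-100, 6, 7], "attention_mask": [1, 1, 1]}
padded = pad_to_seq_length(example, pad_token_id=0, seq_length=5)
```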
- nemo_automodel.components.datasets.llm.formatting_utils._add_pad_token(tokenizer)#
Add pad token to tokenizer if not present.
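A hedged sketch of ensuring a tokenizer has a pad token. Falling back to the EOS token is a common convention for causal-LM tokenizers; the fallback choice and the `TinyTokenizer` stand-in are assumptions, not the module's actual behavior:

```python
class TinyTokenizer:
    """Stand-in for a transformers tokenizer (hypothetical)."""
    def __init__(self):
        self.pad_token = None
        self.eos_token = "</s>"

def add_pad_token(tokenizer):
    # If no pad token is configured, reuse the EOS token so padding is valid.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer

tok = add_pad_token(TinyTokenizer())
```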
- nemo_automodel.components.datasets.llm.formatting_utils._has_chat_template(
- tokenizer: transformers.PreTrainedTokenizer,
- )#
Check if the tokenizer supports a chat template.
- Parameters:
  tokenizer – The tokenizer to check.
- Returns:
  True if the tokenizer supports a chat template, False otherwise.
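A sketch of what such a check likely amounts to: recent `transformers` tokenizers expose a `chat_template` attribute that is `None` when no template is configured. The attribute-based test below is an assumption about the implementation, shown with stand-in classes instead of real tokenizers:

```python
def has_chat_template(tokenizer) -> bool:
    """Return True when the tokenizer carries a configured chat template (sketch)."""
    return getattr(tokenizer, "chat_template", None) is not None

class WithTemplate:
    # a trivial Jinja-style template, purely illustrative
    chat_template = "{% for m in messages %}{{ m['content'] }}{% endfor %}"

class WithoutTemplate:
    chat_template = None
```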
- nemo_automodel.components.datasets.llm.formatting_utils._package_tokenized_example(
- tokenizer,
- input_ids,
- assistant_masks,
- eos_token_id,
- pad_token_id,
- seq_length,
- truncation='do_not_truncate',
- padding='do_not_pad',
- )#
Package a tokenized example with proper masking and padding.
- Parameters:
  tokenizer – The tokenizer to use.
  input_ids – The tokenized input ids.
  assistant_masks – Boolean mask indicating which tokens are assistant/answer tokens (1) vs prompt tokens (0).
  eos_token_id – The end-of-sequence token id.
  pad_token_id – The padding token id.
  seq_length – Optional sequence length for padding.
  truncation – Optional truncation strategy.
  padding – Optional padding strategy.
- Returns:
A dictionary with input_ids, labels, and attention_mask.
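The masking step can be sketched as follows: labels copy `input_ids` wherever `assistant_masks` is 1 and use `-100` elsewhere, so the loss only covers assistant tokens. Appending a trainable EOS token and the simplified return shape are assumptions; the real function also applies the truncation/padding strategies, which are omitted here:

```python
def package_tokenized_example(input_ids, assistant_masks, eos_token_id):
    """Build labels from an assistant-token mask (sketch)."""
    if input_ids[-1] != eos_token_id:
        input_ids = input_ids + [eos_token_id]
        assistant_masks = assistant_masks + [1]  # let the model learn to emit EOS
    labels = [tok if m == 1 else -100 for tok, m in zip(input_ids, assistant_masks)]
    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": [1] * len(input_ids),
    }

packed = package_tokenized_example([10, 11, 12], [0, 0, 1], eos_token_id=2)
```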
- nemo_automodel.components.datasets.llm.formatting_utils.format_prompt_completion(
- tokenizer: transformers.PreTrainedTokenizer,
- prompt: str,
- answer: str,
- eos_token_id: int,
- pad_token_id: int,
- seq_length: Optional[int] = None,
- padding: Union[str, bool] = 'do_not_pad',
- truncation: Union[str, bool] = 'do_not_truncate',
- answer_only_loss_mask: bool = True,
- )#
Format a prompt-completion style example (without chat template).
- Parameters:
  tokenizer – The tokenizer to use.
  prompt – The prompt string (e.g. context + question).
  answer – The answer string.
  eos_token_id – The end-of-sequence token id.
  pad_token_id – The padding token id.
  seq_length – Optional sequence length for padding.
  padding – Optional padding strategy.
  truncation – Optional truncation strategy.
  answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.
- Returns:
A dictionary with the formatted example.
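An end-to-end sketch of prompt-completion formatting with a toy whitespace tokenizer (the real function uses a `transformers` tokenizer). With answer-only loss masking, prompt tokens get label `-100` so only the answer and EOS contribute to the loss; the vocabulary and helper below are hypothetical:

```python
VOCAB = {"<pad>": 0, "<unk>": 1, "</s>": 2}

def encode(text):
    """Toy whitespace tokenizer: assigns fresh ids to unseen words."""
    return [VOCAB.setdefault(w, len(VOCAB)) for w in text.split()]

def format_prompt_completion_sketch(prompt, answer, eos_token_id):
    prompt_ids = encode(prompt)
    answer_ids = encode(answer) + [eos_token_id]
    input_ids = prompt_ids + answer_ids
    # mask the prompt so the loss is computed on the answer only
    labels = [-100] * len(prompt_ids) + answer_ids
    return {"input_ids": input_ids, "labels": labels,
            "attention_mask": [1] * len(input_ids)}

ex = format_prompt_completion_sketch("Q: 2+2?", "A: 4", eos_token_id=2)
```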
- nemo_automodel.components.datasets.llm.formatting_utils.format_chat_template(
- tokenizer: transformers.PreTrainedTokenizer,
- formatted_text: List[Dict[str, str]],
- eos_token_id: int,
- pad_token_id: int,
- seq_length: Optional[int] = None,
- padding: Union[str, bool] = 'do_not_pad',
- truncation: Union[str, bool] = 'do_not_truncate',
- tools: Optional[List[Dict]] = None,
- answer_only_loss_mask: bool = True,
- )#
Format a chat template style example.
- Parameters:
  tokenizer – The tokenizer to use.
  formatted_text – The formatted text, with role tags embedded in the content.
  eos_token_id – The end-of-sequence token id.
  pad_token_id – The padding token id.
  seq_length – Optional sequence length for padding.
  padding – Optional padding strategy.
  truncation – Optional truncation strategy.
  tools – Optional list of tool definitions for function calling.
  answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.
- Returns:
A dictionary with the formatted example.
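A sketch of the chat-style case with a toy renderer; the real function delegates formatting to the tokenizer's chat template. Tokens from assistant turns keep their ids as labels (plus a trailing EOS), while user/system tokens are masked with `-100`. The toy vocabulary and message handling are assumptions:

```python
def format_chat_sketch(messages, eos_token_id, vocab):
    """Tokenize role-tagged messages and mask non-assistant tokens (sketch)."""
    input_ids, labels = [], []
    for msg in messages:
        ids = [vocab.setdefault(w, len(vocab)) for w in msg["content"].split()]
        if msg["role"] == "assistant":
            ids = ids + [eos_token_id]  # close the assistant turn
            labels += ids               # train on assistant tokens
        else:
            labels += [-100] * len(ids)  # ignore prompt-side tokens in the loss
        input_ids += ids
    return {"input_ids": input_ids, "labels": labels,
            "attention_mask": [1] * len(input_ids)}

vocab = {"<pad>": 0, "<unk>": 1, "</s>": 2}
chat = format_chat_sketch(
    [{"role": "user", "content": "hello there"},
     {"role": "assistant", "content": "hi"}],
    eos_token_id=2, vocab=vocab)
```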