nemo_automodel.components.datasets.llm.formatting_utils

Module Contents
Functions

_pad_to_seq_length — Pad a sample to a specific sequence length.
_add_pad_token — Add pad token to tokenizer if not present.
_has_chat_template — Check if the tokenizer supports a chat template.
_package_tokenized_example — Package a tokenized example with proper masking and padding.
format_prompt_completion — Format a prompt-completion style example (without chat template).
format_chat_template — Format a chat template style example.
API
- nemo_automodel.components.datasets.llm.formatting_utils._pad_to_seq_length(sample, pad_token_id, seq_length)
Pad a sample to a specific sequence length.
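A minimal sketch of what such a padding helper might do (the function name, return shape, and attention-mask handling below are assumptions, not the actual implementation): extend the token ids to `seq_length` with the pad token and record which positions are real tokens.

```python
def pad_to_seq_length(input_ids, pad_token_id, seq_length):
    # Number of pad tokens needed; zero if the sample is already long enough.
    pad_len = max(0, seq_length - len(input_ids))
    # 1 marks real tokens, 0 marks padding.
    attention_mask = [1] * len(input_ids) + [0] * pad_len
    return input_ids + [pad_token_id] * pad_len, attention_mask
```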
- nemo_automodel.components.datasets.llm.formatting_utils._add_pad_token(tokenizer)
Add pad token to tokenizer if not present.
- nemo_automodel.components.datasets.llm.formatting_utils._has_chat_template(tokenizer: transformers.PreTrainedTokenizer)
Check if the tokenizer supports a chat template.
- Parameters:
tokenizer – The tokenizer to check.
- Returns:
True if the tokenizer supports a chat template, False otherwise.
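Such a check plausibly inspects the tokenizer's `chat_template` attribute, which Hugging Face tokenizers set to `None` (or leave absent) when no template is configured. The sketch below is an assumption about the logic, not the library's actual code; the stub tokenizer stands in for a real `transformers.PreTrainedTokenizer`.

```python
def has_chat_template(tokenizer) -> bool:
    # A tokenizer "supports" a chat template when the attribute exists
    # and is set to a non-None template string.
    return getattr(tokenizer, "chat_template", None) is not None

class StubTokenizer:
    # Minimal stand-in for a tokenizer with a Jinja chat template.
    chat_template = "{% for m in messages %}{{ m['content'] }}{% endfor %}"

class StubTokenizerNoTemplate:
    chat_template = None
```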
- nemo_automodel.components.datasets.llm.formatting_utils._package_tokenized_example(has_chat_template, input_ids, eos_token_id, pad_token_id, seq_length, context_len)
Package a tokenized example with proper masking and padding.
- Parameters:
has_chat_template – Whether the tokenizer has a chat template.
input_ids – The tokenized input ids.
eos_token_id – The end-of-sequence token id.
pad_token_id – The padding token id.
seq_length – Optional sequence length for padding.
context_len – Length of the context/prompt (to mask in labels).
- Returns:
A dictionary with input_ids, labels, and attention_mask.
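The parameters suggest a flow along these lines: append EOS, mask the first `context_len` positions in the labels (the conventional -100 ignore index for Hugging Face loss functions), then pad out to `seq_length`. This is a hypothetical re-implementation for illustration; the real helper also branches on `has_chat_template`, which is omitted here.

```python
IGNORE_INDEX = -100  # label value that HF cross-entropy losses skip

def package_tokenized_example(input_ids, eos_token_id, pad_token_id,
                              seq_length, context_len):
    # Append the end-of-sequence token so the model learns to stop.
    ids = input_ids + [eos_token_id]
    # Mask the context/prompt tokens so loss is computed only on the answer.
    labels = [IGNORE_INDEX] * context_len + ids[context_len:]
    # Pad to seq_length when one is given.
    pad_len = max(0, seq_length - len(ids)) if seq_length else 0
    return {
        "input_ids": ids + [pad_token_id] * pad_len,
        "labels": labels + [IGNORE_INDEX] * pad_len,
        "attention_mask": [1] * len(ids) + [0] * pad_len,
    }
```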
- nemo_automodel.components.datasets.llm.formatting_utils.format_prompt_completion(tokenizer: transformers.PreTrainedTokenizer, prompt: str, answer: str, eos_token_id: int, pad_token_id: int, seq_length: Optional[int] = None, answer_only_loss_mask: bool = True)
Format a prompt-completion style example (without chat template).
- Parameters:
tokenizer – The tokenizer to use.
prompt – The prompt string (e.g. context + question).
answer – The answer string.
eos_token_id – The end-of-sequence token id.
pad_token_id – The padding token id.
seq_length – Optional sequence length for padding.
answer_only_loss_mask – If True, mask the prompt tokens in the labels so loss is computed only on the answer.
- Returns:
A dictionary with the formatted example.
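To make the prompt/answer split concrete, here is a self-contained sketch of the formatting this function presumably performs. The toy whitespace "tokenizer" replaces a real `transformers.PreTrainedTokenizer`, and the function body is an assumption based on the documented parameters, not the library's code.

```python
VOCAB = {}

def toy_encode(text):
    # Toy whitespace tokenizer: assigns each new word the next free id,
    # starting at 3 to avoid clashing with pad (0) and eos (2) below.
    return [VOCAB.setdefault(w, len(VOCAB) + 3) for w in text.split()]

def format_prompt_completion_sketch(prompt, answer, eos_token_id,
                                    pad_token_id, seq_length=None,
                                    answer_only_loss_mask=True):
    prompt_ids = toy_encode(prompt)
    ids = prompt_ids + toy_encode(answer) + [eos_token_id]
    # With answer_only_loss_mask, the whole prompt is excluded from the loss.
    context_len = len(prompt_ids) if answer_only_loss_mask else 0
    labels = [-100] * context_len + ids[context_len:]
    pad_len = max(0, (seq_length or 0) - len(ids))
    return {
        "input_ids": ids + [pad_token_id] * pad_len,
        "labels": labels + [-100] * pad_len,
        "attention_mask": [1] * len(ids) + [0] * pad_len,
    }
```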
- nemo_automodel.components.datasets.llm.formatting_utils.format_chat_template(tokenizer: transformers.PreTrainedTokenizer, formatted_text: List[Dict[str, str]], eos_token_id: int, pad_token_id: int, seq_length: Optional[int] = None, start_of_turn_token: Optional[str] = None, tools: Optional[List[Dict]] = None)
Format a chat template style example.
- Parameters:
tokenizer – The tokenizer to use.
formatted_text – The formatted text, with role tags embedded in the content.
eos_token_id – The end-of-sequence token id.
pad_token_id – The padding token id.
seq_length – Optional sequence length for padding.
start_of_turn_token – The start of turn token string.
tools – Optional list of tool definitions passed to the chat template.
- Returns:
A dictionary with the formatted example.
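One plausible role for `start_of_turn_token` is locating the final (assistant) turn in the rendered conversation so that everything before it can be masked out of the loss. The helper below is a hypothetical illustration of that idea, operating on token ids where the start-of-turn marker has a known id; it is not the library's implementation.

```python
def mask_before_last_turn(input_ids, start_of_turn_id):
    # Find the last occurrence of the start-of-turn marker; everything up to
    # and including it belongs to earlier turns and is excluded from the loss.
    last = max(i for i, t in enumerate(input_ids) if t == start_of_turn_id)
    return [-100] * (last + 1) + input_ids[last + 1:]
```

In a two-turn exchange, only the tokens after the final marker (the assistant's reply) keep their original label values.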