nemo_automodel.components.datasets.llm.formatting_utils#

Module Contents#

Functions#

_pad_to_seq_length

Pad a sample to a specific sequence length.

_add_pad_token

Add pad token to tokenizer if not present.

_has_chat_template

Check if the tokenizer supports a chat template.

_package_tokenized_example

Package a tokenized example with proper masking and padding.

format_prompt_completion

Format a prompt-completion style example (without chat template).

format_chat_template

Format a chat template style example.

Data#

API#

nemo_automodel.components.datasets.llm.formatting_utils.logger#

'getLogger(…)'

nemo_automodel.components.datasets.llm.formatting_utils.GENERATION_REGEX#

'compile(…)'

nemo_automodel.components.datasets.llm.formatting_utils._pad_to_seq_length(sample, pad_token_id, seq_length)#

Pad a sample to a specific sequence length.
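A minimal sketch of the padding behaviour described here, not the library's actual implementation: `input_ids` are extended with `pad_token_id`, `labels` with the ignore index `-100`, and `attention_mask` with `0` until the sample reaches `seq_length`. The dict keys follow the return shape documented for `_package_tokenized_example` below; the helper name is illustrative.

```python
def pad_to_seq_length(sample, pad_token_id, seq_length):
    """Right-pad a tokenized sample dict to seq_length (sketch)."""
    pad_len = seq_length - len(sample["input_ids"])
    if pad_len <= 0:
        return sample  # already at or beyond the target length
    sample["input_ids"] = sample["input_ids"] + [pad_token_id] * pad_len
    # -100 is the conventional ignore index, so padding is excluded from the loss.
    sample["labels"] = sample["labels"] + [-100] * pad_len
    sample["attention_mask"] = sample["attention_mask"] + [0] * pad_len
    return sample


example = {"input_ids": [5, 6, 7], "labels": [-100, 6, 7], "attention_mask": [1, 1, 1]}
padded = pad_to_seq_length(example, pad_token_id=0, seq_length=6)
```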

nemo_automodel.components.datasets.llm.formatting_utils._add_pad_token(tokenizer)#

Add pad token to tokenizer if not present.

nemo_automodel.components.datasets.llm.formatting_utils._has_chat_template(
tokenizer: transformers.PreTrainedTokenizer,
) bool#

Check if the tokenizer supports a chat template.

Parameters:

tokenizer – The tokenizer to check.

Returns:

True if the tokenizer supports a chat template, False otherwise.
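A plausible sketch of this check, assuming it relies on the `chat_template` attribute that Hugging Face tokenizers expose; the dummy classes below are illustrative stand-ins for real tokenizers.

```python
def has_chat_template(tokenizer) -> bool:
    """Return True if the tokenizer carries a non-empty chat template (sketch)."""
    return bool(getattr(tokenizer, "chat_template", None))


class WithTemplate:
    # Jinja-style template string, as stored on HF tokenizers.
    chat_template = "{% for message in messages %}{{ message['content'] }}{% endfor %}"


class WithoutTemplate:
    pass


supports = has_chat_template(WithTemplate())
lacks = has_chat_template(WithoutTemplate())
```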

nemo_automodel.components.datasets.llm.formatting_utils._package_tokenized_example(
tokenizer,
input_ids,
assistant_masks,
eos_token_id,
pad_token_id,
seq_length,
truncation='do_not_truncate',
padding='do_not_pad',
)#

Package a tokenized example with proper masking and padding.

Parameters:
  • tokenizer – The tokenizer to use.

  • input_ids – The tokenized input ids.

  • assistant_masks – Boolean mask indicating which tokens are assistant/answer tokens (1) vs prompt tokens (0).

  • eos_token_id – The end-of-sequence token id.

  • pad_token_id – The padding token id.

  • seq_length – Optional sequence length for padding.

  • truncation – Optional truncation strategy.

  • padding – Optional padding strategy.

Returns:

A dictionary with input_ids, labels, and attention_mask.
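The packaging step can be sketched as follows. This is an assumption-laden illustration, not the library's code: it keeps assistant tokens (mask `1`) as labels, masks prompt tokens (mask `0`) with `-100`, appends `eos_token_id` if missing, and optionally right-pads to `seq_length`, matching the return shape documented above.

```python
def package_tokenized_example(input_ids, assistant_masks, eos_token_id,
                              pad_token_id, seq_length=None):
    """Build input_ids/labels/attention_mask from a mask over tokens (sketch)."""
    if input_ids[-1] != eos_token_id:
        # Ensure the sequence ends with EOS and include it in the loss.
        input_ids = input_ids + [eos_token_id]
        assistant_masks = assistant_masks + [1]
    # Prompt tokens get -100 so only assistant/answer tokens contribute to the loss.
    labels = [tok if m == 1 else -100 for tok, m in zip(input_ids, assistant_masks)]
    attention_mask = [1] * len(input_ids)
    if seq_length is not None:
        pad_len = seq_length - len(input_ids)
        input_ids = input_ids + [pad_token_id] * pad_len
        labels = labels + [-100] * pad_len
        attention_mask = attention_mask + [0] * pad_len
    return {"input_ids": input_ids, "labels": labels, "attention_mask": attention_mask}


out = package_tokenized_example(
    input_ids=[1, 2, 3], assistant_masks=[0, 0, 1],
    eos_token_id=9, pad_token_id=0, seq_length=6,
)
```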

nemo_automodel.components.datasets.llm.formatting_utils.format_prompt_completion(
tokenizer: transformers.PreTrainedTokenizer,
prompt: str,
answer: str,
eos_token_id: int,
pad_token_id: int,
seq_length: Optional[int] = None,
padding: Union[str, bool] = 'do_not_pad',
truncation: Union[str, bool] = 'do_not_truncate',
answer_only_loss_mask: bool = True,
) Dict[str, List[int]]#

Format a prompt-completion style example (without chat template).

Parameters:
  • tokenizer – The tokenizer to use.

  • prompt – The prompt string (e.g. context + question).

  • answer – The answer string.

  • eos_token_id – The end-of-sequence token id.

  • pad_token_id – The padding token id.

  • seq_length – Optional sequence length for padding.

  • padding – Optional padding strategy.

  • truncation – Optional truncation strategy.

  • answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.

Returns:

A dictionary with input_ids, labels, and attention_mask.
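A conceptual sketch of prompt-completion formatting with an answer-only loss mask, using a toy character-level tokenizer. The tokenizer class and helper name are illustrative, not part of this module; the real function operates on a `transformers.PreTrainedTokenizer`.

```python
class ToyTokenizer:
    """Illustrative character-level tokenizer standing in for a real one."""
    def __call__(self, text):
        return {"input_ids": [ord(c) % 100 for c in text]}


def format_prompt_completion_sketch(tokenizer, prompt, answer, eos_token_id,
                                    answer_only_loss_mask=True):
    prompt_ids = tokenizer(prompt)["input_ids"]
    answer_ids = tokenizer(answer)["input_ids"] + [eos_token_id]
    input_ids = prompt_ids + answer_ids
    if answer_only_loss_mask:
        # Loss is computed only over answer tokens; prompt tokens are masked.
        labels = [-100] * len(prompt_ids) + answer_ids
    else:
        labels = list(input_ids)
    return {"input_ids": input_ids, "labels": labels,
            "attention_mask": [1] * len(input_ids)}


sample = format_prompt_completion_sketch(ToyTokenizer(), "hi", "ok", eos_token_id=99)
```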

nemo_automodel.components.datasets.llm.formatting_utils.format_chat_template(
tokenizer: transformers.PreTrainedTokenizer,
formatted_text: List[Dict[str, str]],
eos_token_id: int,
pad_token_id: int,
seq_length: Optional[int] = None,
padding: Union[str, bool] = 'do_not_pad',
truncation: Union[str, bool] = 'do_not_truncate',
tools: Optional[List[Dict]] = None,
answer_only_loss_mask: bool = True,
) Dict[str, List[int]]#

Format a chat template style example.

Parameters:
  • tokenizer – The tokenizer to use.

  • formatted_text – The list of chat messages, each a dict with role and content keys.

  • eos_token_id – The end-of-sequence token id.

  • pad_token_id – The padding token id.

  • seq_length – Optional sequence length for padding.

  • padding – Optional padding strategy.

  • truncation – Optional truncation strategy.

  • tools – Optional list of tool definitions for function calling.

  • answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.

Returns:

A dictionary with input_ids, labels, and attention_mask.
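An illustrative example of the `formatted_text` argument: a list of role/content message dicts, the shape consumed by Hugging Face chat templates. The message contents are made up, and the commented-out call assumes a real tokenizer is available.

```python
# Hypothetical conversation; the roles shown are the ones HF chat
# templates conventionally recognize.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]

# With a real tokenizer, this would be passed as:
# format_chat_template(tokenizer, messages, eos_token_id=..., pad_token_id=...)
roles = [m["role"] for m in messages]
```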