nemo_automodel.datasets.utils#

Module Contents#

Classes#

SFTSingleTurnPreprocessor

Generic single-turn text-to-text SFT (supervised fine-tuning) pre-processor.

Functions#

batchify

Ensures that the input tensor has at least two dimensions by adding an extra batch dimension if necessary.

extract_key_from_dicts

Extracts the value of the given key from each dictionary in a list of dictionaries.

pad_within_micro

Pads each list in a batch of lists to the same length with a specified token.

default_collater

Default batch collator that handles padding and batching.

API#

nemo_automodel.datasets.utils.batchify(tensor)[source]#

Ensures that the input tensor has at least two dimensions by adding an extra batch dimension if necessary.

Parameters:

tensor (torch.Tensor) – The input tensor to be batchified.

Returns:

The tensor with an extra dimension added if it was originally 1-dimensional. Otherwise, the tensor is returned as-is.

Return type:

torch.Tensor
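The behavior can be sketched as follows; this is a minimal re-implementation for illustration, not the library's actual source:

```python
import torch

def batchify(tensor):
    # Add a leading batch dimension when given a 1-D tensor;
    # tensors that already have 2+ dimensions pass through unchanged.
    if tensor.ndim == 1:
        return tensor.unsqueeze(0)
    return tensor

x = torch.tensor([1, 2, 3])
print(batchify(x).shape)  # torch.Size([1, 3])
```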

nemo_automodel.datasets.utils.extract_key_from_dicts(batch, key)[source]#

Extracts the value of the given key from each dictionary in a list of dictionaries.

Parameters:
  • batch (List[dict]) – A list of dictionaries.

  • key (str) – The key whose values are to be extracted from each dictionary.

Returns:

A list of values associated with the specified key, in the same order as the dictionaries in the input batch.

Return type:

List
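Conceptually this is a single list comprehension over the batch; a sketch of the expected behavior:

```python
def extract_key_from_dicts(batch, key):
    # Collect batch[i][key] for every dict, preserving batch order.
    return [d[key] for d in batch]

batch = [{"input_ids": [1, 2]}, {"input_ids": [3]}]
print(extract_key_from_dicts(batch, "input_ids"))  # [[1, 2], [3]]
```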

nemo_automodel.datasets.utils.pad_within_micro(batch, pad_token_id, pad_seq_len_divisible=None)[source]#

Pads each list in a batch of lists to the same length with a specified token.

Parameters:
  • batch (List[List[int]]) – A batch of sequences (e.g., token IDs), where each sequence is a list of integers.

  • pad_token_id (int) – The token ID to use for padding shorter sequences.

  • pad_seq_len_divisible (Optional[int]) – If provided, round the padded sequence length up so that it is divisible by this value.

Returns:

A batch of sequences in which each inner list has been padded with the pad token to the length of the longest sequence in the batch (rounded up to a multiple of pad_seq_len_divisible, if provided).

Return type:

List[List[int]]
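The padding logic can be sketched as below; the divisibility rounding is useful for, e.g., tensor-core alignment. This is an illustrative re-implementation consistent with the documented behavior, not the library's source:

```python
def pad_within_micro(batch, pad_token_id, pad_seq_len_divisible=None):
    # Target length is the longest sequence in the micro-batch.
    max_len = max(len(seq) for seq in batch)
    # Optionally round the target length up to the next multiple
    # of pad_seq_len_divisible.
    if pad_seq_len_divisible is not None:
        max_len = ((max_len + pad_seq_len_divisible - 1)
                   // pad_seq_len_divisible) * pad_seq_len_divisible
    return [seq + [pad_token_id] * (max_len - len(seq)) for seq in batch]

print(pad_within_micro([[1, 2, 3], [4]], pad_token_id=0))
# [[1, 2, 3], [4, 0, 0]]
print(pad_within_micro([[1, 2, 3], [4]], pad_token_id=0, pad_seq_len_divisible=4))
# [[1, 2, 3, 0], [4, 0, 0, 0]]
```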

nemo_automodel.datasets.utils.default_collater(batch, pad_token_id=0, pad_seq_len_divisible=None)[source]#

Default batch collator that handles padding and batching.

Parameters:
  • batch – A batch of examples.

  • pad_token_id – The token ID to use for padding.

  • pad_seq_len_divisible – If provided, pad sequence length to be divisible by this value.

Returns:

A dictionary containing batched tensors.

Return type:

dict
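A collator of this shape typically pads each key's sequences within the batch and stacks them into tensors. The sketch below is an assumption about the behavior based on the helpers above (padding all keys with pad_token_id), not the actual implementation:

```python
import torch

def default_collater(batch, pad_token_id=0, pad_seq_len_divisible=None):
    out = {}
    for key in batch[0]:
        seqs = [example[key] for example in batch]
        # Pad within the batch, optionally rounding the length up.
        max_len = max(len(s) for s in seqs)
        if pad_seq_len_divisible is not None:
            max_len = ((max_len + pad_seq_len_divisible - 1)
                       // pad_seq_len_divisible) * pad_seq_len_divisible
        padded = [s + [pad_token_id] * (max_len - len(s)) for s in seqs]
        # Stack the now-equal-length rows into one tensor per key.
        out[key] = torch.tensor(padded)
    return out

batch = [{"input_ids": [1, 2, 3]}, {"input_ids": [4]}]
print(default_collater(batch)["input_ids"].shape)  # torch.Size([2, 3])
```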

class nemo_automodel.datasets.utils.SFTSingleTurnPreprocessor(tokenizer)[source]#

Generic single-turn text-to-text SFT (supervised fine-tuning) pre-processor.

Parameters:

tokenizer – Pre-trained Hugging Face tokenizer.

Initialization

SFTSingleTurnPreprocessor constructor.

Parameters:

tokenizer – Pretrained tokenizer.

_tokenize_function(examples, dataset)[source]#
_compute_dataset_max_len(tokenized_ds)[source]#
_pad_function(max_len)[source]#
process(raw_dataset, ds)[source]#

Main processor entry.

Parameters:
  • raw_dataset (datasets.DatasetDict) – The dataset (e.g. as returned by datasets.load_dataset).

  • ds (dataset) – The dataset object that provides a get_target method.

Returns:

tokenized + padded datasets (all splits preserved).

Return type:

datasets.DatasetDict
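The private helpers above hint at the pipeline: tokenize each example, compute the dataset-wide maximum token length, then pad every example to that length. A toy sketch of those three steps using a stand-in whitespace "tokenizer" (ToyTokenizer, the standalone process function, and get_target's call shape are all hypothetical, not the real HF or nemo_automodel API):

```python
class ToyTokenizer:
    # Stand-in for a pre-trained tokenizer: maps words to integer ids.
    def __init__(self):
        self.vocab = {}
        self.pad_token_id = 0

    def __call__(self, text):
        ids = [self.vocab.setdefault(w, len(self.vocab) + 1)
               for w in text.split()]
        return {"input_ids": ids}

def process(raw_dataset, tokenizer, get_target):
    # 1) Tokenize prompt + target for every example.
    tokenized = [tokenizer(ex["text"] + " " + get_target(ex))["input_ids"]
                 for ex in raw_dataset]
    # 2) Compute the dataset-wide maximum length.
    max_len = max(len(ids) for ids in tokenized)
    # 3) Pad every example to that length.
    return [ids + [tokenizer.pad_token_id] * (max_len - len(ids))
            for ids in tokenized]

data = [{"text": "translate: hello", "target": "bonjour"},
        {"text": "translate: good morning", "target": "bonjour"}]
padded = process(data, ToyTokenizer(), lambda ex: ex["target"])
print([len(p) for p in padded])  # [4, 4]
```

All splits of the real DatasetDict would be processed the same way; the sketch shows a single split for brevity.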