nemo_automodel.datasets.utils#

Module Contents#

Classes#

SFTSingleTurnPreprocessor

Generic single-turn text-to-text SFT (supervised fine-tuning) pre-processor.

Functions#

batchify

Ensures that the input tensor has at least two dimensions by adding an extra batch dimension if necessary.

extract_key_from_dicts

Extracts the value of the given key from each dictionary in a list of dictionaries.

pad_within_micro

Pads each list in a batch of lists to the same length with a specified token.

default_collater

Default batch collator that handles padding and batching.

API#

nemo_automodel.datasets.utils.batchify(tensor)[source]#

Ensures that the input tensor has at least two dimensions by adding an extra batch dimension if necessary.

Parameters:

tensor (torch.Tensor) – The input tensor to be batchified.

Returns:

The tensor with an extra dimension added if it was originally 1-dimensional. Otherwise, the tensor is returned as-is.

Return type:

torch.Tensor
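The behavior can be sketched as follows; this is a minimal re-implementation for illustration, not the library's actual source:

```python
import torch

def batchify(tensor):
    # Add a leading batch dimension when given a 1-D tensor;
    # tensors that already have 2+ dimensions pass through unchanged.
    if tensor.ndim == 1:
        return tensor.unsqueeze(0)
    return tensor

x = torch.tensor([1, 2, 3])
print(batchify(x).shape)  # torch.Size([1, 3])
```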

nemo_automodel.datasets.utils.extract_key_from_dicts(batch, key)[source]#

Extracts the value of the given key from each dictionary in a list of dictionaries.

Parameters:
  • batch (List[dict]) – A list of dictionaries.

  • key (str) – The key whose values are to be extracted from each dictionary.

Returns:

A list of values associated with the specified key, in the same order as the dictionaries in the input batch.

Return type:

List
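Conceptually this is a single list comprehension over the batch; a sketch of the expected behavior:

```python
def extract_key_from_dicts(batch, key):
    # Collect batch[i][key] for every dict, preserving batch order.
    return [d[key] for d in batch]

batch = [{"input_ids": [1, 2]}, {"input_ids": [3]}]
print(extract_key_from_dicts(batch, "input_ids"))  # [[1, 2], [3]]
```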

nemo_automodel.datasets.utils.pad_within_micro(batch, pad_token_id, pad_seq_len_divisible=None)[source]#

Pads each list in a batch of lists to the same length with a specified token.

Parameters:
  • batch (List[List[int]]) – A batch of sequences (e.g., token IDs), where each sequence is a list of integers.

  • pad_token_id (int) – The token ID to use for padding shorter sequences.

  • pad_seq_len_divisible (Optional[int]) – If provided, round the padded sequence length up so that it is divisible by this value.

Returns:

A batch of sequences in which each inner list has been padded with the pad token to the length of the longest sequence in the batch (rounded up to a multiple of pad_seq_len_divisible, if provided).

Return type:

List[List[int]]
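The padding logic can be sketched as below; the divisibility rounding is useful for, e.g., tensor-core alignment. This is an illustrative re-implementation consistent with the documented behavior, not the library's source:

```python
def pad_within_micro(batch, pad_token_id, pad_seq_len_divisible=None):
    # Target length is the longest sequence in the micro-batch.
    max_len = max(len(seq) for seq in batch)
    # Optionally round the target length up to the next multiple
    # of pad_seq_len_divisible.
    if pad_seq_len_divisible is not None:
        max_len = ((max_len + pad_seq_len_divisible - 1)
                   // pad_seq_len_divisible) * pad_seq_len_divisible
    return [seq + [pad_token_id] * (max_len - len(seq)) for seq in batch]

print(pad_within_micro([[1, 2, 3], [4]], pad_token_id=0))
# [[1, 2, 3], [4, 0, 0]]
print(pad_within_micro([[1, 2, 3], [4]], pad_token_id=0, pad_seq_len_divisible=4))
# [[1, 2, 3, 0], [4, 0, 0, 0]]
```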

nemo_automodel.datasets.utils.default_collater(batch, pad_token_id=0, pad_seq_len_divisible=None)[source]#

Default batch collator that handles padding and batching.

Parameters:
  • batch – A batch of examples.

  • pad_token_id – The token ID to use for padding.

  • pad_seq_len_divisible – If provided, pad sequence length to be divisible by this value.

Returns:

A dictionary containing batched tensors.

Return type:

dict
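A collator of this shape typically pads each key's sequences within the batch and stacks them into tensors. The sketch below is an assumption about the behavior based on the helpers above (padding all keys with pad_token_id), not the actual implementation:

```python
import torch

def default_collater(batch, pad_token_id=0, pad_seq_len_divisible=None):
    out = {}
    for key in batch[0]:
        seqs = [example[key] for example in batch]
        # Pad within the batch, optionally rounding the length up.
        max_len = max(len(s) for s in seqs)
        if pad_seq_len_divisible is not None:
            max_len = ((max_len + pad_seq_len_divisible - 1)
                       // pad_seq_len_divisible) * pad_seq_len_divisible
        padded = [s + [pad_token_id] * (max_len - len(s)) for s in seqs]
        # Stack the now-equal-length rows into one tensor per key.
        out[key] = torch.tensor(padded)
    return out

batch = [{"input_ids": [1, 2, 3]}, {"input_ids": [4]}]
print(default_collater(batch)["input_ids"].shape)  # torch.Size([2, 3])
```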

class nemo_automodel.datasets.utils.SFTSingleTurnPreprocessor(tokenizer)[source]#

Generic single-turn text-to-text SFT (supervised fine-tuning) pre-processor.

Parameters:

tokenizer – Pre-trained Hugging Face tokenizer.

Initialization

SFTSingleTurnPreprocessor constructor.

Parameters:

tokenizer – Pretrained tokenizer.

_tokenize_function(examples, dataset)[source]#
_compute_dataset_max_len(tokenized_ds)[source]#
_pad_function(max_len)[source]#
process(raw_dataset, ds)[source]#

Main processor entry.

Parameters:
  • raw_dataset (datasets.DatasetDict) – The dataset (e.g. as returned by datasets.load_dataset).

  • ds (dataset) – The dataset object that provides a get_target method.

Returns:

tokenized + padded datasets (all splits preserved).

Return type:

datasets.DatasetDict
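The private helpers above hint at the pipeline: tokenize each example, compute the dataset-wide maximum token length, then pad every example to that length. A toy sketch of those three steps using a stand-in whitespace "tokenizer" (ToyTokenizer, the standalone process function, and get_target's call shape are all hypothetical, not the real HF or nemo_automodel API):

```python
class ToyTokenizer:
    # Stand-in for a pre-trained tokenizer: maps words to integer ids.
    def __init__(self):
        self.vocab = {}
        self.pad_token_id = 0

    def __call__(self, text):
        ids = [self.vocab.setdefault(w, len(self.vocab) + 1)
               for w in text.split()]
        return {"input_ids": ids}

def process(raw_dataset, tokenizer, get_target):
    # 1) Tokenize prompt + target for every example.
    tokenized = [tokenizer(ex["text"] + " " + get_target(ex))["input_ids"]
                 for ex in raw_dataset]
    # 2) Compute the dataset-wide maximum length.
    max_len = max(len(ids) for ids in tokenized)
    # 3) Pad every example to that length.
    return [ids + [tokenizer.pad_token_id] * (max_len - len(ids))
            for ids in tokenized]

data = [{"text": "translate: hello", "target": "bonjour"},
        {"text": "translate: good morning", "target": "bonjour"}]
padded = process(data, ToyTokenizer(), lambda ex: ex["target"])
print([len(p) for p in padded])  # [4, 4]
```

All splits of the real DatasetDict would be processed the same way; the sketch shows a single split for brevity.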