nemo_automodel.datasets.utils#
Module Contents#
Classes#
SFTSingleTurnPreprocessor | Generic single-turn text-to-text supervised fine-tuning (SFT) pre-processor. |
Functions#
batchify | Ensures that the input tensor has at least two dimensions by adding an extra batch dimension if necessary. |
extract_key_from_dicts | Extracts the value of the given key from each dictionary in a list of dictionaries. |
pad_within_micro | Pads each list in a batch of lists to the same length with a specified token. |
default_collater | Default batch collator that handles padding and batching. |
API#
- nemo_automodel.datasets.utils.batchify(tensor)[source]#
Ensures that the input tensor has at least two dimensions by adding an extra batch dimension if necessary.
- Parameters:
tensor (torch.Tensor) – The input tensor to be batchified.
- Returns:
The tensor with an extra dimension added if it was originally 1-dimensional. Otherwise, the tensor is returned as-is.
- Return type:
torch.Tensor
- nemo_automodel.datasets.utils.extract_key_from_dicts(batch, key)[source]#
Extracts the value of the given key from each dictionary in a list of dictionaries.
- Parameters:
batch (List[dict]) – A list of dictionaries.
key (str) – The key whose values are to be extracted from each dictionary.
- Returns:
A list of values associated with the specified key, in the same order as the dictionaries in the input batch.
- Return type:
List
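The extraction described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the name `extract_key_sketch` and the sample batch are hypothetical, and the sketch assumes every dictionary contains the requested key.

```python
from typing import Any, Dict, List


def extract_key_sketch(batch: List[Dict[str, Any]], key: str) -> List[Any]:
    """Hypothetical stand-in that mirrors the documented behavior."""
    # Collect the value stored under `key` from each example, preserving order.
    return [example[key] for example in batch]


# Illustrative batch of two tokenized examples (made-up token IDs).
examples = [
    {"input_ids": [1, 2, 3], "labels": [2, 3, 4]},
    {"input_ids": [5, 6], "labels": [6, 7]},
]

input_ids = extract_key_sketch(examples, "input_ids")
print(input_ids)
```

The result keeps one entry per input dictionary, in the same order as the batch.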
- nemo_automodel.datasets.utils.pad_within_micro(batch, pad_token_id, pad_seq_len_divisible=None)[source]#
Pads each list in a batch of lists to the same length with a specified token.
- Parameters:
batch (List[List[int]]) – A batch of sequences (e.g., token IDs), where each sequence is a list of integers.
pad_token_id (int) – The token ID to use for padding shorter sequences.
pad_seq_len_divisible (int, optional) – If provided, sequences are padded so that the resulting length is divisible by this value. Defaults to None.
- Returns:
A batch of sequences where each inner list has been padded with the pad token to match the length of the longest sequence in the batch.
- Return type:
List[List[int]]
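The padding behavior described above can be sketched as follows. This is an illustrative re-implementation under stated assumptions, not the library's code: the name `pad_within_micro_sketch` is hypothetical, padding is assumed to be appended on the right, and `pad_seq_len_divisible` is assumed to round the target length up to the next multiple.

```python
from typing import List, Optional


def pad_within_micro_sketch(
    batch: List[List[int]],
    pad_token_id: int,
    pad_seq_len_divisible: Optional[int] = None,
) -> List[List[int]]:
    """Hypothetical stand-in that mirrors the documented behavior."""
    # Target length is the longest sequence in the batch.
    target_len = max(len(seq) for seq in batch)
    if pad_seq_len_divisible:
        # Assumption: round the target up to the next multiple (ceiling division).
        target_len = -(-target_len // pad_seq_len_divisible) * pad_seq_len_divisible
    # Assumption: pad tokens are appended on the right.
    return [seq + [pad_token_id] * (target_len - len(seq)) for seq in batch]


padded = pad_within_micro_sketch([[1, 2, 3], [4]], pad_token_id=0)
padded_to_4 = pad_within_micro_sketch([[1, 2, 3], [4]], pad_token_id=0, pad_seq_len_divisible=4)
print(padded)
print(padded_to_4)
```

Without `pad_seq_len_divisible`, both sequences end up with the length of the longest one (3). With a value of 4, both are extended to length 4.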
- nemo_automodel.datasets.utils.default_collater(batch, pad_token_id=0, pad_seq_len_divisible=None)[source]#
Default batch collator that handles padding and batching.
- Parameters:
batch – A batch of examples.
pad_token_id – The token ID to use for padding. Defaults to 0.
pad_seq_len_divisible – If provided, sequences are padded so that their length is divisible by this value. Defaults to None.
- Returns:
A dictionary containing batched tensors.
- Return type:
dict
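A rough picture of what such a collator does, sketched in plain Python. This is a simplified illustration, not the library's implementation: the name `collate_sketch` is hypothetical, the result holds padded lists rather than tensors, every field is assumed to be a list of integers, and every field (including labels) is assumed to be padded with `pad_token_id`, which the real function may handle differently.

```python
from typing import Dict, List, Optional


def collate_sketch(
    batch: List[Dict[str, List[int]]],
    pad_token_id: int = 0,
    pad_seq_len_divisible: Optional[int] = None,
) -> Dict[str, List[List[int]]]:
    """Hypothetical stand-in: gather each field, then pad it to a common length."""
    collated = {}
    for key in batch[0]:
        # Gather this field from every example in the batch.
        sequences = [example[key] for example in batch]
        target_len = max(len(seq) for seq in sequences)
        if pad_seq_len_divisible:
            # Assumption: round the target length up to the next multiple.
            target_len = -(-target_len // pad_seq_len_divisible) * pad_seq_len_divisible
        # Assumption: right-pad every field with the same pad token.
        collated[key] = [seq + [pad_token_id] * (target_len - len(seq)) for seq in sequences]
    return collated


examples = [
    {"input_ids": [1, 2, 3], "labels": [2, 3, 4]},
    {"input_ids": [5, 6], "labels": [6, 7]},
]
collated = collate_sketch(examples, pad_token_id=0)
print(collated)
```

The documented function returns batched tensors; converting the padded lists in this sketch to tensors would be the remaining step.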
- class nemo_automodel.datasets.utils.SFTSingleTurnPreprocessor(tokenizer)[source]#
Generic single-turn text-to-text supervised fine-tuning (SFT) pre-processor.
- Parameters:
tokenizer – Pre-trained Hugging Face tokenizer.
Initialization
SFTSingleTurnPreprocessor constructor.
- Parameters:
tokenizer – Pre-trained tokenizer.