nemo_automodel.components.datasets.utils#
Module Contents#
Classes#
- SFTSingleTurnPreprocessor – Generic single-turn text-to-text SFT (supervised fine-tuning) pre-processor.
Functions#
- batchify – Ensures that the input tensor has at least two dimensions by adding an extra batch dimension if necessary.
- extract_key_from_dicts – Extracts the value of the given key from each dictionary in a list of dictionaries.
- pad_within_micro – Pads each list in a batch of lists to the same length with a specified token.
- create_causal_mask_mapping – Create causal mask mapping for pipeline parallelism.
- add_causal_masks_to_batch – Add precomputed causal masks to an already-batched data dict.
- default_collater – Default batch collator that handles padding and batching.
- packed_sequence_thd_collater – Collater for packed sequences in THD (total, hidden, depth) format.
API#
- nemo_automodel.components.datasets.utils.batchify(tensor, default_tensor_cls=torch.LongTensor)#
Ensures that the input tensor has at least two dimensions by adding an extra batch dimension if necessary.
- Parameters:
tensor (torch.Tensor) – The input tensor to be batchified.
- Returns:
The tensor with an extra dimension added if it was originally 1-dimensional. Otherwise, the tensor is returned as-is.
- Return type:
torch.Tensor
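A minimal usage sketch based on the behavior documented above; the tensor values are illustrative:

```python
import torch
from nemo_automodel.components.datasets.utils import batchify

ids = torch.tensor([101, 2023, 2003, 102])       # 1-D tensor of shape [4]
batched = batchify(ids)                          # gains a batch dimension: shape [1, 4]

already_2d = torch.zeros(2, 8, dtype=torch.long)
unchanged = batchify(already_2d)                 # already 2-D, returned as-is: shape [2, 8]
```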
- nemo_automodel.components.datasets.utils.extract_key_from_dicts(batch, key)#
Extracts the value of the given key from each dictionary in a list of dictionaries.
- Parameters:
batch (List[dict]) – A list of dictionaries.
key (str) – The key whose values are to be extracted from each dictionary.
- Returns:
A list of values associated with the specified key, in the same order as the dictionaries in the input batch.
- Return type:
List
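A short usage sketch following the description above; the dictionaries are illustrative:

```python
from nemo_automodel.components.datasets.utils import extract_key_from_dicts

batch = [
    {"input_ids": [1, 2, 3], "labels": [1, 2, 3]},
    {"input_ids": [4, 5], "labels": [4, 5]},
]
input_ids = extract_key_from_dicts(batch, "input_ids")
# -> [[1, 2, 3], [4, 5]], in the same order as the input batch
```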
- nemo_automodel.components.datasets.utils.pad_within_micro(batch, pad_token_id, pad_seq_len_divisible=None)#
Pads each list in a batch of lists to the same length with a specified token.
- Parameters:
batch (List[List[int]]) – A batch of sequences (e.g., token IDs), where each sequence is a list of integers.
pad_token_id (int) – The token ID to use for padding shorter sequences.
pad_seq_len_divisible (int, optional) – If provided, the padded sequence length is rounded so that it is divisible by this value.
- Returns:
A batch of sequences where each inner list has been padded with the pad token to match the length of the longest sequence in the batch.
- Return type:
List[List[int]]
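A hedged sketch of both padding modes; it assumes pad_seq_len_divisible rounds the padded length up to the next multiple of the given value:

```python
from nemo_automodel.components.datasets.utils import pad_within_micro

batch = [[1, 2, 3, 4, 5], [6, 7]]

padded = pad_within_micro(batch, pad_token_id=0)
# -> [[1, 2, 3, 4, 5], [6, 7, 0, 0, 0]]  (both rows padded to the longest length, 5)

padded_div = pad_within_micro(batch, pad_token_id=0, pad_seq_len_divisible=8)
# -> every row padded to length 8, the next multiple of 8 at or above the longest length
```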
- nemo_automodel.components.datasets.utils.find_last_non_pad_token(lst: list[int], value: int) → int | None#
- nemo_automodel.components.datasets.utils.get_pad_token_from_key(val: str, pad_token_ids: Optional[dict[str, int]] = None)#
- nemo_automodel.components.datasets.utils.make_attention_mask_from_labels(ids: list[int], ignore_token: int = -100)#
- nemo_automodel.components.datasets.utils.create_causal_mask_mapping(model_config, batch_size, seq_len, position_ids=None, attention_mask=None, device=None)#
Create causal mask mapping for pipeline parallelism.
This is the core mask-creation logic shared by the different collate functions; it is factored out here to avoid duplication.
- Parameters:
model_config – HuggingFace model config
batch_size – Batch size
seq_len – Sequence length
position_ids – Optional position IDs tensor [batch_size, seq_len]
attention_mask – Optional 2D attention mask tensor [batch_size, seq_len] for padding
device – Device to create tensors on (defaults to CPU)
- Returns:
Mapping of mask types to 4D mask tensors: "full_attention": [batch_size, 1, seq_len, seq_len]; "sliding_attention": [batch_size, 1, seq_len, seq_len] (present if the model uses a sliding window)
- Return type:
dict
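A usage sketch based on the documented parameters and return mapping; GPT2Config is used only as an illustrative HuggingFace config (it does not use a sliding window, so only the full-attention mask is expected):

```python
import torch
from transformers import GPT2Config
from nemo_automodel.components.datasets.utils import create_causal_mask_mapping

config = GPT2Config()                 # illustrative HF model config
masks = create_causal_mask_mapping(
    model_config=config,
    batch_size=2,
    seq_len=16,
    device=torch.device("cpu"),
)

full = masks["full_attention"]        # expected shape: [2, 1, 16, 16]
# "sliding_attention" appears only for models whose config uses a sliding window
```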
- nemo_automodel.components.datasets.utils.add_causal_masks_to_batch(batch_dict, model_config)#
Add precomputed causal masks to an already-batched data dict.
This function is designed for datasets that yield complete batches (like MockIterableDataset), where we want to add mask precomputation as a separate processing step.
- Parameters:
batch – A dict or list containing a single batched dict with tensors:
input_ids: [batch_size, seq_length]
position_ids: [batch_size, seq_length] (optional)
labels: [batch_size, seq_length]
model_config – HuggingFace model config for creating causal masks
precompute_masks – If False, skip mask creation (for compatibility with the train_ft.py wrapper)
- Returns:
Same batch with added causal_mask_mapping field
- Return type:
dict
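A hedged sketch of adding masks to a pre-batched dict; the tensor contents and the GPT2Config are illustrative:

```python
import torch
from transformers import GPT2Config
from nemo_automodel.components.datasets.utils import add_causal_masks_to_batch

batch = {
    "input_ids": torch.randint(0, 1000, (2, 16)),   # [batch_size, seq_length]
    "labels": torch.randint(0, 1000, (2, 16)),
}
out = add_causal_masks_to_batch(batch, model_config=GPT2Config())
# the returned dict carries the original tensors plus a "causal_mask_mapping" entry
```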
- nemo_automodel.components.datasets.utils.default_collater(batch, pad_seq_len_divisible=None)#
Default batch collator that handles padding and batching.
- Parameters:
batch – A batch of examples.
pad_seq_len_divisible – If provided, pad the sequence length to be divisible by this value.
- Returns:
A dictionary containing batched tensors.
- Return type:
dict
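A hedged sketch, assuming the examples carry input_ids and labels lists; the exact pad values used for inputs and labels follow the library's defaults rather than anything shown here:

```python
from nemo_automodel.components.datasets.utils import default_collater

examples = [
    {"input_ids": [1, 2, 3], "labels": [1, 2, 3]},
    {"input_ids": [4, 5], "labels": [4, 5]},
]
batch = default_collater(examples, pad_seq_len_divisible=8)
# batch["input_ids"] and batch["labels"] are batched tensors padded to a common
# length (a multiple of 8 here, given pad_seq_len_divisible=8)
```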
- nemo_automodel.components.datasets.utils.packed_sequence_thd_collater(batch)#
Collater for packed sequences in THD (total, hidden, depth) format.
This collater is designed for THD format, where multiple variable-length sequences are concatenated with/without padding tokens between them. The THD format represents sequences as (total_tokens, hidden_dim, depth) where total_tokens is the sum of all sequence lengths in the batch.
Unlike traditional padding-based approaches (BSHD/SBHD formats), this THD format:
- Concatenates sequences directly: [a a a b b c c c c]
- Uses seq_lens to identify sequence boundaries for attention computation
- Supports optional identifier or padding tokens between sequences via seq_lens_padded
This collater supports both pipeline parallelism (PP) and non-PP use cases by:
- Stacking token-level tensors (input_ids, labels, position_ids) along the batch dimension
- Padding and stacking seq_lens and seq_lens_padded with the sentinel value -1000
- Including "qkv_format": "thd" in the output to indicate THD format
IMPORTANT: All examples in the batch must have the same token sequence length for input_ids, labels, and position_ids. This is typically ensured by the dataset/packing logic that creates fixed-length packed sequences.
- Parameters:
batch (List[dict]) – A list of dictionaries, where each dictionary represents one packed example. Each dictionary should contain:
"input_ids": List[int] – Token IDs for all packed sequences (must be same length across batch)
"labels": List[int] – Labels for all packed sequences (must be same length across batch)
"position_ids": List[int] – Position IDs for all tokens (must be same length across batch)
"seq_lens": List[int] – Actual sequence lengths for each packed sequence
"seq_lens_padded": List[int] – Sequence lengths including identifier/padding tokens
Example batch with 2 examples, both with 6 total tokens:
[
  {
    "input_ids": [1, 2, 3, 99, 4, 5],   # Two sequences: [1,2,3] and [4,5] with sep token 99
    "labels": [1, 2, 3, -100, 4, 5],
    "position_ids": [0, 1, 2, 0, 0, 1],
    "seq_lens": [3, 2],                 # Actual sequence lengths (excluding separator)
    "seq_lens_padded": [4, 2]           # Including separator token
  },
  {
    "input_ids": [6, 7, 99, 8, 9, 10],  # Two sequences with separator
    "labels": [6, 7, -100, 8, 9, 10],
    "position_ids": [0, 1, 0, 0, 1, 2],
    "seq_lens": [2, 3],
    "seq_lens_padded": [3, 3]
  }
]
- Returns:
A dictionary with batched tensors:
"input_ids": tensor of shape [batch_size, seq_len] – stacked token sequences
"labels": tensor of shape [batch_size, seq_len] – stacked labels
"position_ids": tensor of shape [batch_size, seq_len] – stacked position IDs
"seq_lens": tensor of shape [batch_size, max_num_packs] – padded sequence lengths
"seq_lens_padded": tensor of shape [batch_size, max_num_packs] – padded lengths with separators
"qkv_format": str – always "thd" to indicate THD format
Note: seq_lens and seq_lens_padded are padded with -1000 to handle variable number of packed sequences per example. These sentinel values should be filtered out before use.
- Return type:
dict
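A hedged usage sketch in the spirit of the docstring example above; the expected shapes and the -1000 sentinel follow the return description, and the data is illustrative:

```python
from nemo_automodel.components.datasets.utils import packed_sequence_thd_collater

# two packed examples, each 6 tokens long, with different numbers of packs
batch = [
    {"input_ids": [1, 2, 3, 99, 4, 5], "labels": [1, 2, 3, -100, 4, 5],
     "position_ids": [0, 1, 2, 0, 0, 1], "seq_lens": [3, 2], "seq_lens_padded": [4, 2]},
    {"input_ids": [6, 7, 8, 9, 10, 11], "labels": [6, 7, 8, 9, 10, 11],
     "position_ids": [0, 1, 2, 3, 4, 5], "seq_lens": [6], "seq_lens_padded": [6]},
]
out = packed_sequence_thd_collater(batch)

out["input_ids"].shape   # torch.Size([2, 6]) - stacked token sequences
out["seq_lens"]          # expected tensor([[3, 2], [6, -1000]]) given the -1000 sentinel
out["qkv_format"]        # "thd"
```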
- class nemo_automodel.components.datasets.utils.SFTSingleTurnPreprocessor(tokenizer)#
Generic single-turn text-to-text SFT (supervised fine-tuning) pre-processor.
- Parameters:
tokenizer – Pre-trained tokenizer (HF).
Initialization
SFTSingleTurnPreprocessor constructor.
- Parameters:
tokenizer – Pretrained tokenizer.
- _tokenize_function(examples, dataset)#
- _compute_dataset_max_len(tokenized_ds)#
- _pad_function(max_len)#
- process(raw_dataset, ds)#
Main processor entry.
- Parameters:
raw_dataset (datasets.DatasetDict) – The dataset (e.g., as returned by load_dataset).
ds (dataset) – The dataset object that provides a get_target method.
- Returns:
The tokenized and optionally padded dataset (all splits preserved).
- Return type:
datasets.DatasetDict
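A hedged sketch of wiring up the preprocessor; the dataset adapter below is hypothetical, since only the get_target requirement is documented here, and the tokenizer choice is illustrative:

```python
from transformers import AutoTokenizer
from nemo_automodel.components.datasets.utils import SFTSingleTurnPreprocessor

# Hypothetical dataset adapter: the docs above only state that `ds` must expose
# a get_target method, so the exact signature used here is an assumption.
class MyTextToTextDataset:
    def get_target(self, example):
        return example["target"]

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # any pre-trained HF tokenizer
pre = SFTSingleTurnPreprocessor(tokenizer)

# raw_dataset would be a datasets.DatasetDict, e.g. from datasets.load_dataset(...)
# tokenized = pre.process(raw_dataset, ds=MyTextToTextDataset())
```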