nemo_automodel.components.datasets.utils
nemo_automodel.components.datasets.utils
Module Contents
Classes
Functions
API
Generic single-turn text-to-text SFT (supervised-fine-tuning) pre-processor.
Parameters:
Pre-trained tokenizer (HF).
Main processor entry.
Parameters:
the dataset (e.g. returned by load_dataset)
the dataset with get_target method.
Returns:
datasets.DatasetDict: tokenized + optionally padded datasets (all splits preserved).
Convert an indexed attention mask to a 4D block-causal mask.
Parameters:
Integer tensor of shape [B, S] where each
position contains the 1-based index of the sub-sequence it
belongs to (0 = padding).
Returns: torch.Tensor
Bool tensor of shape [B, 1, S, S] suitable for
Add precomputed causal masks to an already-batched data dict.
This function is designed for datasets that yield complete batches (like MockIterableDataset), where we want to add mask precomputation as a separate processing step.
Parameters:
A dict or list containing a single batched dict with tensors:
- input_ids: [batch_size, seq_length]
- position_ids: [batch_size, seq_length] (optional)
- labels: [batch_size, seq_length]
HuggingFace model config for creating causal masks
If False, skip mask creation (for compatibility with train_ft.py wrapper)
Returns:
Same batch with added causal_mask_mapping field
Ensures that the input tensor has at least two dimensions by adding an extra batch dimension if necessary.
Parameters:
The input tensor to be batchified.
Returns:
torch.Tensor: The tensor with an extra dimension added if it was originally 1-dimensional.
Create causal mask mapping for pipeline parallelism.
This is the core mask creation logic that can be reused by different collate functions. Extracts common mask creation logic to avoid duplication between collate functions.
Parameters:
HuggingFace model config
Batch size
Sequence length
Optional position IDs tensor [batch_size, seq_len]
Optional 2D attention mask tensor [batch_size, seq_len] for padding
Device to create tensors on (defaults to cpu)
Returns:
Mapping of mask types to 4D mask tensors
- “full_attention”: [batch_size, 1, seq_len, seq_len]
- “sliding_attention”: [batch_size, 1, seq_len, seq_len] (if model uses sliding window)
Default batch collator that handles padding and batching.
Parameters:
A batch of examples.
If provided, pad sequence length to be divisible by this value.
Returns:
A dictionary containing batched tensors.
Extracts the value of the given key from each dictionary in a list of dictionaries.
Parameters:
A list of dictionaries.
The key whose values are to be extracted from each dictionary.
Returns:
A list of values associated with the specified key, in the same order as
Return the last non-padding index before a trailing padding run.
Return the default pad token id for a batch field name.
Build an attention mask from labels with trailing ignored positions.
Collater for neat-packed LLM sequences.
Stacks input_ids, labels, position_ids and converts the
indexed attention_mask to the format required by the attention backend.
For flash_attention_2: keeps the indexed 2D mask [B, S].
For sdpa / eager: converts to a 4D block-causal float mask.
Parameters:
List of sample dicts produced by neat_pack_dataset.
Attention backend ("flash_attention_2",
"sdpa", or "eager").
Returns: dict
Dict with batched tensors ready for model forward.
Collater for packed sequences in THD (total, hidden, depth) format.
This collater is designed for THD format, where multiple variable-length sequences are concatenated with/without padding tokens between them. The THD format represents sequences as (total_tokens, hidden_dim, depth) where total_tokens is the sum of all sequence lengths in the batch.
Unlike traditional padding-based approaches (BSHD/SBHD formats), this THD format:
- Concatenates sequences directly: [a a a b b c c c c]
- Uses seq_lens to identify sequence boundaries for attention computation
- Supports optional identifier or padding tokens between sequences via seq_lens_padded
This collater supports both pipeline parallelism (PP) and non-PP use cases by:
- Stacking token-level tensors (input_ids, labels, position_ids) along batch dimension
- Padding and stacking seq_lens and seq_lens_padded with sentinel value -1000
- Including ‘qkv_format’: ‘thd’ in the output to indicate THD format
When batch items lack packed-sequence metadata (seq_lens, seq_lens_padded, position_ids), such as samples from ChatDataset, this collater synthesizes the missing fields so that each sample is treated as a single-sequence “pack”. Variable-length sequences are padded to the longest length in the batch. This enables using THD format with TE context parallelism without requiring the dataset to perform actual sequence packing.
Parameters:
A list of dictionaries, where each dictionary represents one example.
For pre-packed data, each dictionary should contain:
- ‘input_ids’: List[int] - Token IDs for all packed sequences (must be same length across batch)
- ‘labels’: List[int] - Labels for all packed sequences (must be same length across batch)
- ‘position_ids’: List[int] - Position IDs for all tokens (must be same length across batch)
- ‘seq_lens’: List[int] - Actual sequence lengths for each packed sequence
- ‘seq_lens_padded’: List[int] - Sequence lengths including identifier/padding tokens
For non-packed data (e.g. ChatDataset), each dictionary needs only:
- ‘input_ids’: List[int] - Token IDs (variable length across batch)
- ‘labels’: List[int] - Labels (same length as input_ids)
- ‘attention_mask’: List[int] - (optional) 1 for real tokens, 0 for padding
Example batch with 2 packed examples, both with 6 total tokens: [ { ‘input_ids’: [1, 2, 3, 99, 4, 5], # Two sequences: [1,2,3] and [4,5] with sep token 99 ‘labels’: [1, 2, 3, -100, 4, 5], ‘position_ids’: [0, 1, 2, 0, 0, 1], ‘seq_lens’: [3, 2], # Actual sequence lengths (excluding separator) ‘seq_lens_padded’: [4, 2] # Including separator token }, { ‘input_ids’: [6, 7, 99, 8, 9, 10], # Two sequences with separator ‘labels’: [6, 7, -100, 8, 9, 10], ‘position_ids’: [0, 1, 0, 0, 1, 2], ‘seq_lens’: [2, 3], ‘seq_lens_padded’: [3, 3] } ]
Returns:
A dictionary with batched tensors:
- ‘input_ids’: tensor of shape [batch_size, seq_len] - stacked token sequences
- ‘labels’: tensor of shape [batch_size, seq_len] - stacked labels
- ‘position_ids’: tensor of shape [batch_size, seq_len] - stacked position IDs
- ‘seq_lens’: tensor of shape [batch_size, max_num_packs] - padded sequence lengths
- ‘seq_lens_padded’: tensor of shape [batch_size, max_num_packs] - padded lengths with separators
- ‘qkv_format’: str - Always ‘thd’ to indicate THD format
Pads each list in a batch of lists to the same length with a specified token.
Parameters:
A batch of sequences (e.g., token IDs), where each sequence is a list of integers.
The token ID to use for padding shorter sequences.
The value to use for padding sequence length so that it is divisible by pad_seq_len_divisible.
Returns:
List[List[int]]: A batch of sequences where each inner list has been padded with the pad