# nemo_automodel.components.datasets.utils

## Module Contents

### Classes

| Name                                                                                               | Description                                                                  |
| -------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------- |
| [`SFTSingleTurnPreprocessor`](#nemo_automodel-components-datasets-utils-SFTSingleTurnPreprocessor) | Generic single-turn text-to-text SFT (supervised-fine-tuning) pre-processor. |

### Functions

| Name                                                                                                             | Description                                                                                                |
| ---------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
| [`_indexed_mask_to_4d_block_causal`](#nemo_automodel-components-datasets-utils-_indexed_mask_to_4d_block_causal) | Convert an indexed attention mask to a 4D block-causal mask.                                               |
| [`add_causal_masks_to_batch`](#nemo_automodel-components-datasets-utils-add_causal_masks_to_batch)               | Add precomputed causal masks to an already-batched data dict.                                              |
| [`batchify`](#nemo_automodel-components-datasets-utils-batchify)                                                 | Ensures that the input tensor has at least two dimensions by adding an extra batch dimension if necessary. |
| [`create_causal_mask_mapping`](#nemo_automodel-components-datasets-utils-create_causal_mask_mapping)             | Create causal mask mapping for pipeline parallelism.                                                       |
| [`default_collater`](#nemo_automodel-components-datasets-utils-default_collater)                                 | Default batch collator that handles padding and batching.                                                  |
| [`extract_key_from_dicts`](#nemo_automodel-components-datasets-utils-extract_key_from_dicts)                     | Extracts the value of the given key from each dictionary in a list of dictionaries.                        |
| [`find_last_non_pad_token`](#nemo_automodel-components-datasets-utils-find_last_non_pad_token)                   | Return the last non-padding index before a trailing padding run.                                           |
| [`get_pad_token_from_key`](#nemo_automodel-components-datasets-utils-get_pad_token_from_key)                     | Return the default pad token id for a batch field name.                                                    |
| [`make_attention_mask_from_labels`](#nemo_automodel-components-datasets-utils-make_attention_mask_from_labels)   | Build an attention mask from labels with trailing ignored positions.                                       |
| [`neat_packed_collater`](#nemo_automodel-components-datasets-utils-neat_packed_collater)                         | Collater for neat-packed LLM sequences.                                                                    |
| [`packed_sequence_thd_collater`](#nemo_automodel-components-datasets-utils-packed_sequence_thd_collater)         | Collater for packed sequences in THD (total, hidden, depth) format.                                        |
| [`pad_within_micro`](#nemo_automodel-components-datasets-utils-pad_within_micro)                                 | Pads each list in a batch of lists to the same length with a specified token.                              |

### API

```python
class nemo_automodel.components.datasets.utils.SFTSingleTurnPreprocessor(
    tokenizer
)
```

Generic single-turn text-to-text SFT (supervised-fine-tuning) pre-processor.

**Parameters:**

`tokenizer`: Pre-trained Hugging Face tokenizer.

```python
nemo_automodel.components.datasets.utils.SFTSingleTurnPreprocessor._compute_dataset_max_len(
    tokenized_ds
)
```

```python
nemo_automodel.components.datasets.utils.SFTSingleTurnPreprocessor._pad_function(
    max_len
)
```

```python
nemo_automodel.components.datasets.utils.SFTSingleTurnPreprocessor._tokenize_function(
    examples,
    dataset
)
```

```python
nemo_automodel.components.datasets.utils.SFTSingleTurnPreprocessor.process(
    raw_dataset,
    ds
)
```

Main processor entry.

**Parameters:**

`raw_dataset`: The raw dataset (e.g. as returned by `load_dataset`).

`ds`: The dataset object that provides a `get_target` method.

**Returns:**

datasets.DatasetDict: tokenized + optionally padded datasets (all splits preserved).

```python
nemo_automodel.components.datasets.utils._indexed_mask_to_4d_block_causal(
    attention_mask: torch.Tensor
) -> torch.Tensor
```

Convert an indexed attention mask to a 4D block-causal mask.

**Parameters:**

`attention_mask`: Integer tensor of shape `[B, S]` where each position contains the 1-based index of the sub-sequence it belongs to (0 = padding).

**Returns:** `torch.Tensor`

Bool tensor of shape `[B, 1, S, S]` suitable for
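As an illustration of the conversion, here is a minimal pure-Python sketch that works on nested lists instead of tensors. The helper name `indexed_mask_to_block_causal` is hypothetical, and the rule it applies (a query may attend to a key only when both belong to the same non-padding sub-sequence and the key is not later than the query) is an assumption about what "block-causal" means here, not the library's code.

```python
def indexed_mask_to_block_causal(indexed_mask):
    """Sketch: [B][S] packing indices -> [B][1][S][S] booleans (nested lists)."""
    out = []
    for row in indexed_mask:
        seq_len = len(row)
        grid = [
            [
                # Allowed only within the same sub-sequence, never for padding
                # (index 0), and never towards a later position.
                row[q] != 0 and row[q] == row[k] and k <= q
                for k in range(seq_len)
            ]
            for q in range(seq_len)
        ]
        out.append([grid])  # singleton dimension shared by all heads
    return out


# One sample holding two packed sub-sequences and one padding position.
mask = indexed_mask_to_block_causal([[1, 1, 2, 2, 0]])
```

In this sketch the rows for padding positions are entirely `False`; how the real function treats those rows is not stated above.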

```python
nemo_automodel.components.datasets.utils.add_causal_masks_to_batch(
    batch_dict,
    model_config
)
```

Add precomputed causal masks to an already-batched data dict.

This function is designed for datasets that yield complete batches (like MockIterableDataset),
where we want to add mask precomputation as a separate processing step.

**Parameters:**

`batch_dict`: A dict, or a list containing a single batched dict, with tensors:

* input\_ids: \[batch\_size, seq\_length]
* position\_ids: \[batch\_size, seq\_length] (optional)
* labels: \[batch\_size, seq\_length]

`model_config`: HuggingFace model config used to create the causal masks.

If False, skip mask creation (for compatibility with train\_ft.py wrapper)

**Returns:**

Same batch with added causal\_mask\_mapping field

```python
nemo_automodel.components.datasets.utils.batchify(
    tensor,
    default_tensor_cls = torch.LongTensor
)
```

Ensures that the input tensor has at least two dimensions by adding an extra batch dimension if necessary.

**Parameters:**

`tensor`: The input tensor to be batchified.

**Returns:**

torch.Tensor: The tensor with an extra dimension added if it was originally 1-dimensional.

```python
nemo_automodel.components.datasets.utils.create_causal_mask_mapping(
    model_config,
    batch_size,
    seq_len,
    position_ids = None,
    attention_mask = None,
    device = None
)
```

Create causal mask mapping for pipeline parallelism.

This is the core mask-creation logic, factored out so that different collate functions can reuse it instead of duplicating it.

**Parameters:**

`model_config`: HuggingFace model config.

`batch_size`: Batch size.

`seq_len`: Sequence length.

`position_ids`: Optional position IDs tensor \[batch\_size, seq\_len].

`attention_mask`: Optional 2D attention mask tensor \[batch\_size, seq\_len] for padding.

`device`: Device to create tensors on (defaults to CPU).

**Returns:**

Mapping of mask types to 4D mask tensors:

* "full\_attention": \[batch\_size, 1, seq\_len, seq\_len]
* "sliding\_attention": \[batch\_size, 1, seq\_len, seq\_len] (if model uses sliding window)
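To make the shape of the returned mapping concrete, the following pure-Python sketch builds the two mask types as nested lists. The function `causal_masks` and its `sliding_window` argument are hypothetical stand-ins: the real function reads its settings from `model_config` and also accepts `position_ids`, `attention_mask`, and `device`, none of which are modelled here. The window convention (a query sees itself plus the `sliding_window - 1` positions before it) is likewise an assumption.

```python
def causal_masks(batch_size, seq_len, sliding_window=None):
    """Sketch: map mask type -> [B][1][S][S] nested lists of booleans."""

    def build(window):
        masks = []
        for _ in range(batch_size):
            grid = [
                [
                    # Causal: the key must not be later than the query.
                    # Sliding: the key must also fall inside the window.
                    k <= q and (window is None or q - k < window)
                    for k in range(seq_len)
                ]
                for q in range(seq_len)
            ]
            masks.append([grid])
        return masks

    mapping = {"full_attention": build(None)}
    if sliding_window is not None:
        # Only emitted when a sliding window is configured.
        mapping["sliding_attention"] = build(sliding_window)
    return mapping


masks = causal_masks(batch_size=2, seq_len=4, sliding_window=2)
```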

```python
nemo_automodel.components.datasets.utils.default_collater(
    batch,
    pad_seq_len_divisible = None
)
```

Default batch collator that handles padding and batching.

**Parameters:**

`batch`: A batch of examples.

`pad_seq_len_divisible`: If provided, pad the sequence length to be divisible by this value.

**Returns:**

A dictionary containing batched tensors.
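The padding-and-batching idea can be sketched in plain Python as follows. This is not the library's implementation: the helper `collate` and the `PAD_VALUES` table are hypothetical, the per-field pad values are assumptions, the sketch returns nested lists rather than tensors, and it leaves out `pad_seq_len_divisible`.

```python
# Hypothetical per-field pad values; the library's real defaults may differ.
PAD_VALUES = {"input_ids": 0, "labels": -100, "attention_mask": 0}


def collate(batch):
    """Sketch: pad every field to the longest example and group by key."""
    max_len = max(len(example["input_ids"]) for example in batch)
    return {
        key: [
            example[key] + [PAD_VALUES[key]] * (max_len - len(example[key]))
            for example in batch
        ]
        for key in batch[0]
    }


examples = [
    {"input_ids": [1, 2, 3], "labels": [1, 2, 3]},
    {"input_ids": [4], "labels": [4]},
]
collated = collate(examples)
```

Padding labels with an ignore value rather than a token ID keeps the padded positions out of the loss, which is why the sketch uses a different pad value per field.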

```python
nemo_automodel.components.datasets.utils.extract_key_from_dicts(
    batch,
    key
)
```

Extracts the value of the given key from each dictionary in a list of dictionaries.

**Parameters:**

`batch`: A list of dictionaries.

`key`: The key whose values are to be extracted from each dictionary.

**Returns:**

A list of values associated with the specified key, in the same order as the dictionaries in the batch.
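The behaviour described above amounts to turning a list of per-example dicts into one column of values. A minimal sketch, using a hypothetical helper name `extract_key` and assuming that a missing key simply raises `KeyError`:

```python
def extract_key(batch, key):
    """Sketch: collect batch[i][key] for every example, preserving order."""
    return [example[key] for example in batch]


rows = [
    {"input_ids": [1, 2], "labels": [1, 2]},
    {"input_ids": [3], "labels": [3]},
]
ids = extract_key(rows, "input_ids")
```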

```python
nemo_automodel.components.datasets.utils.find_last_non_pad_token(
    lst: list[int],
    value: int
) -> int | None
```

Return the last non-padding index before a trailing padding run.
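One way to read this description is sketched below in plain Python. The helper `last_non_pad_index` is hypothetical, and its edge cases are assumptions made for the sketch: it returns `None` both when the list has no trailing padding run and when the list consists only of padding.

```python
def last_non_pad_index(tokens, pad_value):
    """Sketch: index of the last element before the trailing run of pad_value."""
    i = len(tokens) - 1
    saw_padding = False
    # Walk backwards over the trailing padding run.
    while i >= 0 and tokens[i] == pad_value:
        saw_padding = True
        i -= 1
    if not saw_padding or i < 0:
        # No trailing run at all, or nothing but padding.
        return None
    return i


idx = last_non_pad_index([5, 6, 7, 0, 0], pad_value=0)
```

Padding values that appear before the last real token, as in `[0, 5, 0]`, are not part of the trailing run and do not affect the result.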

```python
nemo_automodel.components.datasets.utils.get_pad_token_from_key(
    val: str,
    pad_token_ids: typing.Optional[dict[str, int]] = None
) -> int | None
```

Return the default pad token id for a batch field name.

```python
nemo_automodel.components.datasets.utils.make_attention_mask_from_labels(
    ids: list[int],
    ignore_token: int = -100
) -> list[int]
```

Build an attention mask from labels with trailing ignored positions.
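A rough sketch of the idea, assuming that only the ignored positions at the end of the label list are treated as padding while ignored positions earlier in the list (for example a masked prompt) remain attended. The helper `attention_mask_from_labels` is hypothetical, and the exact boundary handling in the library may differ.

```python
def attention_mask_from_labels(labels, ignore_token=-100):
    """Sketch: 1 for every position except the trailing run of ignore_token."""
    trailing = 0
    # Count the ignored positions at the end of the list.
    for label in reversed(labels):
        if label != ignore_token:
            break
        trailing += 1
    return [1] * (len(labels) - trailing) + [0] * trailing


attn = attention_mask_from_labels([-100, 5, 6, -100, -100])
```

In this sketch a list made up entirely of ignored labels yields an all-zero mask.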

```python
nemo_automodel.components.datasets.utils.neat_packed_collater(
    batch: list[dict],
    attn_implementation: str = 'sdpa'
) -> dict
```

Collater for neat-packed LLM sequences.

Stacks `input_ids`, `labels`, `position_ids` and converts the
indexed `attention_mask` to the format required by the attention backend.

For `flash_attention_2`: keeps the indexed 2D mask `[B, S]`.
For `sdpa` / `eager`: converts to a 4D block-causal float mask.

**Parameters:**

`batch`: List of sample dicts produced by `neat_pack_dataset`.

`attn_implementation`: Attention backend (`"flash_attention_2"`, `"sdpa"`, or `"eager"`).

**Returns:** `dict`

Dict with batched tensors ready for model forward.
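The backend-dependent mask handling can be illustrated with the sketch below, which operates on nested lists. The helper `prepare_attention_mask` is hypothetical. Two details are assumptions rather than documented behaviour: the float mask is additive (0.0 where attention is allowed, negative infinity elsewhere), and unknown backends are rejected with `ValueError`.

```python
NEG_INF = float("-inf")


def prepare_attention_mask(indexed_mask, attn_implementation="sdpa"):
    """Sketch of the per-backend mask handling on nested lists."""
    if attn_implementation == "flash_attention_2":
        # Keep the indexed 2D mask of shape [B, S] untouched.
        return indexed_mask
    if attn_implementation not in ("sdpa", "eager"):
        # Rejecting unknown backends is a choice made for this sketch.
        raise ValueError(f"unsupported backend: {attn_implementation}")
    # Build a [B, 1, S, S] additive mask: 0.0 = attend, -inf = blocked.
    return [
        [[
            [
                0.0 if row[q] != 0 and row[q] == row[k] and k <= q else NEG_INF
                for k in range(len(row))
            ]
            for q in range(len(row))
        ]]
        for row in indexed_mask
    ]


indexed = [[1, 1, 2, 0]]
kept = prepare_attention_mask(indexed, "flash_attention_2")
dense = prepare_attention_mask(indexed, "sdpa")
```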

```python
nemo_automodel.components.datasets.utils.packed_sequence_thd_collater(
    batch
)
```

Collater for packed sequences in THD (total, hidden, depth) format.

This collater is designed for THD format, where multiple variable-length
sequences are concatenated, optionally with padding tokens between them. The THD format represents
sequences as (total\_tokens, hidden\_dim, depth), where total\_tokens is the sum of all sequence
lengths in the batch.

Unlike traditional padding-based approaches (BSHD/SBHD formats), this THD format:

* Concatenates sequences directly: \[a a a b b c c c c]
* Uses seq\_lens to identify sequence boundaries for attention computation
* Supports optional identifier or padding tokens between sequences via seq\_lens\_padded

This collater supports both pipeline parallelism (PP) and non-PP use cases by:

* Stacking token-level tensors (input\_ids, labels, position\_ids) along batch dimension
* Padding and stacking seq\_lens and seq\_lens\_padded with sentinel value -1000
* Including 'qkv\_format': 'thd' in the output to indicate THD format

When batch items lack packed-sequence metadata (seq\_lens, seq\_lens\_padded, position\_ids),
such as samples from ChatDataset, this collater synthesizes the missing fields so that each
sample is treated as a single-sequence "pack". Variable-length sequences are padded to the
longest length in the batch. This enables using THD format with TE context parallelism
without requiring the dataset to perform actual sequence packing.

**Parameters:**

`batch`: A list of dictionaries, where each dictionary represents one example.

For pre-packed data, each dictionary should contain:

* 'input\_ids': List\[int] - Token IDs for all packed sequences (must be same length across batch)
* 'labels': List\[int] - Labels for all packed sequences (must be same length across batch)
* 'position\_ids': List\[int] - Position IDs for all tokens (must be same length across batch)
* 'seq\_lens': List\[int] - Actual sequence lengths for each packed sequence
* 'seq\_lens\_padded': List\[int] - Sequence lengths including identifier/padding tokens

For non-packed data (e.g. ChatDataset), each dictionary needs only:

* 'input\_ids': List\[int] - Token IDs (variable length across batch)
* 'labels': List\[int] - Labels (same length as input\_ids)
* 'attention\_mask': List\[int] - (optional) 1 for real tokens, 0 for padding

Example batch with 2 packed examples, both with 6 total tokens:

    [
        {
            'input_ids': [1, 2, 3, 99, 4, 5],  # Two sequences: [1,2,3] and [4,5] with sep token 99
            'labels': [1, 2, 3, -100, 4, 5],
            'position_ids': [0, 1, 2, 0, 0, 1],
            'seq_lens': [3, 2],  # Actual sequence lengths (excluding separator)
            'seq_lens_padded': [4, 2]  # Including separator token
        },
        {
            'input_ids': [6, 7, 99, 8, 9, 10],  # Two sequences with separator
            'labels': [6, 7, -100, 8, 9, 10],
            'position_ids': [0, 1, 0, 0, 1, 2],
            'seq_lens': [2, 3],
            'seq_lens_padded': [3, 3]
        }
    ]

**Returns:**

A dictionary with batched tensors:

* 'input\_ids': tensor of shape \[batch\_size, seq\_len] - stacked token sequences
* 'labels': tensor of shape \[batch\_size, seq\_len] - stacked labels
* 'position\_ids': tensor of shape \[batch\_size, seq\_len] - stacked position IDs
* 'seq\_lens': tensor of shape \[batch\_size, max\_num\_packs] - padded sequence lengths
* 'seq\_lens\_padded': tensor of shape \[batch\_size, max\_num\_packs] - padded lengths with separators
* 'qkv\_format': str - Always 'thd' to indicate THD format
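The handling of `seq_lens` is the least obvious part of this collater, so here is a small pure-Python sketch of it. The helper `collate_seq_lens` is hypothetical and covers only that one field: it pads each sample's list to the largest pack count with the sentinel -1000 mentioned above, and it treats a sample without `seq_lens` as a single-sequence pack. Using the unpadded `input_ids` length for such samples is an assumption of the sketch.

```python
SENTINEL = -1000  # padding value for seq_lens, as stated above


def collate_seq_lens(batch):
    """Sketch: pad each sample's seq_lens to the widest pack count."""
    per_sample = []
    for example in batch:
        # A sample without packing metadata counts as one sequence.
        per_sample.append(example.get("seq_lens", [len(example["input_ids"])]))
    width = max(len(lens) for lens in per_sample)
    return [lens + [SENTINEL] * (width - len(lens)) for lens in per_sample]


packed = [
    {"input_ids": [1, 2, 3, 99, 4, 5], "seq_lens": [3, 2]},
    {"input_ids": [6, 7, 8, 9, 10, 11], "seq_lens": [6]},
]
packed_lens = collate_seq_lens(packed)

unpacked = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}]
unpacked_lens = collate_seq_lens(unpacked)
```

A negative sentinel cannot be mistaken for a real length, so downstream code can drop the padded entries before computing sequence boundaries.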

```python
nemo_automodel.components.datasets.utils.pad_within_micro(
    batch,
    pad_token_id,
    pad_seq_len_divisible = None
)
```

Pads each list in a batch of lists to the same length with a specified token.

**Parameters:**

`batch`: A batch of sequences (e.g., token IDs), where each sequence
is a list of integers.

`pad_token_id`: The token ID to use for padding shorter sequences.

`pad_seq_len_divisible`: If provided, the padded sequence length is rounded up so that it is
divisible by this value.

**Returns:**

List\[List\[int]]: A batch of sequences where each inner list has been padded with the pad token to the same length.
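The padding rule, including the optional rounding, can be sketched in a few lines of plain Python. The helper `pad_batch` is hypothetical, and the sketch assumes that padding is appended on the right.

```python
def pad_batch(batch, pad_token_id, pad_seq_len_divisible=None):
    """Sketch: right-pad every sequence to a shared target length."""
    target = max(len(seq) for seq in batch)
    if pad_seq_len_divisible:
        # Round the target up to the next multiple (ceiling division).
        target = -(-target // pad_seq_len_divisible) * pad_seq_len_divisible
    return [seq + [pad_token_id] * (target - len(seq)) for seq in batch]


plain = pad_batch([[1, 2, 3], [4]], pad_token_id=0)
rounded = pad_batch([[1, 2, 3], [4]], pad_token_id=0, pad_seq_len_divisible=4)
```

Rounding the length up to a fixed multiple is commonly done to keep tensor shapes friendly to hardware kernels; the sketch does not depend on that motivation.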