nemo_automodel.components.datasets.utils

Module Contents

Classes

Name	Description
`SFTSingleTurnPreprocessor`	Generic single-turn text-to-text SFT (supervised-fine-tuning) pre-processor.

Functions

Name	Description
`_indexed_mask_to_4d_block_causal`	Convert an indexed attention mask to a 4D block-causal mask.
`add_causal_masks_to_batch`	Add precomputed causal masks to an already-batched data dict.
`batchify`	Ensures that the input tensor has at least two dimensions by adding an extra batch dimension if necessary.
`create_causal_mask_mapping`	Create causal mask mapping for pipeline parallelism.
`default_collater`	Default batch collator that handles padding and batching.
`extract_key_from_dicts`	Extracts the value of the given key from each dictionary in a list of dictionaries.
`find_last_non_pad_token`	Return the last non-padding index before a trailing padding run.
`get_pad_token_from_key`	Return the default pad token id for a batch field name.
`make_attention_mask_from_labels`	Build an attention mask from labels with trailing ignored positions.
`neat_packed_collater`	Collater for neat-packed LLM sequences.
`packed_sequence_thd_collater`	Collater for packed sequences in THD (total, hidden, depth) format.
`pad_within_micro`	Pads each list in a batch of lists to the same length with a specified token.

API

class nemo_automodel.components.datasets.utils.SFTSingleTurnPreprocessor(
    tokenizer
)

Generic single-turn text-to-text SFT (supervised-fine-tuning) pre-processor.

Parameters:

tokenizer

Pre-trained tokenizer (HF).

preprocessing_num_workers = 1

nemo_automodel.components.datasets.utils.SFTSingleTurnPreprocessor._compute_dataset_max_len(
    tokenized_ds
)

nemo_automodel.components.datasets.utils.SFTSingleTurnPreprocessor._pad_function(
    max_len
)

nemo_automodel.components.datasets.utils.SFTSingleTurnPreprocessor._tokenize_function(
    examples,
    dataset
)

nemo_automodel.components.datasets.utils.SFTSingleTurnPreprocessor.process(
    raw_dataset,
    ds
)

Main processor entry.

Parameters:

raw_dataset

datasets.DatasetDict

The raw dataset (e.g. as returned by load_dataset).

ds

The dataset object that provides a get_target method.

Returns:

datasets.DatasetDict: tokenized + optionally padded datasets (all splits preserved).

nemo_automodel.components.datasets.utils._indexed_mask_to_4d_block_causal(
    attention_mask: torch.Tensor
) -> torch.Tensor

Convert an indexed attention mask to a 4D block-causal mask.

Parameters:

attention_mask

torch.Tensor

Integer tensor of shape [B, S] where each position contains the 1-based index of the sub-sequence it belongs to (0 = padding).

Returns: torch.Tensor

Bool tensor of shape [B, 1, S, S] suitable for
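
The conversion can be illustrated for a single row of the indexed mask. The sketch below is a pure-Python stand-in, not the library's implementation: `block_causal_row_mask` is a hypothetical name, nested lists replace tensors, and it assumes that padding positions (index 0) neither attend nor are attended to.

```python
def block_causal_row_mask(indexed_row: list[int]) -> list[list[bool]]:
    """Build an [S, S] block-causal mask for one row of an indexed mask.

    Assumption: entry [q][k] is True when query position q may attend to
    key position k, i.e. both belong to the same non-padding sub-sequence
    and k does not come after q.
    """
    seq_len = len(indexed_row)
    return [
        [
            indexed_row[q] != 0                   # query is not padding
            and indexed_row[q] == indexed_row[k]  # same sub-sequence
            and k <= q                            # causal: no looking ahead
            for k in range(seq_len)
        ]
        for q in range(seq_len)
    ]


# Two packed sub-sequences (lengths 2 and 1) followed by one padding slot.
mask = block_causal_row_mask([1, 1, 2, 0])
```

Stacking one such matrix per batch row and adding a singleton head dimension would give the [B, 1, S, S] layout described above.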

nemo_automodel.components.datasets.utils.add_causal_masks_to_batch(
    batch_dict,
    model_config
)

Add precomputed causal masks to an already-batched data dict.

This function is designed for datasets that yield complete batches (like MockIterableDataset), where we want to add mask precomputation as a separate processing step.

Parameters:

batch_dict

A dict or list containing a single batched dict with tensors:

input_ids: [batch_size, seq_length]
position_ids: [batch_size, seq_length] (optional)
labels: [batch_size, seq_length]

model_config

HuggingFace model config for creating causal masks

precompute_masks

If False, skip mask creation (for compatibility with train_ft.py wrapper)

Returns:

Same batch with added causal_mask_mapping field

nemo_automodel.components.datasets.utils.batchify(
    tensor,
    default_tensor_cls = torch.LongTensor
)

Ensures that the input tensor has at least two dimensions by adding an extra batch dimension if necessary.

Parameters:

tensor

torch.Tensor

The input tensor to be batchified.

Returns:

torch.Tensor: The tensor with an extra dimension added if it was originally 1-dimensional.
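
As a rough analogy, the same idea can be shown with nested lists instead of tensors. `batchify_nested` is a hypothetical helper written for this illustration only; it does not model the `default_tensor_cls` argument.

```python
def batchify_nested(seq: list) -> list:
    """Give a flat list a leading batch axis; leave nested lists untouched."""
    # A list whose first element is not itself a list is treated as 1-D.
    is_flat = not seq or not isinstance(seq[0], list)
    return [seq] if is_flat else seq


single = batchify_nested([1, 2, 3])            # 1-D input gains a batch axis
already = batchify_nested([[1, 2], [3, 4]])    # 2-D input is returned as-is
```

With tensors, the equivalent step for a 1-dimensional input would be to add a leading dimension of size 1.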

nemo_automodel.components.datasets.utils.create_causal_mask_mapping(
    model_config,
    batch_size,
    seq_len,
    position_ids = None,
    attention_mask = None,
    device = None
)

Create causal mask mapping for pipeline parallelism.

This is the core mask-creation logic, factored out so that different collate functions can reuse it without duplication.

Parameters:

model_config

HuggingFace model config

batch_size

Batch size

seq_len

Sequence length

position_ids

Defaults to None

Optional position IDs tensor [batch_size, seq_len]

attention_mask

Defaults to None

Optional 2D attention mask tensor [batch_size, seq_len] for padding

device

Defaults to None

Device to create tensors on (defaults to cpu)

Returns:

Mapping of mask types to 4D mask tensors

"full_attention": [batch_size, 1, seq_len, seq_len]
"sliding_attention": [batch_size, 1, seq_len, seq_len] (if model uses sliding window)
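
To make the two mask types concrete, here is a minimal pure-Python sketch for a single sequence without batch or head dimensions. `causal_mask_mapping_sketch` and its `sliding_window` argument are hypothetical, and the window semantics (each query sees itself and the `sliding_window - 1` tokens before it) are an assumption rather than something this page states.

```python
from typing import Optional


def causal_mask_mapping_sketch(
    seq_len: int, sliding_window: Optional[int] = None
) -> dict[str, list[list[bool]]]:
    """Return boolean [S, S] masks keyed by attention type."""
    # Full causal attention: position q may see every position k <= q.
    mapping = {
        "full_attention": [
            [k <= q for k in range(seq_len)] for q in range(seq_len)
        ]
    }
    if sliding_window is not None:
        # Assumed window rule: causal, and at most `sliding_window` tokens back.
        mapping["sliding_attention"] = [
            [k <= q and q - k < sliding_window for k in range(seq_len)]
            for q in range(seq_len)
        ]
    return mapping


masks = causal_mask_mapping_sketch(seq_len=4, sliding_window=2)
```

In this sketch the "sliding_attention" entry is only present when a window is given, mirroring the "if model uses sliding window" note above.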

nemo_automodel.components.datasets.utils.default_collater(
    batch,
    pad_seq_len_divisible = None
)

Default batch collator that handles padding and batching.

Parameters:

batch

A batch of examples.

pad_seq_len_divisible

Defaults to None

If provided, pad sequence length to be divisible by this value.

Returns:

A dictionary containing batched tensors.
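
The padding-and-batching step can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `collate_sketch` and the `pad_values` mapping are hypothetical, the result holds nested lists rather than tensors, and the `pad_seq_len_divisible` option is left out.

```python
def collate_sketch(batch: list[dict], pad_values: dict[str, int]) -> dict:
    """Pad every field to the longest example and group values by key."""
    collated = {}
    for key in batch[0]:
        rows = [example[key] for example in batch]
        longest = max(len(row) for row in rows)
        # Each field is padded with its own value (hypothetical mapping).
        collated[key] = [
            row + [pad_values[key]] * (longest - len(row)) for row in rows
        ]
    return collated


examples = [
    {"input_ids": [5, 6, 7], "labels": [5, 6, 7]},
    {"input_ids": [8], "labels": [8]},
]
out = collate_sketch(examples, pad_values={"input_ids": 0, "labels": -100})
```

Using a separate pad value per field keeps padded label positions distinguishable from real tokens; the specific values 0 and -100 here are chosen for the example only.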

nemo_automodel.components.datasets.utils.extract_key_from_dicts(
    batch,
    key
)

Extracts the value of the given key from each dictionary in a list of dictionaries.

Parameters:

batch

List[dict]

A list of dictionaries.

key

str

The key whose values are to be extracted from each dictionary.

Returns:

A list of values associated with the specified key, in the same order as
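
A minimal sketch of this behavior, assuming every dictionary contains the key (how missing keys are handled is not stated here). `extract_key_sketch` is a hypothetical stand-in name.

```python
def extract_key_sketch(batch: list[dict], key: str) -> list:
    """Collect batch[i][key] for every example, preserving batch order."""
    return [example[key] for example in batch]


samples = [
    {"input_ids": [1, 2], "labels": [1, 2]},
    {"input_ids": [3], "labels": [3]},
]
input_ids = extract_key_sketch(samples, "input_ids")
```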

nemo_automodel.components.datasets.utils.find_last_non_pad_token(
    lst: list[int],
    value: int
) -> int | None

Return the last non-padding index before a trailing padding run.
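
One way to read this description is as a scan from the end of the list. The sketch below uses the hypothetical name `find_last_non_pad_sketch`; how the library treats a list with no trailing padding, or one made only of padding, is not stated here, so the choices in the comments are assumptions.

```python
from typing import Optional


def find_last_non_pad_sketch(lst: list[int], value: int) -> Optional[int]:
    """Return the index of the last element that precedes the trailing run of `value`."""
    i = len(lst) - 1
    # Skip over the trailing run of padding values.
    while i >= 0 and lst[i] == value:
        i -= 1
    # Assumption: an all-padding (or empty) list yields None.
    return i if i >= 0 else None


idx = find_last_non_pad_sketch([4, 5, 0, 6, 0, 0], value=0)
```

Only the trailing run is skipped, so a padding value that appears in the middle of the list does not affect the result.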

nemo_automodel.components.datasets.utils.get_pad_token_from_key(
    val: str,
    pad_token_ids: typing.Optional[dict[str, int]] = None
) -> int | None

Return the default pad token id for a batch field name.

nemo_automodel.components.datasets.utils.make_attention_mask_from_labels(
    ids: list[int],
    ignore_token: int = -100
) -> list[int]

Build an attention mask from labels with trailing ignored positions.
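
A plausible reading, sketched below under assumptions: positions in the trailing run of `ignore_token` are masked out with 0, and everything before that run is kept with 1, including ignored labels that are not at the end. `attention_mask_from_labels_sketch` is a hypothetical name, not the library function.

```python
def attention_mask_from_labels_sketch(
    ids: list[int], ignore_token: int = -100
) -> list[int]:
    """Mark all positions as attended except the trailing ignored run."""
    trailing = 0
    # Count how many ignored labels sit at the very end of the sequence.
    for token in reversed(ids):
        if token != ignore_token:
            break
        trailing += 1
    kept = len(ids) - trailing
    return [1] * kept + [0] * trailing


attn = attention_mask_from_labels_sketch([-100, 7, 8, -100, -100])
```

In this example the leading -100 stays attended because it is not part of the trailing run.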

nemo_automodel.components.datasets.utils.neat_packed_collater(
    batch: list[dict],
    attn_implementation: str = 'sdpa'
) -> dict

Collater for neat-packed LLM sequences.

Stacks input_ids, labels, position_ids and converts the indexed attention_mask to the format required by the attention backend.

For flash_attention_2: keeps the indexed 2D mask [B, S]. For sdpa / eager: converts to a 4D block-causal float mask.

Parameters:

batch

list[dict]

List of sample dicts produced by neat_pack_dataset.

attn_implementation

str

Defaults to 'sdpa'

Attention backend ("flash_attention_2", "sdpa", or "eager").

Returns: dict

Dict with batched tensors ready for model forward.
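
The backend-dependent mask handling can be sketched as a simple dispatch. This is an illustration only: `convert_packed_mask_sketch` is a hypothetical name, nested lists stand in for tensors, and the additive convention (0.0 where attention is allowed, negative infinity where it is blocked) is an assumption about what "float mask" means here.

```python
def convert_packed_mask_sketch(
    indexed_mask: list[list[int]], attn_implementation: str = "sdpa"
):
    """Keep the indexed [B, S] mask or expand it to a [B, 1, S, S] float mask."""
    if attn_implementation == "flash_attention_2":
        return indexed_mask  # indexed 2D mask is passed through unchanged

    blocked = float("-inf")  # assumed fill value for disallowed positions

    def allowed(row: list[int], q: int, k: int) -> bool:
        # Same non-padding sub-sequence, and key not after query.
        return row[q] != 0 and row[q] == row[k] and k <= q

    return [
        [  # singleton head dimension
            [
                [0.0 if allowed(row, q, k) else blocked for k in range(len(row))]
                for q in range(len(row))
            ]
        ]
        for row in indexed_mask
    ]


indexed = [[1, 1, 2]]
flash_mask = convert_packed_mask_sketch(indexed, "flash_attention_2")
sdpa_mask = convert_packed_mask_sketch(indexed, "sdpa")
```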

nemo_automodel.components.datasets.utils.packed_sequence_thd_collater(
    batch
)

Collater for packed sequences in THD (total, hidden, depth) format.

This collater is designed for THD format, where multiple variable-length sequences are concatenated, with or without padding tokens between them. The THD format represents sequences as (total_tokens, hidden_dim, depth), where total_tokens is the sum of all sequence lengths in the batch.

Unlike traditional padding-based approaches (BSHD/SBHD formats), this THD format:

Concatenates sequences directly: [a a a b b c c c c]
Uses seq_lens to identify sequence boundaries for attention computation
Supports optional identifier or padding tokens between sequences via seq_lens_padded

This collater supports both pipeline parallelism (PP) and non-PP use cases by:

Stacking token-level tensors (input_ids, labels, position_ids) along batch dimension
Padding and stacking seq_lens and seq_lens_padded with sentinel value -1000
Including 'qkv_format': 'thd' in the output to indicate THD format

When batch items lack packed-sequence metadata (seq_lens, seq_lens_padded, position_ids), such as samples from ChatDataset, this collater synthesizes the missing fields so that each sample is treated as a single-sequence “pack”. Variable-length sequences are padded to the longest length in the batch. This enables using THD format with TE context parallelism without requiring the dataset to perform actual sequence packing.

Parameters:

batch

List[dict]

A list of dictionaries, where each dictionary represents one example.

For pre-packed data, each dictionary should contain:

'input_ids': List[int] - Token IDs for all packed sequences (must be same length across batch)
'labels': List[int] - Labels for all packed sequences (must be same length across batch)
'position_ids': List[int] - Position IDs for all tokens (must be same length across batch)
'seq_lens': List[int] - Actual sequence lengths for each packed sequence
'seq_lens_padded': List[int] - Sequence lengths including identifier/padding tokens

For non-packed data (e.g. ChatDataset), each dictionary needs only:

'input_ids': List[int] - Token IDs (variable length across batch)
'labels': List[int] - Labels (same length as input_ids)
'attention_mask': List[int] - (optional) 1 for real tokens, 0 for padding

Example batch with 2 packed examples, both with 6 total tokens:

    [
        {
            'input_ids': [1, 2, 3, 99, 4, 5],  # Two sequences: [1,2,3] and [4,5] with sep token 99
            'labels': [1, 2, 3, -100, 4, 5],
            'position_ids': [0, 1, 2, 0, 0, 1],
            'seq_lens': [3, 2],  # Actual sequence lengths (excluding separator)
            'seq_lens_padded': [4, 2]  # Including separator token
        },
        {
            'input_ids': [6, 7, 99, 8, 9, 10],  # Two sequences with separator
            'labels': [6, 7, -100, 8, 9, 10],
            'position_ids': [0, 1, 0, 0, 1, 2],
            'seq_lens': [2, 3],
            'seq_lens_padded': [3, 3]
        }
    ]

Returns:

A dictionary with batched tensors:

'input_ids': tensor of shape [batch_size, seq_len] - stacked token sequences
'labels': tensor of shape [batch_size, seq_len] - stacked labels
'position_ids': tensor of shape [batch_size, seq_len] - stacked position IDs
'seq_lens': tensor of shape [batch_size, max_num_packs] - padded sequence lengths
'seq_lens_padded': tensor of shape [batch_size, max_num_packs] - padded lengths with separators
'qkv_format': str - Always 'thd' to indicate THD format
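
Two of the steps described above, padding `seq_lens` with the sentinel and synthesizing metadata for non-packed samples, can be sketched in plain Python. Both helper names are hypothetical, lists stand in for tensors, and the synthesis ignores the optional attention_mask and any batch-level padding, so it is a simplification rather than the collater's actual logic.

```python
SENTINEL = -1000  # padding value for seq_lens rows, as stated above


def pad_seq_lens_sketch(seq_lens_batch: list[list[int]]) -> list[list[int]]:
    """Pad each seq_lens row to max_num_packs so the rows can be stacked."""
    max_num_packs = max(len(row) for row in seq_lens_batch)
    return [
        row + [SENTINEL] * (max_num_packs - len(row)) for row in seq_lens_batch
    ]


def synthesize_single_pack_sketch(example: dict) -> dict:
    """Treat a non-packed example as a pack holding exactly one sequence."""
    n = len(example["input_ids"])
    return {
        **example,
        "position_ids": list(range(n)),  # positions 0..n-1 for the one sequence
        "seq_lens": [n],
        "seq_lens_padded": [n],
    }


padded = pad_seq_lens_sketch([[3, 2], [6]])
single = synthesize_single_pack_sketch({"input_ids": [1, 2, 3], "labels": [1, 2, 3]})
```

A sentinel far outside the range of valid lengths lets downstream code tell real sequence lengths from padding entries after the rows are stacked.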

nemo_automodel.components.datasets.utils.pad_within_micro(
    batch,
    pad_token_id,
    pad_seq_len_divisible = None
)

Pads each list in a batch of lists to the same length with a specified token.

Parameters:

batch

List[List[int]]

A batch of sequences (e.g., token IDs), where each sequence is a list of integers.

pad_token_id

int

The token ID to use for padding shorter sequences.

pad_seq_len_divisible

int

Defaults to None

If provided, sequences are padded so that the padded length is divisible by this value.

Returns:

List[List[int]]: A batch of sequences where each inner list has been padded with the pad
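
The rounding behavior of pad_seq_len_divisible can be sketched as below. `pad_within_micro_sketch` is a hypothetical re-implementation for illustration; it assumes the target length is the longest sequence rounded up to the next multiple of pad_seq_len_divisible.

```python
from typing import Optional


def pad_within_micro_sketch(
    batch: list[list[int]],
    pad_token_id: int,
    pad_seq_len_divisible: Optional[int] = None,
) -> list[list[int]]:
    """Right-pad every sequence to a common length."""
    target = max(len(seq) for seq in batch)
    if pad_seq_len_divisible:
        # Round the target length up to the next multiple (assumed behavior).
        multiple = pad_seq_len_divisible
        target = ((target + multiple - 1) // multiple) * multiple
    return [seq + [pad_token_id] * (target - len(seq)) for seq in batch]


plain = pad_within_micro_sketch([[1, 2, 3], [4]], pad_token_id=0)
rounded = pad_within_micro_sketch([[1, 2, 3], [4]], pad_token_id=0, pad_seq_len_divisible=4)
```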