
# nemo_automodel.components.distributed.thd_utils

## Module Contents

### Functions

| Name                                                                                                          | Description                                                                           |
| ------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
| [`process_input_for_thd`](#nemo_automodel-components-distributed-thd_utils-process_input_for_thd)             | Process inputs for THD (total tokens, heads, head dim) format.                        |
| [`split_batch_into_thd_chunks`](#nemo_automodel-components-distributed-thd_utils-split_batch_into_thd_chunks) | Process inputs for THD format by splitting batch into chunks for context parallelism. |

### API

```python
nemo_automodel.components.distributed.thd_utils.process_input_for_thd(
    batch: dict[str, torch.Tensor],
    seq_lens_padding_value: int = -1000,
    padding_token_id: int = 0
) -> dict[str, torch.Tensor]
```

Process inputs for THD (total tokens, heads, head dim) format.

This function converts batched inputs from BSHD (batch, sequence, heads, head dim) format
to THD format for packed sequence processing. In THD format, the batch dimension is
collapsed and all sequences are concatenated along the sequence dimension. Both 2D token
IDs and 3D embeddings (as used in pipeline parallelism scenarios) are supported.

The function filters out padding values in seq\_lens and seq\_lens\_padded (indicated by
seq\_lens\_padding\_value) and computes cumulative sequence lengths for efficient attention
computation with Transformer Engine or other packed sequence implementations.
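
As a rough illustration of the filtering and cumulative-sum step described above, the sketch below uses plain Python lists in place of tensors so that it runs without PyTorch. The helper name `cumulative_seq_lens` is hypothetical and not part of `thd_utils`; the real function operates on `torch.Tensor` inputs and may differ in detail.

```python
from itertools import accumulate

def cumulative_seq_lens(seq_lens, padding_value=-1000):
    """Flatten [batch_size, num_packs] lengths, drop sentinel entries, prefix-sum."""
    # Keep only real lengths; entries equal to padding_value are placeholders.
    real = [n for row in seq_lens for n in row if n != padding_value]
    # Leading 0 so that entries i and i+1 bound sequence i.
    return [0] + list(accumulate(real))

# Two packed rows: the first holds 3 sequences, the second holds 2 plus one sentinel.
seq_lens = [[4, 3, 5], [6, 2, -1000]]
cu_seqlens = cumulative_seq_lens(seq_lens)
print(cu_seqlens)  # [0, 4, 7, 12, 18, 20]
```

The result has `num_sequences + 1` entries, which matches the shape documented for `cu_seqlens`.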

**Parameters:**

`batch` (`dict[str, torch.Tensor]`): Dictionary containing:

* 'input\_ids': Input tensor of shape \[batch\_size, seq\_len] for token IDs or
  \[batch\_size, seq\_len, hidden\_dim] for embeddings (in pipeline parallel scenarios)
* 'labels': Labels tensor of shape \[batch\_size, seq\_len]
* 'position\_ids': Position IDs tensor of shape \[batch\_size, seq\_len] (required)
* 'seq\_lens': Sequence lengths tensor of shape \[batch\_size, num\_packs] containing
  actual sequence lengths (excluding padding/separators). Values matching
  seq\_lens\_padding\_value indicate padding and are filtered out.
* 'seq\_lens\_padded': Padded sequence lengths tensor of shape \[batch\_size, num\_packs]
  containing lengths including separator tokens. Values matching
  seq\_lens\_padding\_value indicate padding and are filtered out.

`seq_lens_padding_value` (`int`): Value used to mark padding entries in the
seq\_lens/seq\_lens\_padded tensors; matching entries are filtered out (default: -1000)

`padding_token_id` (`int`): Token ID used for padding in input\_ids to generate padding\_mask (default: 0)
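
To make the shape change and the role of the padding token concrete, here is a minimal sketch with nested lists standing in for a `[batch_size, seq_len]` tensor. The helper `flatten_batch` is hypothetical, and deriving the mask by simple equality with the padding token ID is an assumption about how `padding_mask` is built.

```python
def flatten_batch(input_ids, padding_token_id=0):
    """Collapse [batch_size, seq_len] token IDs to [total_tokens] and mark padding."""
    # Concatenate rows in order, which removes the batch dimension.
    flat = [tok for row in input_ids for tok in row]
    # Assumption: a position counts as padding when it equals padding_token_id.
    padding_mask = [tok == padding_token_id for tok in flat]
    return flat, padding_mask

input_ids = [[11, 12, 13, 0], [21, 22, 0, 0]]
flat_ids, padding_mask = flatten_batch(input_ids)
print(flat_ids)      # [11, 12, 13, 0, 21, 22, 0, 0]
print(padding_mask)  # [False, False, False, True, False, False, True, True]
```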

**Returns:** `dict[str, torch.Tensor]`

Dictionary containing:

* 'input\_ids': Reshaped tensor of shape \[total\_tokens] for 2D token IDs or
  \[total\_tokens, hidden\_dim] for 3D embeddings
* 'labels': Reshaped labels tensor of shape \[total\_tokens]
* 'position\_ids': Reshaped tensor of shape \[total\_tokens]
* 'cu\_seqlens': Cumulative REAL sequence lengths tensor of shape \[num\_sequences + 1] (int32)
  where num\_sequences is the total count of non-padded sequences across the batch.
  Built from seq\_lens (the unpadded real lengths). When the trailing pack-pad is
  purely at the end (cp\_size == 1), the last entry is grown to total\_tokens to absorb
  that pad and avoid TE's `pad_between_seqs=True` path; see the absorption block in
  the function body for the gate.
* 'cu\_seqlens\_padded': (optional) Cumulative PADDED sequence lengths tensor of the same
  shape as `cu_seqlens`. Only emitted when it differs from `cu_seqlens` after
  absorption (i.e., when padding lives between sub-sequences, which is the CP case).
  Forwarded to TE as `cu_seqlens_q_padded` / `cu_seqlens_kv_padded` with
  `pad_between_seqs=True` so the kernel reads memory offsets from the padded
  variant while attending only over the real-length slots.
* 'max\_seqlen': Scalar int32 tensor equal to `max(cu_seqlens[i+1] - cu_seqlens[i])`
  after any absorption. Honors TE's contract that
  `max_seqlen_q >= max(cu_seqlens_q[i+1] - cu_seqlens_q[i])`.
* 'padding\_mask': Boolean tensor of shape \[total\_tokens] indicating padding positions
* Non-tensor keys from input batch are preserved (e.g., 'qkv\_format')
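
The relationship between `cu_seqlens`, `cu_seqlens_padded`, and `max_seqlen` described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `finalize_cu_seqlens` is a hypothetical helper, lists stand in for int32 tensors, and the boolean `absorb_trailing_pad` is a stand-in for the actual gate in the function body.

```python
def finalize_cu_seqlens(cu_seqlens, cu_seqlens_padded, total_tokens, absorb_trailing_pad):
    """Return (cu_seqlens, cu_seqlens_padded_or_None, max_seqlen)."""
    cu = list(cu_seqlens)
    # Hypothetical gate: grow the last boundary so trailing pad joins the last sequence.
    if absorb_trailing_pad and len(cu) > 1:
        cu[-1] = total_tokens
    # Longest span between consecutive boundaries, measured after absorption.
    max_seqlen = max((hi - lo for lo, hi in zip(cu, cu[1:])), default=0)
    # Emit the padded variant only when it still differs from cu_seqlens.
    padded = list(cu_seqlens_padded)
    return cu, (padded if padded != cu else None), max_seqlen

# Padding only at the very end: absorbed, so no separate padded variant is needed.
trailing = finalize_cu_seqlens([0, 4, 7], [0, 4, 8], total_tokens=8, absorb_trailing_pad=True)
# Padding between sub-sequences: both variants are kept.
between = finalize_cu_seqlens([0, 3, 6], [0, 4, 8], total_tokens=8, absorb_trailing_pad=False)
print(trailing)  # ([0, 4, 8], None, 4)
print(between)   # ([0, 3, 6], [0, 4, 8], 3)
```

In the second case the padded boundaries would supply memory offsets while the real boundaries limit which slots are attended over.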

```python
nemo_automodel.components.distributed.thd_utils.split_batch_into_thd_chunks(
    batch: dict[str, torch.Tensor],
    num_chunks: int,
    seq_lens_padding_value: int = -1000,
    padding_token_id: int = 0
) -> dict[str, torch.Tensor]
```

Process inputs for THD format by splitting batch into chunks for context parallelism.

This function splits the batch along the batch dimension into num\_chunks chunks,
processes each chunk with process\_input\_for\_thd, and stacks the tensor results.
This is useful for context parallelism where different chunks are processed on
different devices/ranks.

The cu\_seqlens tensors from different chunks may have different lengths depending on
the number of sequences in each chunk. These are padded with seq\_lens\_padding\_value
to ensure uniform length across chunks for stacking.
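
The padding step described above can be sketched with plain lists. The helper `pad_to_uniform` is hypothetical; right-padding each chunk's `cu_seqlens` with the sentinel is an assumption about where the padding is placed.

```python
def pad_to_uniform(cu_seqlens_per_chunk, padding_value=-1000):
    """Right-pad each chunk's cu_seqlens so all chunks share one length."""
    width = max(len(cu) for cu in cu_seqlens_per_chunk)
    return [cu + [padding_value] * (width - len(cu)) for cu in cu_seqlens_per_chunk]

# Chunk 0 holds 3 sequences, chunk 1 holds 2, so their boundary lists differ in length.
chunks = [[0, 4, 7, 12], [0, 6, 8]]
stacked = pad_to_uniform(chunks)
print(stacked)  # [[0, 4, 7, 12], [0, 6, 8, -1000]]

# A consumer can recover the real boundaries of one chunk by dropping the sentinel.
recovered = [v for v in stacked[1] if v != -1000]
print(recovered)  # [0, 6, 8]
```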

**Parameters:**

`batch` (`dict[str, torch.Tensor]`): Dictionary containing input tensors with the same structure as for process\_input\_for\_thd:

* 'input\_ids': \[batch\_size, seq\_len] or \[batch\_size, seq\_len, hidden\_dim]
* 'labels': \[batch\_size, seq\_len]
* 'position\_ids': \[batch\_size, seq\_len] (required)
* 'seq\_lens': \[batch\_size, num\_packs]
* 'seq\_lens\_padded': \[batch\_size, num\_packs]

`num_chunks` (`int`): Number of chunks to split the batch into. Must evenly divide batch\_size.
If num\_chunks \<= 1, the result of process\_input\_for\_thd is returned directly.

`seq_lens_padding_value` (`int`): Value used to mark padding entries in the seq\_lens/seq\_lens\_padded
tensors and to pad cu\_seqlens to a uniform length (default: -1000)

`padding_token_id` (`int`): Token ID used for padding in input\_ids to generate padding\_mask (default: 0)
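
The constraint on `num_chunks` can be illustrated with a small sketch that splits batch rows into equal groups. The helper `split_rows` is hypothetical, and raising `ValueError` on an uneven split is an assumption; the source only states that `num_chunks` must evenly divide the batch size.

```python
def split_rows(rows, num_chunks):
    """Split batch rows into num_chunks equal, contiguous groups."""
    if num_chunks <= 1:
        # Mirrors the documented shortcut: no splitting takes place.
        return [rows]
    if len(rows) % num_chunks != 0:
        # Assumed error handling; the real function may react differently.
        raise ValueError(f"num_chunks={num_chunks} must evenly divide batch_size={len(rows)}")
    size = len(rows) // num_chunks
    return [rows[i * size:(i + 1) * size] for i in range(num_chunks)]

batch_rows = [[1, 2], [3, 4], [5, 6], [7, 8]]  # batch_size = 4
row_chunks = split_rows(batch_rows, num_chunks=2)
print(row_chunks)  # [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
```

Each group would then be processed independently and the per-chunk results stacked.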

**Returns:** `dict[str, torch.Tensor]`

Dictionary containing: