# nemo_automodel.components.loss.te_parallel_ce

## Module Contents

### Classes

| Name                                                                                              | Description                                                                                                            |
| ------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| [`CrossEntropyFunction`](#nemo_automodel-components-loss-te_parallel_ce-CrossEntropyFunction)     | Custom autograd function for the cross-entropy loss. Accepts BF16 or FP32 input; computes loss and gradients in FP32.  |
| [`TEParallelCrossEntropy`](#nemo_automodel-components-loss-te_parallel_ce-TEParallelCrossEntropy) | TransformerEngine tensor-parallel cross-entropy loss wrapper.                                                          |

### Data

[`HAVE_DTENSOR`](#nemo_automodel-components-loss-te_parallel_ce-HAVE_DTENSOR)

[`HAVE_TE_PARALLEL_CE`](#nemo_automodel-components-loss-te_parallel_ce-HAVE_TE_PARALLEL_CE)

[`MISSING_TE_PARALLEL_CE_MSG`](#nemo_automodel-components-loss-te_parallel_ce-MISSING_TE_PARALLEL_CE_MSG)

[`parallel_cross_entropy`](#nemo_automodel-components-loss-te_parallel_ce-parallel_cross_entropy)

### API

```python
class nemo_automodel.components.loss.te_parallel_ce.CrossEntropyFunction()
```

**Bases:** `Function`

This class implements a custom autograd function for the cross-entropy loss. The input tensor can be in BF16 or FP32;
the loss and gradient calculations happen in FP32 only. The returned loss is always in FP32, and the input gradients are
cast to the datatype of the input.

```python
nemo_automodel.components.loss.te_parallel_ce.CrossEntropyFunction.backward(
    ctx,
    grad_output
)
```

staticmethod

The backward pass of the Cross Entropy loss.

**Parameters:**

- `ctx`: The context object with saved tensors.
- `grad_output` (tensor): The tensor containing the gradient of the loss with respect to the output.

**Returns:**

- tuple: A tuple with the gradients with respect to the inputs. The elements are tensors or None.
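To make the gradient concrete, here is a minimal pure-Python sketch of the textbook cross-entropy gradient for a single token, `(softmax(logits) - one_hot(target)) * grad_output`, checked against finite differences. The function names are hypothetical, and the sketch illustrates only the math; it is not the Triton kernel this module wraps. In PyTorch, `backward` of a custom `Function` returns one entry per `forward` argument, which is why non-tensor arguments such as `label_smoothing` would correspond to `None` entries in the returned tuple.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Loss for one token: -log softmax(logits)[target]."""
    return -math.log(softmax(logits)[target])


def cross_entropy_grad(logits, target, grad_output=1.0):
    """Analytic gradient: (softmax - one_hot(target)) scaled by the upstream gradient."""
    probs = softmax(logits)
    return [(p - (1.0 if i == target else 0.0)) * grad_output
            for i, p in enumerate(probs)]


def numeric_grad(logits, target, eps=1e-6):
    """Central finite differences, used only to sanity-check the analytic form."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grads
```

Because the softmax probabilities sum to one, the entries of the gradient for a single token sum to zero, which is a quick sanity check when debugging a custom backward pass.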

```python
nemo_automodel.components.loss.te_parallel_ce.CrossEntropyFunction.forward(
    ctx,
    _input,
    target,
    label_smoothing = 0.0,
    reduce_loss = False,
    dist_process_group = None,
    ignore_idx = -100
)
```

staticmethod

The forward pass of the cross-entropy loss. If dist\_process\_group is passed for distributed loss calculation, the input on each
rank should have shape (\*, V/world\_size). Every rank must receive an equal shard along the V dimension.
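The sharding requirement can be illustrated with a small pure-Python sketch for a single token. It assumes the usual vocab-parallel recipe: combine per-shard maxima, combine per-shard exponential sums, and take the target logit from the shard that owns it. The list slices stand in for ranks and the Python `max`/`sum` calls stand in for collective reductions; the function names are hypothetical, and the actual kernel may organize the communication differently.

```python
import math


def full_cross_entropy(logits, target):
    """Reference loss over the whole vocabulary: logsumexp(logits) - logits[target]."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]


def sharded_cross_entropy(logits, target, world_size):
    """Same loss, computed from equal vocabulary shards (one shard per simulated rank)."""
    vocab = len(logits)
    assert vocab % world_size == 0, "each rank must hold an equal shard of V"
    shard = vocab // world_size
    shards = [logits[r * shard:(r + 1) * shard] for r in range(world_size)]

    # Step 1: local maxima combined into a global max (stands in for an all-reduce with max).
    global_max = max(max(s) for s in shards)
    # Step 2: local sums of exp(x - global_max) combined (stands in for an all-reduce with sum).
    global_sum = sum(sum(math.exp(x - global_max) for x in s) for s in shards)
    # Step 3: only the shard that owns the target index contributes the target logit.
    owner, local_idx = divmod(target, shard)
    target_logit = shards[owner][local_idx]
    return global_max + math.log(global_sum) - target_logit
```

The `divmod` in step 3 only maps a global target index to a shard cleanly when the shards are equal, which is one way to see why the equal-shard condition matters.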

**Parameters:**

- `ctx`: The context object.
- `_input` (tensor): The input tensor of shape (B, SQ, V) or (SQ, B, V), where B is batch size, SQ is sequence length, and V is vocab size.
- `target` (tensor): The target tensor of shape (B, SQ) or (SQ, B), where each value is in \[0, V-1].
- `label_smoothing` (float): The amount of smoothing when computing the loss, where 0.0 means no smoothing.
- `reduce_loss` (bool): If true, returns the averaged loss across the B\*SQ dimension.
- `dist_process_group` (torch.distributed.ProcessGroup): The distributed process group the loss computation is split across, or None if running on one device.
- `ignore_idx` (int): The target index for which the loss and gradients are set to zero.

**Returns:**

- tensor: The computed loss.
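As a rough illustration of how `label_smoothing` and `ignore_idx` affect the loss of a single token, the sketch below uses the common smoothing convention (the one PyTorch's `cross_entropy` follows): the loss is a blend of the negative log-likelihood of the target and the mean negative log-probability over the vocabulary. That this module's kernel matches the convention term for term is an assumption, and `token_loss` is a hypothetical name.

```python
import math


def token_loss(logits, target, label_smoothing=0.0, ignore_idx=-100):
    """Per-token cross entropy with label smoothing; ignored targets contribute zero."""
    if target == ignore_idx:
        # Ignored positions produce zero loss (and therefore zero gradient).
        return 0.0
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - lse for x in logits]
    nll = -log_probs[target]                        # standard negative log-likelihood
    smooth = -sum(log_probs) / len(log_probs)       # mean negative log-prob over the vocab
    return (1.0 - label_smoothing) * nll + label_smoothing * smooth
```

With `label_smoothing=0.0` this reduces to the plain negative log-likelihood of the target.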

```python
class nemo_automodel.components.loss.te_parallel_ce.TEParallelCrossEntropy(
    ignore_index: int = -100,
    reduction: str = 'sum',
    tp_group: typing.Optional[torch.distributed.ProcessGroup] = None
)
```

TransformerEngine tensor-parallel cross-entropy loss wrapper.

```python
nemo_automodel.components.loss.te_parallel_ce.TEParallelCrossEntropy.__call__(
    logits: torch.Tensor,
    labels: torch.Tensor,
    mask: typing.Optional[torch.Tensor] = None,
    num_label_tokens: typing.Optional[int] = None
) -> torch.Tensor
```

Compute parallel cross entropy loss that matches PyTorch's cross\_entropy behavior.

**Parameters:**

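- `logits` (torch.Tensor): Input logits. Shape: \[B, T, V]
- `labels` (torch.Tensor): Target labels. Shape: \[B, T]
- `mask` (Optional\[torch.Tensor]): Mask to apply to the loss. Shape: \[B, T]
- `num_label_tokens` (Optional\[int]): The number of non-padding tokens.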

**Returns:** `torch.Tensor`

Computed loss tensor
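A minimal usage sketch follows, built only from the signatures listed on this page. The helper name `tp_cross_entropy` is hypothetical, and treating `MISSING_TE_PARALLEL_CE_MSG` as a human-readable string suitable for an `ImportError` is an assumption. The import is deferred so the helper can be defined on machines where NeMo Automodel is not installed.

```python
def tp_cross_entropy(logits, labels, mask=None, num_label_tokens=None, tp_group=None):
    """Hypothetical helper: build the wrapper with its documented defaults and call it once."""
    # Deferred import: only needed when the helper is actually called.
    from nemo_automodel.components.loss import te_parallel_ce

    if not te_parallel_ce.HAVE_TE_PARALLEL_CE:
        # Assumption: the message constant is a string describing the missing dependency.
        raise ImportError(te_parallel_ce.MISSING_TE_PARALLEL_CE_MSG)

    loss_fn = te_parallel_ce.TEParallelCrossEntropy(
        ignore_index=-100,   # documented default
        reduction="sum",     # documented default
        tp_group=tp_group,   # None when not using tensor parallelism
    )
    # logits: [B, T, V], labels: [B, T], mask: [B, T] (optional)
    return loss_fn(logits, labels, mask=mask, num_label_tokens=num_label_tokens)
```

In a training loop the wrapper would normally be constructed once and reused across steps rather than rebuilt on every call as in this sketch.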

```python
nemo_automodel.components.loss.te_parallel_ce.HAVE_DTENSOR = True
```

```python
nemo_automodel.components.loss.te_parallel_ce.HAVE_TE_PARALLEL_CE = HAVE_TRITON
```

```python
nemo_automodel.components.loss.te_parallel_ce.MISSING_TE_PARALLEL_CE_MSG = MISSING_TRITON_MSG
```

Cross Entropy Loss API from NVIDIA's TransformerEngine, available under the Apache License 2.0:
[https://github.com/NVIDIA/TransformerEngine](https://github.com/NVIDIA/TransformerEngine)

```python
nemo_automodel.components.loss.te_parallel_ce.parallel_cross_entropy = CrossEntropyFunction.apply
```
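Since `parallel_cross_entropy` is `CrossEntropyFunction.apply`, it takes the arguments of `forward` without `ctx`, which autograd supplies. The sketch below passes them positionally, the usual way of calling `torch.autograd.Function.apply`. The helper name `unreduced_loss` is hypothetical, and the values shown simply repeat the documented defaults.

```python
def unreduced_loss(logits, target, tp_group=None):
    """Hypothetical helper: call the autograd function through its `apply` alias."""
    # Deferred import: only needed when the helper is actually called.
    from nemo_automodel.components.loss.te_parallel_ce import parallel_cross_entropy

    # Positional order follows CrossEntropyFunction.forward, minus ctx:
    # _input, target, label_smoothing, reduce_loss, dist_process_group, ignore_idx
    return parallel_cross_entropy(logits, target, 0.0, False, tp_group, -100)
```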