# nemo_automodel.components.loss.mtp

## Module Contents

### Classes

| Name                                                                               | Description                                                             |
| ---------------------------------------------------------------------------------- | ----------------------------------------------------------------------- |
| [`MTPLossConfig`](#nemo_automodel-components-loss-mtp-MTPLossConfig)               | Typed config for the Multi-Token-Prediction auxiliary loss.             |
| [`PipelineCausalLMLoss`](#nemo_automodel-components-loss-mtp-PipelineCausalLMLoss) | Pipeline schedule loss that can add MTP auxiliary CE on the last stage. |

### Functions

| Name                                                                           | Description                                                    |
| ------------------------------------------------------------------------------ | -------------------------------------------------------------- |
| [`calculate_mtp_loss`](#nemo_automodel-components-loss-mtp-calculate_mtp_loss) | Compute the DeepSeek-V3 Multi-Token Prediction auxiliary loss. |

### API

```python
class nemo_automodel.components.loss.mtp.MTPLossConfig(
    scaling_factor: float | None = None,
    ignore_index: int = -100
)
```

Dataclass

Typed config for the Multi-Token-Prediction auxiliary loss.

MTP is gated on the model emitting per-depth outputs; this config only
carries its hyperparameters. `scaling_factor=None` keeps the
model-provided value (`out.mtp_loss_scaling_factor` /
`get_mtp_loss_scaling_factor`); set it to override.
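The override rule above can be sketched in plain Python. This is a minimal illustration, not the library's code: `SketchMTPConfig` and `resolve_scaling_factor` are hypothetical names, and the model-provided value is passed in directly rather than read from the model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchMTPConfig:
    """Hypothetical stand-in that mirrors the two documented fields."""

    scaling_factor: Optional[float] = None
    ignore_index: int = -100


def resolve_scaling_factor(cfg: SketchMTPConfig, model_value: float) -> float:
    """Return the configured override, or the model-provided value when unset."""
    # Compare against None explicitly so an override of 0.0 is still honored.
    return model_value if cfg.scaling_factor is None else cfg.scaling_factor


kept = resolve_scaling_factor(SketchMTPConfig(), model_value=0.1)
overridden = resolve_scaling_factor(SketchMTPConfig(scaling_factor=0.3), model_value=0.1)
zeroed = resolve_scaling_factor(SketchMTPConfig(scaling_factor=0.0), model_value=0.1)
```

The explicit `is None` check is a design choice in this sketch: a truthiness test would silently discard a deliberate `0.0` override.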

```python
nemo_automodel.components.loss.mtp.MTPLossConfig.build(
    loss_fn: torch.nn.Module,
    model: torch.nn.Module
) -> nemo_automodel.components.loss.mtp.PipelineCausalLMLoss
```

Build the pipeline-schedule, MTP-aware loss for `loss_fn`/`model`.

```python
class nemo_automodel.components.loss.mtp.PipelineCausalLMLoss(
    loss_fn: torch.nn.Module,
    model: torch.nn.Module,
    scaling_factor: float | None = None,
    ignore_index: int = -100
)
```

**Bases:** `Module`

Pipeline schedule loss that can add MTP auxiliary CE on the last stage.

Per-microbatch `seq_idx` is read from a trailing element of the
last-stage output tuple: the model appends a `[B, S]` `int32` tail
when MTP is enabled. This binds each microbatch's `seq_idx` to its loss
call via the PP runtime's output→loss contract, so the wiring is
schedule-agnostic. Legacy `cu_seqlens` (THD path) is a fallback for
models that don't emit a `seq_idx` tail.

```python
nemo_automodel.components.loss.mtp.PipelineCausalLMLoss._extract_seq_idx_tail(
    output
) -> tuple[typing.Optional[torch.Tensor], object]
```

staticmethod

Detect and strip a trailing per-microbatch `seq_idx` from `output`.

Convention: with MTP enabled, the last-stage output is
`(logits, *mtp_per_depth_h, seq_idx)`, where the trailing `seq_idx` is a
`[B, S]` `int32` tensor. The dtype alone distinguishes the tail from the
other elements.
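The convention above can be illustrated with a small PyTorch sketch. The function `split_seq_idx_tail` is a hypothetical name and its exact checks are assumptions. It only demonstrates dtype-based detection of the tail and is not the library's implementation.

```python
import torch


def split_seq_idx_tail(output):
    """Return (seq_idx or None, output with any seq_idx tail removed)."""
    if isinstance(output, (tuple, list)) and len(output) > 1:
        tail = output[-1]
        # Logits and hidden states are floating point, so an int32 tensor
        # in the last slot is treated as the seq_idx tail.
        if isinstance(tail, torch.Tensor) and tail.dtype == torch.int32:
            return tail, tuple(output[:-1])
    return None, output


B, S, V, H = 2, 4, 8, 16  # illustrative sizes only
logits = torch.randn(B, S, V)
depth_h = torch.randn(B, S, H)
seq_idx = torch.zeros(B, S, dtype=torch.int32)

tail, rest = split_seq_idx_tail((logits, depth_h, seq_idx))
no_tail, untouched = split_seq_idx_tail((logits, depth_h))
```

When no tail is present the output is returned unchanged, which is where a fallback such as `cu_seqlens` would apply.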

```python
nemo_automodel.components.loss.mtp.PipelineCausalLMLoss.forward(
    output,
    labels: torch.Tensor
) -> torch.Tensor
```

```python
nemo_automodel.components.loss.mtp.calculate_mtp_loss(
    loss_fn,
    mtp_per_depth_h: list[torch.Tensor] | None = None,
    mtp_per_depth_logits: list[torch.Tensor] | None = None,
    labels: torch.Tensor,
    model: torch.nn.Module,
    scaling_factor: float = 0.1,
    num_label_tokens: typing.Optional[int] = None,
    ignore_index: int = -100,
    cu_seqlens: typing.Optional[torch.Tensor] = None,
    seq_idx: typing.Optional[torch.Tensor] = None,
    lm_weight: typing.Optional[torch.Tensor] = None
) -> torch.Tensor
```

Compute the DeepSeek-V3 Multi-Token Prediction auxiliary loss.

Each depth's CE is dispatched through `calculate_loss` with the
same loss class as the main path, so MTP inherits FusedLinearCrossEntropy
/ MaskedCrossEntropy memory and numerical characteristics.

**Parameters:**

`loss_fn`: Configured per-token loss class (same instance the main
path uses).

`mtp_per_depth_h`: Per-depth hidden states from the model's MTP head,
one `[B, S, H]` tensor per depth.

`labels`: Original (unshifted) labels.

`model`: The wrapped model; used to fetch the shared LM head when the
loss class needs materialized logits (non-FusedLinearCE path).

`scaling_factor`: Coefficient applied to the summed per-depth CE.

`num_label_tokens`: Total non-ignore label tokens (forwarded to the
base loss for sum-reduction normalization).

`ignore_index`: Label value masked out of the CE loss for the trailing
`k+1` rolled positions at depth `k`.

`cu_seqlens`: Optional cumulative sequence lengths `[num_seqs+1]`
(THD-pack layout). When supplied and `seq_idx` is not, the function builds
a per-token sub-sequence index via searchsorted. Without packing
this can be omitted.

`seq_idx`: Optional per-token sub-sequence index `[B, S]` (or `[S]`).
Equality classes are what matter; absolute values can be any
ints. Takes precedence over `cu_seqlens`. Used to mask label
rolls whose source position lies in a different sub-sequence.

`lm_weight`: Optional caller-materialized LM-head weight. Supplying this
lets the main loss and all MTP depths share one DTensor
`full_tensor()` gather on the FusedLinearCrossEntropy path.

**Returns:** `torch.Tensor`

Scalar MTP loss with autograd graph.
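The label masking described for `ignore_index`, `cu_seqlens`, and `seq_idx` can be sketched on a single packed sequence in plain Python. This is an illustration under stated assumptions, not the library's code: the helper names are hypothetical, depth `k` is taken to be 0-indexed with a roll of `k+1` positions, and the stdlib `bisect` module stands in for a tensor searchsorted.

```python
from bisect import bisect_right

IGNORE_INDEX = -100


def seq_idx_from_cu_seqlens(cu_seqlens):
    """Map each token position to the index of the packed sub-sequence it belongs to."""
    total = cu_seqlens[-1]
    return [bisect_right(cu_seqlens, pos) - 1 for pos in range(total)]


def roll_and_mask_labels(labels, seq_idx, depth, ignore_index=IGNORE_INDEX):
    """Shift labels left by depth + 1 and mask targets that are invalid."""
    shift = depth + 1
    n = len(labels)
    rolled = []
    for t in range(n):
        src = t + shift
        # Mask the trailing positions (no source token) and any position
        # whose source token lies in a different sub-sequence.
        if src >= n or seq_idx[src] != seq_idx[t]:
            rolled.append(ignore_index)
        else:
            rolled.append(labels[src])
    return rolled


cu_seqlens = [0, 3, 5, 9]  # three packed sequences of lengths 3, 2, 4
seq_idx = seq_idx_from_cu_seqlens(cu_seqlens)
labels = [10, 11, 12, 20, 21, 30, 31, 32, 33]

depth0 = roll_and_mask_labels(labels, seq_idx, depth=0)
depth1 = roll_and_mask_labels(labels, seq_idx, depth=1)
# seq_idx → [0, 0, 0, 1, 1, 2, 2, 2, 2]
# depth0  → [11, 12, -100, 21, -100, 31, 32, 33, -100]
# depth1  → [12, -100, -100, -100, -100, 32, 33, -100, -100]
```

Only equality between `seq_idx` entries is used, which matches the note that absolute values can be any ints.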