> For clean Markdown content of this page, append .md to this URL. For the complete documentation index, see https://docs.nvidia.com/dynamo/llms.txt. For full content including API reference and SDK examples, see https://docs.nvidia.com/dynamo/llms-full.txt.

# Nemotron-3-Ultra

Each target below is a validated aggregated vLLM deployment of Nemotron-3-Ultra — NVIDIA's \~550B hybrid Mamba/Attention/MoE model (\~55B active) — with MTP speculative decoding (1 token) and KV-aware routing; the B200 agentic target measured 310.8 system output tok/s per GPU on its trace. Pick your GPU and workload; every command on this page updates to match.

<p>
  Choose your deployment target
</p>

GPU

B200 Recommended

<input type="radio" id="recipe-sku-h200" name="recipe-sku" value="h200" />

H200

Workload

Chat

<input type="radio" id="recipe-usecase-agentic" name="recipe-usecase" value="agentic" />

Agentic

<b>Checkpoint</b> nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4

<b>Precision</b> NVFP4 + FP8

<b>GPUs</b> 4x B200 per worker, TP4 + EP

<b>Spec decode</b> MTP, 1 token

<b>Routing</b> KV-aware

<b>Workload</b> Chat 8K/1K Moontrace, 70% KV reuse

<b>Checkpoint</b> nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4

<b>Precision</b> NVFP4 + FP8

<b>GPUs</b> 4x B200 per worker, TP4 + EP

<b>Spec decode</b> MTP, 1 token

<b>Routing</b> KV-aware

<b>Workload</b> Agentic 64K/400 Moontrace, 90% KV reuse

<b>Checkpoint</b> nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4

<b>Precision</b> NVFP4 + FP8

<b>GPUs</b> 8x H200 per worker, TP8 + EP

<b>Spec decode</b> MTP, 1 token

<b>Routing</b> KV-aware

<b>Workload</b> Chat 8K/1K Moontrace, 70% KV reuse

<b>Checkpoint</b> nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4

<b>Precision</b> NVFP4 + FP8

<b>GPUs</b> 8x H200 per worker, TP8 + EP

<b>Spec decode</b> MTP, 1 token

<b>Routing</b> KV-aware

<b>Workload</b> Agentic 64K/400 Moontrace, 90% KV reuse

## Prerequisites

* A Kubernetes cluster with the Dynamo Platform installed (DGD CRDs served) and **4x B200 per aggregated worker** available (8x B200 for the disaggregated fallback).
* An NGC image pull secret named `nvcr-secret` for `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.3.0-nemotron-ultra-dev.1`.
* A Hugging Face token with access to `nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4`.
* A `shared-model-cache` PVC containing the tokenizer-patched Ultra model view, or permission to create and populate it with the manifests in `model-cache/` (\~1200 Gi).

- A Kubernetes cluster with the Dynamo Platform installed (DGD CRDs served) and **8x H200 per worker** available.
- An NGC image pull secret named `nvcr-secret` for `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.3.0-nemotron-ultra-dev.1`.
- A Hugging Face token with access to `nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4`.
- A `shared-model-cache` PVC containing the tokenizer-patched Ultra model view, or permission to create and populate it with the manifests in `model-cache/` (\~1200 Gi).

Create the namespace and token secret:

```bash
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="$HF_TOKEN" \
  -n ${NAMESPACE}
```

Edit namespace, storage class, image tags, node selectors, and cluster-specific placement in the manifests before applying them.

## Deploy

Create and populate the model cache, then validate the patched model view before deploying any server:

```bash
# 1. Storage — edit storageClassName in model-cache.yaml first (kubectl get storageclass).
kubectl apply -f recipes/nemotron-3-ultra/model-cache/model-cache.yaml -n ${NAMESPACE}

# 2. Download the checkpoint and build the tokenizer-patched model view.
kubectl apply -f recipes/nemotron-3-ultra/model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/nemotron-ultra-model-download -n ${NAMESPACE} --timeout=12h

# 3. Validate the patched model view (model/tokenizer/parser files).
kubectl apply -f recipes/nemotron-3-ultra/model-cache/model-validate.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/nemotron-ultra-model-validate -n ${NAMESPACE} --timeout=30m
```

Then deploy:

```bash
kubectl apply -f recipes/nemotron-3-ultra/vllm/agg-b200-chat-mtp/deploy.yaml -n ${NAMESPACE}
kubectl get dgd ultra-agg-b200-chat-mtp -n ${NAMESPACE} -w
```

```bash
kubectl apply -f recipes/nemotron-3-ultra/vllm/agg-b200-agentic-mtp/deploy.yaml -n ${NAMESPACE}
kubectl get dgd ultra-agg-b200-agentic-mtp -n ${NAMESPACE} -w
```

### Alternate topology: disaggregated fallback (1P1D, no MTP)

A disaggregated B200 agentic fallback splits prefill and decode into separate TP4 workers (8x B200 total: 4 prefill + 4 decode) with KV-aware routing plus P/D transfer, and runs without MTP. Its frontend service is `ultra-disagg-b200-1p1d-agentic-nomtp-frontend`; benchmark it by retargeting the same perf Job at that endpoint with the agentic trace at concurrency 32:

```bash
kubectl apply -f recipes/nemotron-3-ultra/vllm/disagg-b200-agentic/deploy.yaml -n ${NAMESPACE}
kubectl get dgd ultra-disagg-b200-1p1d-agentic-nomtp -n ${NAMESPACE} -w
```

Measured on the 15% agentic trace at concurrency 32: **61.6 user output tok/s** and **231.1 system output tok/s/GPU** (also listed in the performance table below).

```bash
kubectl apply -f recipes/nemotron-3-ultra/vllm/agg-h200-chat-mtp/deploy.yaml -n ${NAMESPACE}
kubectl get dgd ultra-agg-h200-chat-mtp -n ${NAMESPACE} -w
```

```bash
kubectl apply -f recipes/nemotron-3-ultra/vllm/agg-h200-agentic-mtp/deploy.yaml -n ${NAMESPACE}
kubectl get dgd ultra-agg-h200-agentic-mtp -n ${NAMESPACE} -w
```

## Smoke Test

Send a test request to verify the deployment serves traffic:

```bash
kubectl port-forward svc/ultra-agg-b200-chat-mtp-frontend 8000:8000 -n ${NAMESPACE}
```

```bash
kubectl port-forward svc/ultra-agg-b200-agentic-mtp-frontend 8000:8000 -n ${NAMESPACE}
```

```bash
kubectl port-forward svc/ultra-agg-h200-chat-mtp-frontend 8000:8000 -n ${NAMESPACE}
```

```bash
kubectl port-forward svc/ultra-agg-h200-agentic-mtp-frontend 8000:8000 -n ${NAMESPACE}
```

```bash
MODEL_ID=nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4

curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"${MODEL_ID}\",
       \"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],
       \"max_tokens\":64,
       \"chat_template_kwargs\":{\"enable_thinking\":false,\"force_nonempty_content\":true}}"
```

## Benchmark

A single AIPerf trace-replay Job ([`perf/perf.yaml`](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-ultra/perf/perf.yaml)) covers every target — only `ENDPOINT`, `TRACE_FILE`, and `CONCURRENCY` change in its env block. First stage the bundled Moontrace files from [`recipes/nemotron-3-ultra/perf/traces/`](https://github.com/ai-dynamo/dynamo/tree/main/recipes/nemotron-3-ultra/perf/traces) onto the `shared-model-cache` PVC:

```bash
kubectl run pvc-helper -n ${NAMESPACE} \
  --image=busybox:1.36 --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"helper","image":"busybox:1.36","command":["sleep","3600"],"volumeMounts":[{"name":"model-cache","mountPath":"/opt/models"}]}],"volumes":[{"name":"model-cache","persistentVolumeClaim":{"claimName":"shared-model-cache"}}]}}' \
  --command -- sleep 3600

kubectl exec -n ${NAMESPACE} pvc-helper -- mkdir -p /opt/models/traces
kubectl cp recipes/nemotron-3-ultra/perf/traces/. ${NAMESPACE}/pvc-helper:/opt/models/traces/
```

Set `ENDPOINT` to `ultra-agg-b200-chat-mtp-frontend:8000` (the Job default) with the chat trace at concurrency 18, then apply. The Job wraps this AIPerf raw Moontrace replay:

```bash
aiperf profile \
  -m nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
  --tokenizer /opt/models/patched/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
  --tokenizer-trust-remote-code \
  --input-file /opt/models/traces/nim_turbo_8k_1k_70kv_chat_new_noschedule_short_15perc.jsonl \
  --custom-dataset-type mooncake_trace \
  --prompt-input-tokens-block-size 512 \
  --url http://ultra-agg-b200-chat-mtp-frontend:8000 \
  --streaming --use-server-token-count \
  --extra-inputs ignore_eos:true \
  --concurrency 18 --random-seed 42
```

Set `ENDPOINT` to `ultra-agg-b200-agentic-mtp-frontend:8000` with the agentic trace at concurrency 20, then apply. The Job wraps this AIPerf raw Moontrace replay:

```bash
aiperf profile \
  -m nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
  --tokenizer /opt/models/patched/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
  --tokenizer-trust-remote-code \
  --input-file /opt/models/traces/nim_turbo_64k_400_90kv_agent_new_noschedule_short_15perc.jsonl \
  --custom-dataset-type mooncake_trace \
  --prompt-input-tokens-block-size 512 \
  --url http://ultra-agg-b200-agentic-mtp-frontend:8000 \
  --streaming --use-server-token-count \
  --extra-inputs ignore_eos:true \
  --concurrency 20 --random-seed 42
```

For the disaggregated fallback, point `ENDPOINT` at `ultra-disagg-b200-1p1d-agentic-nomtp-frontend:8000` with the same agentic trace at concurrency 32.

Set `ENDPOINT` to `ultra-agg-h200-chat-mtp-frontend:8000` with the chat trace at concurrency 10, then apply. The Job wraps this AIPerf raw Moontrace replay:

```bash
aiperf profile \
  -m nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
  --tokenizer /opt/models/patched/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
  --tokenizer-trust-remote-code \
  --input-file /opt/models/traces/nim_turbo_8k_1k_70kv_chat_new_noschedule_short_15perc.jsonl \
  --custom-dataset-type mooncake_trace \
  --prompt-input-tokens-block-size 512 \
  --url http://ultra-agg-h200-chat-mtp-frontend:8000 \
  --streaming --use-server-token-count \
  --extra-inputs ignore_eos:true \
  --concurrency 10 --random-seed 42
```

Set `ENDPOINT` to `ultra-agg-h200-agentic-mtp-frontend:8000` with the agentic trace at concurrency 8, then apply. The Job wraps this AIPerf raw Moontrace replay:

```bash
aiperf profile \
  -m nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
  --tokenizer /opt/models/patched/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
  --tokenizer-trust-remote-code \
  --input-file /opt/models/traces/nim_turbo_64k_400_90kv_agent_new_noschedule_short_15perc.jsonl \
  --custom-dataset-type mooncake_trace \
  --prompt-input-tokens-block-size 512 \
  --url http://ultra-agg-h200-agentic-mtp-frontend:8000 \
  --streaming --use-server-token-count \
  --extra-inputs ignore_eos:true \
  --concurrency 8 --random-seed 42
```

```bash
kubectl apply -f recipes/nemotron-3-ultra/perf/perf.yaml -n ${NAMESPACE}
kubectl logs -n ${NAMESPACE} -l job-name=ultra-bench -f
kubectl wait --for=condition=Complete job/ultra-bench -n ${NAMESPACE} --timeout=7200s
```

Artifacts land on the PVC under `/opt/models/perf/<epoch>_ultra-bench/`. 15% and 30% prefix-slice traces are provided for shorter runs. For concurrency sweeps, delete the worker pods between runs so residual KV/prefix-cache state does not skew results — see the [benchmark README](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-ultra/perf/README.md) for the full workflow, artifact layout, and tunable environment variables.

## Expected Performance

Each target is tuned for its workload shape:

| Workload | Median ISL | Median OSL | KV cache hit rate |
| -------- | ---------: | ---------: | ----------------: |
| Chat     |         8K |         1K |               70% |
| Agentic  |        64K |        400 |               90% |

B200 rows use 15% raw Moontrace replay with `raw_direct_no_filter` trace semantics; H200 rows use 300-sample replay evidence. `User output tok/s` is Gen TPS/user p50 from AIPerf; `System output tok/s/GPU` is TPS/GPU. Your selected target's rows are highlighted:

<table>
  <thead>
    <tr><th>Recipe</th><th>GPU</th><th>Topology</th><th>Workload</th><th>MTP</th><th>Concurrency</th><th>User output tok/s</th><th>System output tok/s/GPU</th></tr>
  </thead>

  <tbody>
    <tr data-sku="b200" data-usecase="chat">
      <td><code>vllm/agg-b200-chat-mtp/deploy.yaml</code></td>

      <td>B200</td>

      <td>AGG</td>

      <td>chat</td>

      <td>yes</td>

      <td>18</td>

      <td>52.0</td>

      <td>201.4</td>
    </tr>

    <tr data-sku="b200" data-usecase="chat">
      <td><code>vllm/agg-b200-chat-nomtp/deploy.yaml</code></td>

      <td>B200</td>

      <td>AGG</td>

      <td>chat</td>

      <td>no</td>

      <td>16</td>

      <td>51.0</td>

      <td>181.3</td>
    </tr>

    <tr data-sku="b200" data-usecase="agentic">
      <td><code>vllm/agg-b200-agentic-mtp/deploy.yaml</code></td>

      <td>B200</td>

      <td>AGG</td>

      <td>agentic</td>

      <td>yes</td>

      <td>20</td>

      <td>80.6</td>

      <td>310.8</td>
    </tr>

    <tr data-sku="b200" data-usecase="agentic">
      <td><code>vllm/agg-b200-agentic-nomtp/deploy.yaml</code></td>

      <td>B200</td>

      <td>AGG</td>

      <td>agentic</td>

      <td>no</td>

      <td>8</td>

      <td>99.5</td>

      <td>175.9</td>
    </tr>

    <tr data-sku="b200" data-usecase="agentic">
      <td><code>vllm/disagg-b200-agentic/deploy.yaml</code></td>

      <td>B200</td>

      <td>1P1D</td>

      <td>agentic</td>

      <td>no</td>

      <td>32</td>

      <td>61.6</td>

      <td>231.1</td>
    </tr>

    <tr data-sku="h200" data-usecase="chat">
      <td><code>vllm/agg-h200-chat-mtp/deploy.yaml</code></td>

      <td>H200</td>

      <td>AGG</td>

      <td>chat</td>

      <td>yes</td>

      <td>10</td>

      <td>58.7</td>

      <td>46.8</td>
    </tr>

    <tr data-sku="h200" data-usecase="chat">
      <td><code>vllm/agg-h200-chat-nomtp/deploy.yaml</code></td>

      <td>H200</td>

      <td>AGG</td>

      <td>chat</td>

      <td>no</td>

      <td>8</td>

      <td>54.2</td>

      <td>43.0</td>
    </tr>

    <tr data-sku="h200" data-usecase="agentic">
      <td><code>vllm/agg-h200-agentic-mtp/deploy.yaml</code></td>

      <td>H200</td>

      <td>AGG</td>

      <td>agentic</td>

      <td>yes</td>

      <td>8</td>

      <td>53.2</td>

      <td>27.4</td>
    </tr>

    <tr data-sku="h200" data-usecase="agentic">
      <td><code>vllm/agg-h200-agentic-nomtp/deploy.yaml</code></td>

      <td>H200</td>

      <td>AGG</td>

      <td>agentic</td>

      <td>no</td>

      <td>8</td>

      <td>52.3</td>

      <td>26.5</td>
    </tr>
  </tbody>
</table>

Treat each row together with its matching recipe, image, trace, and server-shape artifacts.

## Compare All Targets

All targets serve `nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4` on the dedicated dev runtime image `vllm-runtime:1.3.0-nemotron-ultra-dev.1` with KV-aware routing and a 262144 max model length:

|                           | B200 chat    | H200 chat    | B200 agentic  | H200 agentic  | B200 disagg agentic              |
| ------------------------- | ------------ | ------------ | ------------- | ------------- | -------------------------------- |
| **GPUs**                  | 4x B200      | 8x H200      | 4x B200       | 8x H200       | 4x B200 prefill + 4x B200 decode |
| **Mode**                  | aggregated   | aggregated   | aggregated    | aggregated    | disaggregated 1P1D               |
| **Parallelism**           | TP4 + EP     | TP8 + EP     | TP4 + EP      | TP8 + EP      | TP4 prefill + TP4 decode         |
| **Spec decode**           | MTP, 1 token | MTP, 1 token | MTP, 1 token  | MTP, 1 token  | no MTP                           |
| **Max sequences**         | 64           | 16           | 24            | 32            | 32                               |
| **Reference concurrency** | 18           | 10           | 20            | 8             | 32                               |
| **Workload**              | Chat trace   | Chat trace   | Agentic trace | Agentic trace | Agentic trace                    |

## Notes

* This is a Day-0 recipe on a dedicated dev runtime image (`vllm-runtime:1.3.0-nemotron-ultra-dev.1`); it is functional and benchmarked but not yet promoted to a release runtime image.
* The recipes pin `VLLM_DISABLED_KERNELS=FlashInferFP8ScaledMMLinearKernel` and pass `--no-enable-flashinfer-autotune` on vLLM workers. Do not remove these unless rerunning the benchmark qualification — they select the non-FlashInfer FP8 linear kernel path and avoid a measured vLLM 0.22 FlashInfer FP8 regression.
* No-MTP fallback manifests are included for every aggregated target at `vllm/agg-<sku>-<usecase>-nomtp/deploy.yaml`; their DGD names carry the `-nomtp` suffix, and their measured rows appear in the performance table above.
* Reasoning is controlled per request via `chat_template_kwargs` (`enable_thinking`, `force_nonempty_content`) and `nvext.max_thinking_tokens`. Do not send `force_nonempty_content` as a top-level request parameter. Top-level reasoning controls such as `include_reasoning` and `reasoning_effort` are part of shared Dynamo API compatibility work, not Ultra-specific failures.
* Raw Moontrace replay may contain over-context or pathological long-generation rows. Preserve them as HTTP/error evidence rather than dropping them silently.
* Tool calling uses the `qwen3_coder` parser; reasoning parsing uses the model-local `ultra_v3_reasoning_parser.py` (validated by the model-validate Job).

## Source

* Source README: [recipes/nemotron-3-ultra/README.md](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-ultra/README.md)
* Benchmark workflow: [recipes/nemotron-3-ultra/perf/README.md](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-ultra/perf/README.md) and [perf.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-ultra/perf/perf.yaml)
* B200 chat + MTP: [vllm/agg-b200-chat-mtp/deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-ultra/vllm/agg-b200-chat-mtp/deploy.yaml)
* B200 agentic + MTP: [vllm/agg-b200-agentic-mtp/deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-ultra/vllm/agg-b200-agentic-mtp/deploy.yaml)
* H200 chat + MTP: [vllm/agg-h200-chat-mtp/deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-ultra/vllm/agg-h200-chat-mtp/deploy.yaml)
* H200 agentic + MTP: [vllm/agg-h200-agentic-mtp/deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-ultra/vllm/agg-h200-agentic-mtp/deploy.yaml)
* B200 disaggregated agentic: [vllm/disagg-b200-agentic/deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-ultra/vllm/disagg-b200-agentic/deploy.yaml)
* No-MTP fallbacks: [vllm/](https://github.com/ai-dynamo/dynamo/tree/main/recipes/nemotron-3-ultra/vllm) (`agg-*-nomtp/deploy.yaml`)
* Model cache setup: [model-cache/](https://github.com/ai-dynamo/dynamo/tree/main/recipes/nemotron-3-ultra/model-cache) (`model-cache.yaml`, `model-download.yaml`, `model-validate.yaml`)
Recipe	GPU	Topology	Workload	MTP	Concurrency	User output tok/s	System output tok/s/GPU
`vllm/agg-b200-chat-mtp/deploy.yaml`	B200	AGG	chat	yes	18	52.0	201.4
`vllm/agg-b200-chat-nomtp/deploy.yaml`	B200	AGG	chat	no	16	51.0	181.3
`vllm/agg-b200-agentic-mtp/deploy.yaml`	B200	AGG	agentic	yes	20	80.6	310.8
`vllm/agg-b200-agentic-nomtp/deploy.yaml`	B200	AGG	agentic	no	8	99.5	175.9
`vllm/disagg-b200-agentic/deploy.yaml`	B200	1P1D	agentic	no	32	61.6	231.1
`vllm/agg-h200-chat-mtp/deploy.yaml`	H200	AGG	chat	yes	10	58.7	46.8
`vllm/agg-h200-chat-nomtp/deploy.yaml`	H200	AGG	chat	no	8	54.2	43.0
`vllm/agg-h200-agentic-mtp/deploy.yaml`	H200	AGG	agentic	yes	8	53.2	27.4
`vllm/agg-h200-agentic-nomtp/deploy.yaml`	H200	AGG	agentic	no	8	52.3	26.5