Nemotron-3-Ultra | NVIDIA Dynamo Documentation

Each target below is a validated aggregated vLLM deployment of Nemotron-3-Ultra — NVIDIA’s ~550B hybrid Mamba/Attention/MoE model (~55B active) — with MTP speculative decoding (1 token) and KV-aware routing; the B200 agentic target measured 310.8 system output tok/s per GPU on its trace. Pick your GPU and workload; every command on this page updates to match.

Choose your deployment target

GPUB200 RecommendedH200

WorkloadChatAgentic

Checkpoint nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4Precision NVFP4 + FP8GPUs 4x B200 per worker, TP4 + EPSpec decode MTP, 1 tokenRouting KV-awareWorkload Chat 8K/1K Moontrace, 70% KV reuse

Checkpoint nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4Precision NVFP4 + FP8GPUs 4x B200 per worker, TP4 + EPSpec decode MTP, 1 tokenRouting KV-awareWorkload Agentic 64K/400 Moontrace, 90% KV reuse

Checkpoint nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4Precision NVFP4 + FP8GPUs 8x H200 per worker, TP8 + EPSpec decode MTP, 1 tokenRouting KV-awareWorkload Chat 8K/1K Moontrace, 70% KV reuse

Checkpoint nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4Precision NVFP4 + FP8GPUs 8x H200 per worker, TP8 + EPSpec decode MTP, 1 tokenRouting KV-awareWorkload Agentic 64K/400 Moontrace, 90% KV reuse

Prerequisites

A Kubernetes cluster with the Dynamo Platform installed (DGD CRDs served) and 4x B200 per aggregated worker available (8x B200 for the disaggregated fallback).
An NGC image pull secret named nvcr-secret for nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.3.0-nemotron-ultra-dev.1.
A Hugging Face token with access to nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4.
A shared-model-cache PVC containing the tokenizer-patched Ultra model view, or permission to create and populate it with the manifests in model-cache/ (~1200 Gi).

A Kubernetes cluster with the Dynamo Platform installed (DGD CRDs served) and 8x H200 per worker available.
An NGC image pull secret named nvcr-secret for nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.3.0-nemotron-ultra-dev.1.
A Hugging Face token with access to nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4.
A shared-model-cache PVC containing the tokenizer-patched Ultra model view, or permission to create and populate it with the manifests in model-cache/ (~1200 Gi).

Create the namespace and token secret:

$ export NAMESPACE=your-namespace
$ kubectl create namespace ${NAMESPACE}
$ kubectl create secret generic hf-token-secret \
>   --from-literal=HF_TOKEN="$HF_TOKEN" \
>   -n ${NAMESPACE}

Edit namespace, storage class, image tags, node selectors, and cluster-specific placement in the manifests before applying them.

Deploy

Create and populate the model cache, then validate the patched model view before deploying any server:

$ # 1. Storage — edit storageClassName in model-cache.yaml first (kubectl get storageclass).
$ kubectl apply -f recipes/nemotron-3-ultra/model-cache/model-cache.yaml -n ${NAMESPACE}
$ 
$ # 2. Download the checkpoint and build the tokenizer-patched model view.
$ kubectl apply -f recipes/nemotron-3-ultra/model-cache/model-download.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/nemotron-ultra-model-download -n ${NAMESPACE} --timeout=12h
$ 
$ # 3. Validate the patched model view (model/tokenizer/parser files).
$ kubectl apply -f recipes/nemotron-3-ultra/model-cache/model-validate.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/nemotron-ultra-model-validate -n ${NAMESPACE} --timeout=30m

Then deploy:

$ kubectl apply -f recipes/nemotron-3-ultra/vllm/agg-b200-chat-mtp/deploy.yaml -n ${NAMESPACE}
$ kubectl get dgd ultra-agg-b200-chat-mtp -n ${NAMESPACE} -w

$ kubectl apply -f recipes/nemotron-3-ultra/vllm/agg-b200-agentic-mtp/deploy.yaml -n ${NAMESPACE}
$ kubectl get dgd ultra-agg-b200-agentic-mtp -n ${NAMESPACE} -w

Alternate topology: disaggregated fallback (1P1D, no MTP)

A disaggregated B200 agentic fallback splits prefill and decode into separate TP4 workers (8x B200 total: 4 prefill + 4 decode) with KV-aware routing plus P/D transfer, and runs without MTP. Its frontend service is ultra-disagg-b200-1p1d-agentic-nomtp-frontend; benchmark it by retargeting the same perf Job at that endpoint with the agentic trace at concurrency 32:

$ kubectl apply -f recipes/nemotron-3-ultra/vllm/disagg-b200-agentic/deploy.yaml -n ${NAMESPACE}
$ kubectl get dgd ultra-disagg-b200-1p1d-agentic-nomtp -n ${NAMESPACE} -w

Measured on the 15% agentic trace at concurrency 32: 61.6 user output tok/s and 231.1 system output tok/s/GPU (also listed in the performance table below).

$ kubectl apply -f recipes/nemotron-3-ultra/vllm/agg-h200-chat-mtp/deploy.yaml -n ${NAMESPACE}
$ kubectl get dgd ultra-agg-h200-chat-mtp -n ${NAMESPACE} -w

$ kubectl apply -f recipes/nemotron-3-ultra/vllm/agg-h200-agentic-mtp/deploy.yaml -n ${NAMESPACE}
$ kubectl get dgd ultra-agg-h200-agentic-mtp -n ${NAMESPACE} -w

Smoke Test

Send a test request to verify the deployment serves traffic:

$ kubectl port-forward svc/ultra-agg-b200-chat-mtp-frontend 8000:8000 -n ${NAMESPACE}

$ kubectl port-forward svc/ultra-agg-b200-agentic-mtp-frontend 8000:8000 -n ${NAMESPACE}

$ kubectl port-forward svc/ultra-agg-h200-chat-mtp-frontend 8000:8000 -n ${NAMESPACE}

$ kubectl port-forward svc/ultra-agg-h200-agentic-mtp-frontend 8000:8000 -n ${NAMESPACE}

$ MODEL_ID=nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4
$ 
$ curl http://localhost:8000/v1/models
$ curl http://localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d "{\"model\":\"${MODEL_ID}\",
>        \"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],
>        \"max_tokens\":64,
>        \"chat_template_kwargs\":{\"enable_thinking\":false,\"force_nonempty_content\":true}}"

Benchmark

A single AIPerf trace-replay Job (perf/perf.yaml) covers every target — only ENDPOINT, TRACE_FILE, and CONCURRENCY change in its env block. First stage the bundled Moontrace files from recipes/nemotron-3-ultra/perf/traces/ onto the shared-model-cache PVC:

$ kubectl run pvc-helper -n ${NAMESPACE} \
>   --image=busybox:1.36 --restart=Never \
>   --overrides='{"spec":{"containers":[{"name":"helper","image":"busybox:1.36","command":["sleep","3600"],"volumeMounts":[{"name":"model-cache","mountPath":"/opt/models"}]}],"volumes":[{"name":"model-cache","persistentVolumeClaim":{"claimName":"shared-model-cache"}}]}}' \
>   --command -- sleep 3600
$ 
$ kubectl exec -n ${NAMESPACE} pvc-helper -- mkdir -p /opt/models/traces
$ kubectl cp recipes/nemotron-3-ultra/perf/traces/. ${NAMESPACE}/pvc-helper:/opt/models/traces/

Set ENDPOINT to ultra-agg-b200-chat-mtp-frontend:8000 (the Job default) with the chat trace at concurrency 18, then apply. The Job wraps this AIPerf raw Moontrace replay:

$ aiperf profile \
>   -m nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
>   --tokenizer /opt/models/patched/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
>   --tokenizer-trust-remote-code \
>   --input-file /opt/models/traces/nim_turbo_8k_1k_70kv_chat_new_noschedule_short_15perc.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --prompt-input-tokens-block-size 512 \
>   --url http://ultra-agg-b200-chat-mtp-frontend:8000 \
>   --streaming --use-server-token-count \
>   --extra-inputs ignore_eos:true \
>   --concurrency 18 --random-seed 42

Set ENDPOINT to ultra-agg-b200-agentic-mtp-frontend:8000 with the agentic trace at concurrency 20, then apply. The Job wraps this AIPerf raw Moontrace replay:

$ aiperf profile \
>   -m nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
>   --tokenizer /opt/models/patched/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
>   --tokenizer-trust-remote-code \
>   --input-file /opt/models/traces/nim_turbo_64k_400_90kv_agent_new_noschedule_short_15perc.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --prompt-input-tokens-block-size 512 \
>   --url http://ultra-agg-b200-agentic-mtp-frontend:8000 \
>   --streaming --use-server-token-count \
>   --extra-inputs ignore_eos:true \
>   --concurrency 20 --random-seed 42

For the disaggregated fallback, point ENDPOINT at ultra-disagg-b200-1p1d-agentic-nomtp-frontend:8000 with the same agentic trace at concurrency 32.

Set ENDPOINT to ultra-agg-h200-chat-mtp-frontend:8000 with the chat trace at concurrency 10, then apply. The Job wraps this AIPerf raw Moontrace replay:

$ aiperf profile \
>   -m nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
>   --tokenizer /opt/models/patched/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
>   --tokenizer-trust-remote-code \
>   --input-file /opt/models/traces/nim_turbo_8k_1k_70kv_chat_new_noschedule_short_15perc.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --prompt-input-tokens-block-size 512 \
>   --url http://ultra-agg-h200-chat-mtp-frontend:8000 \
>   --streaming --use-server-token-count \
>   --extra-inputs ignore_eos:true \
>   --concurrency 10 --random-seed 42

Set ENDPOINT to ultra-agg-h200-agentic-mtp-frontend:8000 with the agentic trace at concurrency 8, then apply. The Job wraps this AIPerf raw Moontrace replay:

$ aiperf profile \
>   -m nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
>   --tokenizer /opt/models/patched/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
>   --tokenizer-trust-remote-code \
>   --input-file /opt/models/traces/nim_turbo_64k_400_90kv_agent_new_noschedule_short_15perc.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --prompt-input-tokens-block-size 512 \
>   --url http://ultra-agg-h200-agentic-mtp-frontend:8000 \
>   --streaming --use-server-token-count \
>   --extra-inputs ignore_eos:true \
>   --concurrency 8 --random-seed 42

$ kubectl apply -f recipes/nemotron-3-ultra/perf/perf.yaml -n ${NAMESPACE}
$ kubectl logs -n ${NAMESPACE} -l job-name=ultra-bench -f
$ kubectl wait --for=condition=Complete job/ultra-bench -n ${NAMESPACE} --timeout=7200s

Artifacts land on the PVC under /opt/models/perf/<epoch>_ultra-bench/. 15% and 30% prefix-slice traces are provided for shorter runs. For concurrency sweeps, delete the worker pods between runs so residual KV/prefix-cache state does not skew results — see the benchmark README for the full workflow, artifact layout, and tunable environment variables.

Expected Performance

Each target is tuned for its workload shape:

Workload	Median ISL	Median OSL	KV cache hit rate
Chat	8K	1K	70%
Agentic	64K	400	90%

B200 rows use 15% raw Moontrace replay with raw_direct_no_filter trace semantics; H200 rows use 300-sample replay evidence. User output tok/s is Gen TPS/user p50 from AIPerf; System output tok/s/GPU is TPS/GPU. Your selected target’s rows are highlighted:

Recipe	GPU	Topology	Workload	MTP	Concurrency	User output tok/s	System output tok/s/GPU
`vllm/agg-b200-chat-mtp/deploy.yaml`	B200	AGG	chat	yes	18	52.0	201.4
`vllm/agg-b200-chat-nomtp/deploy.yaml`	B200	AGG	chat	no	16	51.0	181.3
`vllm/agg-b200-agentic-mtp/deploy.yaml`	B200	AGG	agentic	yes	20	80.6	310.8
`vllm/agg-b200-agentic-nomtp/deploy.yaml`	B200	AGG	agentic	no	8	99.5	175.9
`vllm/disagg-b200-agentic/deploy.yaml`	B200	1P1D	agentic	no	32	61.6	231.1
`vllm/agg-h200-chat-mtp/deploy.yaml`	H200	AGG	chat	yes	10	58.7	46.8
`vllm/agg-h200-chat-nomtp/deploy.yaml`	H200	AGG	chat	no	8	54.2	43.0
`vllm/agg-h200-agentic-mtp/deploy.yaml`	H200	AGG	agentic	yes	8	53.2	27.4
`vllm/agg-h200-agentic-nomtp/deploy.yaml`	H200	AGG	agentic	no	8	52.3	26.5

Treat each row together with its matching recipe, image, trace, and server-shape artifacts.

Compare All Targets

All targets serve nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 on the dedicated dev runtime image vllm-runtime:1.3.0-nemotron-ultra-dev.1 with KV-aware routing and a 262144 max model length:

	B200 chat	H200 chat	B200 agentic	H200 agentic	B200 disagg agentic
GPUs	4x B200	8x H200	4x B200	8x H200	4x B200 prefill + 4x B200 decode
Mode	aggregated	aggregated	aggregated	aggregated	disaggregated 1P1D
Parallelism	TP4 + EP	TP8 + EP	TP4 + EP	TP8 + EP	TP4 prefill + TP4 decode
Spec decode	MTP, 1 token	MTP, 1 token	MTP, 1 token	MTP, 1 token	no MTP
Max sequences	64	16	24	32	32
Reference concurrency	18	10	20	8	32
Workload	Chat trace	Chat trace	Agentic trace	Agentic trace	Agentic trace

Notes

This is a Day-0 recipe on a dedicated dev runtime image (vllm-runtime:1.3.0-nemotron-ultra-dev.1); it is functional and benchmarked but not yet promoted to a release runtime image.
The recipes pin VLLM_DISABLED_KERNELS=FlashInferFP8ScaledMMLinearKernel and pass --no-enable-flashinfer-autotune on vLLM workers. Do not remove these unless rerunning the benchmark qualification — they select the non-FlashInfer FP8 linear kernel path and avoid a measured vLLM 0.22 FlashInfer FP8 regression.
No-MTP fallback manifests are included for every aggregated target at vllm/agg-<sku>-<usecase>-nomtp/deploy.yaml; their DGD names carry the -nomtp suffix, and their measured rows appear in the performance table above.
Reasoning is controlled per request via chat_template_kwargs (enable_thinking, force_nonempty_content) and nvext.max_thinking_tokens. Do not send force_nonempty_content as a top-level request parameter. Top-level reasoning controls such as include_reasoning and reasoning_effort are part of shared Dynamo API compatibility work, not Ultra-specific failures.
Raw Moontrace replay may contain over-context or pathological long-generation rows. Preserve them as HTTP/error evidence rather than dropping them silently.
Tool calling uses the qwen3_coder parser; reasoning parsing uses the model-local ultra_v3_reasoning_parser.py (validated by the model-validate Job).

Source

Source README: recipes/nemotron-3-ultra/README.md
Benchmark workflow: recipes/nemotron-3-ultra/perf/README.md and perf.yaml
B200 chat + MTP: vllm/agg-b200-chat-mtp/deploy.yaml
B200 agentic + MTP: vllm/agg-b200-agentic-mtp/deploy.yaml
H200 chat + MTP: vllm/agg-h200-chat-mtp/deploy.yaml
H200 agentic + MTP: vllm/agg-h200-agentic-mtp/deploy.yaml
B200 disaggregated agentic: vllm/disagg-b200-agentic/deploy.yaml
No-MTP fallbacks: vllm/ (agg-*-nomtp/deploy.yaml)
Model cache setup: model-cache/ (model-cache.yaml, model-download.yaml, model-validate.yaml)