Nemotron-3-Ultra

Serve nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 with Dynamo and vLLM, tuned per GPU and workload.

View as Markdown

Each target below is a validated aggregated vLLM deployment of Nemotron-3-Ultra — NVIDIA’s ~550B hybrid Mamba/Attention/MoE model (~55B active) — with MTP speculative decoding (1 token) and KV-aware routing; the B200 agentic target measured 310.8 system output tok/s per GPU on its trace. Pick your GPU and workload; every command on this page updates to match.

Choose your deployment target

GPU
Workload
Checkpoint nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4Precision NVFP4 + FP8GPUs 4x B200 per worker, TP4 + EPSpec decode MTP, 1 tokenRouting KV-awareWorkload Chat 8K/1K Moontrace, 70% KV reuse
Checkpoint nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4Precision NVFP4 + FP8GPUs 4x B200 per worker, TP4 + EPSpec decode MTP, 1 tokenRouting KV-awareWorkload Agentic 64K/400 Moontrace, 90% KV reuse
Checkpoint nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4Precision NVFP4 + FP8GPUs 8x H200 per worker, TP8 + EPSpec decode MTP, 1 tokenRouting KV-awareWorkload Chat 8K/1K Moontrace, 70% KV reuse
Checkpoint nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4Precision NVFP4 + FP8GPUs 8x H200 per worker, TP8 + EPSpec decode MTP, 1 tokenRouting KV-awareWorkload Agentic 64K/400 Moontrace, 90% KV reuse

Prerequisites

  • A Kubernetes cluster with the Dynamo Platform installed (DGD CRDs served) and 4x B200 per aggregated worker available (8x B200 for the disaggregated fallback).
  • An NGC image pull secret named nvcr-secret for nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.3.0-nemotron-ultra-dev.1.
  • A Hugging Face token with access to nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4.
  • A shared-model-cache PVC containing the tokenizer-patched Ultra model view, or permission to create and populate it with the manifests in model-cache/ (~1200 Gi).
  • A Kubernetes cluster with the Dynamo Platform installed (DGD CRDs served) and 8x H200 per worker available.
  • An NGC image pull secret named nvcr-secret for nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.3.0-nemotron-ultra-dev.1.
  • A Hugging Face token with access to nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4.
  • A shared-model-cache PVC containing the tokenizer-patched Ultra model view, or permission to create and populate it with the manifests in model-cache/ (~1200 Gi).

Create the namespace and token secret:

$export NAMESPACE=your-namespace
$kubectl create namespace ${NAMESPACE}
$kubectl create secret generic hf-token-secret \
> --from-literal=HF_TOKEN="$HF_TOKEN" \
> -n ${NAMESPACE}

Edit namespace, storage class, image tags, node selectors, and cluster-specific placement in the manifests before applying them.

Deploy

Create and populate the model cache, then validate the patched model view before deploying any server:

$# 1. Storage — edit storageClassName in model-cache.yaml first (kubectl get storageclass).
$kubectl apply -f recipes/nemotron-3-ultra/model-cache/model-cache.yaml -n ${NAMESPACE}
$
$# 2. Download the checkpoint and build the tokenizer-patched model view.
$kubectl apply -f recipes/nemotron-3-ultra/model-cache/model-download.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=Complete job/nemotron-ultra-model-download -n ${NAMESPACE} --timeout=12h
$
$# 3. Validate the patched model view (model/tokenizer/parser files).
$kubectl apply -f recipes/nemotron-3-ultra/model-cache/model-validate.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=Complete job/nemotron-ultra-model-validate -n ${NAMESPACE} --timeout=30m

Then deploy:

$kubectl apply -f recipes/nemotron-3-ultra/vllm/agg-b200-chat-mtp/deploy.yaml -n ${NAMESPACE}
$kubectl get dgd ultra-agg-b200-chat-mtp -n ${NAMESPACE} -w
$kubectl apply -f recipes/nemotron-3-ultra/vllm/agg-b200-agentic-mtp/deploy.yaml -n ${NAMESPACE}
$kubectl get dgd ultra-agg-b200-agentic-mtp -n ${NAMESPACE} -w

Alternate topology: disaggregated fallback (1P1D, no MTP)

A disaggregated B200 agentic fallback splits prefill and decode into separate TP4 workers (8x B200 total: 4 prefill + 4 decode) with KV-aware routing plus P/D transfer, and runs without MTP. Its frontend service is ultra-disagg-b200-1p1d-agentic-nomtp-frontend; benchmark it by retargeting the same perf Job at that endpoint with the agentic trace at concurrency 32:

$kubectl apply -f recipes/nemotron-3-ultra/vllm/disagg-b200-agentic/deploy.yaml -n ${NAMESPACE}
$kubectl get dgd ultra-disagg-b200-1p1d-agentic-nomtp -n ${NAMESPACE} -w

Measured on the 15% agentic trace at concurrency 32: 61.6 user output tok/s and 231.1 system output tok/s/GPU (also listed in the performance table below).

$kubectl apply -f recipes/nemotron-3-ultra/vllm/agg-h200-chat-mtp/deploy.yaml -n ${NAMESPACE}
$kubectl get dgd ultra-agg-h200-chat-mtp -n ${NAMESPACE} -w
$kubectl apply -f recipes/nemotron-3-ultra/vllm/agg-h200-agentic-mtp/deploy.yaml -n ${NAMESPACE}
$kubectl get dgd ultra-agg-h200-agentic-mtp -n ${NAMESPACE} -w

Smoke Test

Send a test request to verify the deployment serves traffic:

$kubectl port-forward svc/ultra-agg-b200-chat-mtp-frontend 8000:8000 -n ${NAMESPACE}
$kubectl port-forward svc/ultra-agg-b200-agentic-mtp-frontend 8000:8000 -n ${NAMESPACE}
$kubectl port-forward svc/ultra-agg-h200-chat-mtp-frontend 8000:8000 -n ${NAMESPACE}
$kubectl port-forward svc/ultra-agg-h200-agentic-mtp-frontend 8000:8000 -n ${NAMESPACE}
$MODEL_ID=nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4
$
$curl http://localhost:8000/v1/models
$curl http://localhost:8000/v1/chat/completions \
> -H "Content-Type: application/json" \
> -d "{\"model\":\"${MODEL_ID}\",
> \"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],
> \"max_tokens\":64,
> \"chat_template_kwargs\":{\"enable_thinking\":false,\"force_nonempty_content\":true}}"

Benchmark

A single AIPerf trace-replay Job (perf/perf.yaml) covers every target — only ENDPOINT, TRACE_FILE, and CONCURRENCY change in its env block. First stage the bundled Moontrace files from recipes/nemotron-3-ultra/perf/traces/ onto the shared-model-cache PVC:

$kubectl run pvc-helper -n ${NAMESPACE} \
> --image=busybox:1.36 --restart=Never \
> --overrides='{"spec":{"containers":[{"name":"helper","image":"busybox:1.36","command":["sleep","3600"],"volumeMounts":[{"name":"model-cache","mountPath":"/opt/models"}]}],"volumes":[{"name":"model-cache","persistentVolumeClaim":{"claimName":"shared-model-cache"}}]}}' \
> --command -- sleep 3600
$
$kubectl exec -n ${NAMESPACE} pvc-helper -- mkdir -p /opt/models/traces
$kubectl cp recipes/nemotron-3-ultra/perf/traces/. ${NAMESPACE}/pvc-helper:/opt/models/traces/

Set ENDPOINT to ultra-agg-b200-chat-mtp-frontend:8000 (the Job default) with the chat trace at concurrency 18, then apply. The Job wraps this AIPerf raw Moontrace replay:

$aiperf profile \
> -m nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
> --tokenizer /opt/models/patched/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
> --tokenizer-trust-remote-code \
> --input-file /opt/models/traces/nim_turbo_8k_1k_70kv_chat_new_noschedule_short_15perc.jsonl \
> --custom-dataset-type mooncake_trace \
> --prompt-input-tokens-block-size 512 \
> --url http://ultra-agg-b200-chat-mtp-frontend:8000 \
> --streaming --use-server-token-count \
> --extra-inputs ignore_eos:true \
> --concurrency 18 --random-seed 42

Set ENDPOINT to ultra-agg-b200-agentic-mtp-frontend:8000 with the agentic trace at concurrency 20, then apply. The Job wraps this AIPerf raw Moontrace replay:

$aiperf profile \
> -m nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
> --tokenizer /opt/models/patched/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
> --tokenizer-trust-remote-code \
> --input-file /opt/models/traces/nim_turbo_64k_400_90kv_agent_new_noschedule_short_15perc.jsonl \
> --custom-dataset-type mooncake_trace \
> --prompt-input-tokens-block-size 512 \
> --url http://ultra-agg-b200-agentic-mtp-frontend:8000 \
> --streaming --use-server-token-count \
> --extra-inputs ignore_eos:true \
> --concurrency 20 --random-seed 42

For the disaggregated fallback, point ENDPOINT at ultra-disagg-b200-1p1d-agentic-nomtp-frontend:8000 with the same agentic trace at concurrency 32.

Set ENDPOINT to ultra-agg-h200-chat-mtp-frontend:8000 with the chat trace at concurrency 10, then apply. The Job wraps this AIPerf raw Moontrace replay:

$aiperf profile \
> -m nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
> --tokenizer /opt/models/patched/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
> --tokenizer-trust-remote-code \
> --input-file /opt/models/traces/nim_turbo_8k_1k_70kv_chat_new_noschedule_short_15perc.jsonl \
> --custom-dataset-type mooncake_trace \
> --prompt-input-tokens-block-size 512 \
> --url http://ultra-agg-h200-chat-mtp-frontend:8000 \
> --streaming --use-server-token-count \
> --extra-inputs ignore_eos:true \
> --concurrency 10 --random-seed 42

Set ENDPOINT to ultra-agg-h200-agentic-mtp-frontend:8000 with the agentic trace at concurrency 8, then apply. The Job wraps this AIPerf raw Moontrace replay:

$aiperf profile \
> -m nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
> --tokenizer /opt/models/patched/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
> --tokenizer-trust-remote-code \
> --input-file /opt/models/traces/nim_turbo_64k_400_90kv_agent_new_noschedule_short_15perc.jsonl \
> --custom-dataset-type mooncake_trace \
> --prompt-input-tokens-block-size 512 \
> --url http://ultra-agg-h200-agentic-mtp-frontend:8000 \
> --streaming --use-server-token-count \
> --extra-inputs ignore_eos:true \
> --concurrency 8 --random-seed 42
$kubectl apply -f recipes/nemotron-3-ultra/perf/perf.yaml -n ${NAMESPACE}
$kubectl logs -n ${NAMESPACE} -l job-name=ultra-bench -f
$kubectl wait --for=condition=Complete job/ultra-bench -n ${NAMESPACE} --timeout=7200s

Artifacts land on the PVC under /opt/models/perf/<epoch>_ultra-bench/. 15% and 30% prefix-slice traces are provided for shorter runs. For concurrency sweeps, delete the worker pods between runs so residual KV/prefix-cache state does not skew results — see the benchmark README for the full workflow, artifact layout, and tunable environment variables.

Expected Performance

Each target is tuned for its workload shape:

WorkloadMedian ISLMedian OSLKV cache hit rate
Chat8K1K70%
Agentic64K40090%

B200 rows use 15% raw Moontrace replay with raw_direct_no_filter trace semantics; H200 rows use 300-sample replay evidence. User output tok/s is Gen TPS/user p50 from AIPerf; System output tok/s/GPU is TPS/GPU. Your selected target’s rows are highlighted:

RecipeGPUTopologyWorkloadMTPConcurrencyUser output tok/sSystem output tok/s/GPU
vllm/agg-b200-chat-mtp/deploy.yamlB200AGGchatyes1852.0201.4
vllm/agg-b200-chat-nomtp/deploy.yamlB200AGGchatno1651.0181.3
vllm/agg-b200-agentic-mtp/deploy.yamlB200AGGagenticyes2080.6310.8
vllm/agg-b200-agentic-nomtp/deploy.yamlB200AGGagenticno899.5175.9
vllm/disagg-b200-agentic/deploy.yamlB2001P1Dagenticno3261.6231.1
vllm/agg-h200-chat-mtp/deploy.yamlH200AGGchatyes1058.746.8
vllm/agg-h200-chat-nomtp/deploy.yamlH200AGGchatno854.243.0
vllm/agg-h200-agentic-mtp/deploy.yamlH200AGGagenticyes853.227.4
vllm/agg-h200-agentic-nomtp/deploy.yamlH200AGGagenticno852.326.5

Treat each row together with its matching recipe, image, trace, and server-shape artifacts.

Compare All Targets

All targets serve nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 on the dedicated dev runtime image vllm-runtime:1.3.0-nemotron-ultra-dev.1 with KV-aware routing and a 262144 max model length:

B200 chatH200 chatB200 agenticH200 agenticB200 disagg agentic
GPUs4x B2008x H2004x B2008x H2004x B200 prefill + 4x B200 decode
Modeaggregatedaggregatedaggregatedaggregateddisaggregated 1P1D
ParallelismTP4 + EPTP8 + EPTP4 + EPTP8 + EPTP4 prefill + TP4 decode
Spec decodeMTP, 1 tokenMTP, 1 tokenMTP, 1 tokenMTP, 1 tokenno MTP
Max sequences6416243232
Reference concurrency181020832
WorkloadChat traceChat traceAgentic traceAgentic traceAgentic trace

Notes

  • This is a Day-0 recipe on a dedicated dev runtime image (vllm-runtime:1.3.0-nemotron-ultra-dev.1); it is functional and benchmarked but not yet promoted to a release runtime image.
  • The recipes pin VLLM_DISABLED_KERNELS=FlashInferFP8ScaledMMLinearKernel and pass --no-enable-flashinfer-autotune on vLLM workers. Do not remove these unless rerunning the benchmark qualification — they select the non-FlashInfer FP8 linear kernel path and avoid a measured vLLM 0.22 FlashInfer FP8 regression.
  • No-MTP fallback manifests are included for every aggregated target at vllm/agg-<sku>-<usecase>-nomtp/deploy.yaml; their DGD names carry the -nomtp suffix, and their measured rows appear in the performance table above.
  • Reasoning is controlled per request via chat_template_kwargs (enable_thinking, force_nonempty_content) and nvext.max_thinking_tokens. Do not send force_nonempty_content as a top-level request parameter. Top-level reasoning controls such as include_reasoning and reasoning_effort are part of shared Dynamo API compatibility work, not Ultra-specific failures.
  • Raw Moontrace replay may contain over-context or pathological long-generation rows. Preserve them as HTTP/error evidence rather than dropping them silently.
  • Tool calling uses the qwen3_coder parser; reasoning parsing uses the model-local ultra_v3_reasoning_parser.py (validated by the model-validate Job).

Source