Nemotron 3 Super β€” Advanced Deployment Guide#

Architecture Considerations#

Three properties of Nemotron 3 Super that directly affect inference configuration:

LatentMoE β€” Expert computation happens in a compressed latent dimension (d=4096 β†’ β„“=1024). All-to-all routing traffic is reduced ~4Γ— vs a standard MoE, which matters significantly for EP across NVLinks. Expert parallelism (--enable-expert-parallel / --ep) is strongly preferred over pure TP for this architecture.

MTP (Multi-Token Prediction) β€” One MTP layer is baked into the checkpoint. This layer functions as a tail augmented draft model (similar to Eagle or other MTP heads) for speculative decoding. Unlike external draft models, additional KV cache and latency overhead is minimal as there is only a single layer called per predicted token.

Mamba-2 Hybrid β€” SSM state cache (mamba_ssm_cache) is distinct from the KV cache. Use float32 for all checkpoint precisions.

Pinned Versions#

Framework

Pinned Version / Image

vLLM

0.17.1

SGLang

lmsysorg/sglang:v0.5.9

TRT-LLM

nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5

vLLM#

Install#

pip install vllm==0.17.1

Baseline Serve Command (4Γ— GB200, 8k/64k)#

VLLM_FLASHINFER_MOE_BACKEND=latency \
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_USE_FLASHINFER_MOE_FP8=1 \
vllm serve $MODEL_CKPT \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --enable-expert-parallel \
  --max-cudagraph-capture-size 512 \
  --reasoning-parser-plugin super_v3_reasoning_parser.py \
  --reasoning-parser super_v3

Env Vars#

Variable

Value

Effect

VLLM_FLASHINFER_MOE_BACKEND

latency

TRT-LLM Gen kernels β€” optimal for online/latency-bound serving. Use throughput (CUTLASS) for offline batch jobs.

VLLM_USE_FLASHINFER_MOE_FP4

1

Enables NVFP4 MoE kernels. Blackwell only.

VLLM_USE_FLASHINFER_MOE_FP8

1

Enables FP8 MoE kernels.

VLLM_FLASHINFER_ALLREDUCE_BACKEND

trtllm

Fixes allreduce on certain topologies. Fixed upstream in #35793.

MTP (Speculative Decoding)#

Add to the base command:

  --speculative-config '{"method": "nemotron_h_mtp", "num_speculative_tokens": 5}'

Optional Flags Reference#

# Triton attention backend β€” required on some configurations. Fixed upstream in vllm#35219.
--attention-backend TRITON_ATTN

# Reduce CUDA graph memory if headroom is tight (default 512)
--max-cudagraph-capture-size 256

# For unconstrained memory: prefer chunked prefill off
--no-enable-chunked-prefill        # required for vLLM <= 0.15.0 due to accuracy bug

# SSM cache precision β€” float32 for all checkpoint precisions
--mamba-ssm-cache-dtype float32

# Cap context length for fair benchmarking or memory control
--max-model-len 65536

SGLang#

Docker Pull#

docker pull lmsysorg/sglang:v0.5.9

Baseline Serve Command (4Γ— GB200, 8k/64k)#

python3 -m sglang.launch_server \
  --model nvidia/NVIDIA-Nemotron-3-Super \
  --trust-remote-code \
  --tp 4 \
  --ep 4 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nano_v3 \
  --chunked-prefill-size 8192

MTP (Speculative Decoding)#

  --speculative-algorithm EAGLE \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 5

NVFP4/FP8 + MTP on v0.5.9: MTP does not work with NVFP4/FP8 on the v0.5.9 image. The fix is merged into SGLang main. Use the nightly image nightly-dev-20260310-0fd9a57d and add --disable-radix-cache.

SSM Cache#

--mamba-ssm-dtype float32    # use for all checkpoint precisions

Not required if baked into the checkpoint (which it is for released Nemotron 3 Super checkpoints).


TensorRT-LLM#

Docker Pull#

Requires branch build: These configs depend on changes not yet merged into the 1.3.0rc7 release image. Build TRT-LLM from main branch before using these configs.

TRT-LLM requires an extra_llm_api_options YAML for MoE backend, KV cache, and CUDA graph settings that can’t be passed as CLI flags.

Config A β€” NVFP4, 2Γ— B200 (TEP2, Latency-Optimized)#

Optimal for a 2-GPU B200 node running NVFP4 with MTP enabled.

y.yaml

trust_remote_code: true
kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.8
  mamba_ssm_cache_dtype: float32
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
  allow_advanced_sampling: true
cuda_graph_config:
  max_batch_size: 16
  enable_padding: true
moe_config:
  backend: TRTLLM
stream_interval: 1
enable_chunked_prefill: true

Serve command

mpirun -n 1 --allow-run-as-root --oversubscribe \
  trtllm-serve /data/super_fp4/ \
    --host 0.0.0.0 \
    --port 8000 \
    --max_batch_size 16 \
    --tp_size 2 \
    --ep_size 2 \
    --max_num_tokens 8192 \
    --max_seq_len 262144 \
    --extra_llm_api_options y.yaml

Config rationale

  • max_batch_size: 16 β€” Conservative for 2 GPUs. Balances MTP draft acceptance overhead vs. throughput.

  • tp_size 2 / ep_size 2 β€” Full EP across both GPUs. On LatentMoE, EP reduces all-to-all by ~4Γ— vs TP at the same GPU count.

  • mamba_ssm_cache_dtype: float32 β€” Use float32 for all checkpoint precisions.

  • enable_block_reuse: false β€” Mamba recurrent state is not prefix-cacheable; block reuse has no benefit here.

  • num_nextn_predict_layers: 3 β€” Drives MTP with 3 speculative draft steps. Average acceptance length is 3.45 on SPEED-Bench at draft length 7.

  • allow_advanced_sampling: true β€” Required for MTP sampler compatibility.

  • enable_chunked_prefill: true β€” Reduces inter-token latency on long prompts by interleaving prefill and decode steps.

Config B β€” NVFP4, 8Γ— B200 (DEP8, Throughput-Optimized)#

Optimal for a full 8-GPU B200 node (DGX B200) serving NVFP4.

y.yaml

kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.8
moe_config:
   backend: TRTLLM
cuda_graph_config:
    enable_padding: true
    max_batch_size: 256
enable_attention_dp: true
num_postprocess_workers: 4
enable_chunked_prefill: true
stream_interval: 10

Additionally, TRT-LLM supports using a quantized mamba cache with stochastic rounding to improve throughput. Extend the kv_cache_config with the following info.

kv_cache_config:
  mamba_ssm_cache_dtype: float16
  mamba_ssm_stochastic_rounding: true
  mamba_ssm_philox_rounds: 5

Serve command

mpirun -n 1 --allow-run-as-root --oversubscribe \
  trtllm-serve /data/super_fp4/ \
  --host 0.0.0.0 \
  --port 8000 \
  --max_batch_size 256 \
  --tp_size 8 --ep_size 8 \
  --max_num_tokens 8192 \
  --trust_remote_code \
  --reasoning_parser nano-v3 \
  --tool_parser qwen3_coder \
  --extra_llm_api_options y.yaml

Config C β€” NVFP4, DGX Spark#

Config to deploy the model on 1x DGX Spark.

kv_cache_config:
  enable_block_reuse: false
cuda_graph_config:
  max_batch_size: 32
  enable_padding: true
moe_config:
  backend: CUTLASS
EOF

Serve command

trtllm-serve /data/super_fp4/ \
  --host 0.0.0.0 \
  --port 8000 \
  --max_batch_size 4 \
  --trust_remote_code \
  --reasoning_parser nano-v3 \
  --tool_parser qwen3_coder \
  --extra_llm_api_options y.yaml

Updated reasoning parser#

To use the force_nonempty_content kwarg in the chat template, build TRT-LLM from main. Alternatively, the changes from PR-12061 can be manually cherry-picked into the release container to enable it.

Contributors:#

The configurations in this document were created by:

Izzy Putterman, Nave Assaf, Joyjit Daw, and many other talented NVIDIA engineers.