Nemotron 3 Nano Omni — Inference & Deployment#
How to deploy and serve Nemotron 3 Nano Omni after training (or against the public release checkpoint). For training instructions see the recipe overview; for architectural background see architecture.md.
TL;DR#
If you want… |
Use |
|---|---|
Drop-in agent endpoint |
NIM at |
Production high-throughput serving |
TensorRT-LLM (Hopper / Blackwell) with NVFP4 |
Continuous batching + streaming |
vLLM |
Multi-agent + tool-calling, lightweight |
SGLang |
Disaggregated serving with intelligent routing |
Dynamo |
Local laptop inference |
Ollama or |
Local desktop UI |
LM Studio |
Managed cloud |
Amazon SageMaker JumpStart, Oracle Cloud, Microsoft Azure (coming soon), Dell on-premises |
Privacy-first video processing |
NemoClaw sandbox (sandboxed video analysis with policy-bounded output) |
Inference engines#
Engine |
Strengths |
Hardware |
Quantization |
|---|---|---|---|
vLLM |
Continuous batching, streaming, broad ecosystem |
Ampere / Hopper / Blackwell |
FP8, NVFP4 |
SGLang |
Lightweight, strong multi-agent + tool-calling story |
Ampere / Hopper / Blackwell |
FP8, NVFP4 |
NVIDIA TensorRT-LLM |
Lowest-latency production serving, latent MoE kernels |
Hopper / Blackwell |
FP8, NVFP4 |
Dynamo |
Disaggregated serving, intelligent routing, multi-tier KV caching |
Hopper / Blackwell |
FP8, NVFP4 |
Pyxis-mounting the trained omni3-sft.sqsh from this repo’s training pipeline produces a container compatible with all four engines — see the engine documentation for serving entry points. For the open release checkpoint, pull nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 from Hugging Face directly and let the engine handle weight loading + (optional) on-the-fly quantization.
Quantization#
The model supports two quantization paths out of the box:
FP8 — works on Hopper and Blackwell; ~2× memory reduction over BF16 with minimal quality loss for this model class
NVFP4 — Blackwell-only; produces the highest reported throughput among open omnimodal models per the release blog
Quantization is applied at the inference-engine level, not as separately published HF checkpoints. All four engines above support both formats. For Blackwell + NVFP4 specifically, the release blog reports the highest throughput of any open omnimodal model in this class.
Hardware#
Family |
Supported |
Notes |
|---|---|---|
Ampere (A100, A40, …) |
✅ |
BF16 / FP8; older GPUs lack NVFP4 hardware |
Hopper (H100, H200) |
✅ |
All paths supported; FP8 most efficient |
Blackwell (B100, B200, GB200) |
✅ |
NVFP4 unlocks peak throughput |
Cloud platforms#
Platform |
Status |
Path |
|---|---|---|
Amazon SageMaker JumpStart |
Available |
Search for |
Oracle Cloud |
Available |
OCI AI services |
Microsoft Azure |
Coming soon (per release blog) |
Azure AI catalog |
Dell Technologies |
Available (on-premises / hybrid) |
Dell AI Factory |
Inference providers#
The model is hosted by:
Baseten, Clarifai, DeepInfra, Fireworks AI, Together AI, Vultr, Bitdeer, Crusoe, DigitalOcean, Lightning AI, Nebius
These provide managed endpoints if you don’t want to run inference yourself.
Local / edge#
Format |
Tool |
Notes |
|---|---|---|
GGUF |
|
Quantized weights for CPU / Apple Silicon / consumer GPU |
GGUF |
Ollama |
Wraps |
Native |
LM Studio |
Desktop UI for local inference |
Inference Snaps |
NVIDIA Inference Snaps |
Pre-packaged endpoints for edge deployment |
These are best for local development, offline analysis, or edge inference where latency and privacy matter more than peak throughput.
NemoClaw — privacy-first video processing#
For workloads where video frames can’t leave a sandboxed environment (compliance, healthcare, classified content), the model integrates with NemoClaw, NVIDIA’s sandboxed video-processing layer. The sandbox runs the perception model, then exports only policy-bounded outputs (transcripts, summaries, structured tags) — raw frames stay inside.
This is a distinguishing capability for the model’s “perception sub-agent” framing: a downstream agent can invoke the omni model on private video content and get usable summaries back without ever handling raw frames itself.
Tuning notes#
Per-user interactivity: the release blog reports throughput numbers at a fixed per-user tokens-per-second floor; aggregate throughput scales by holding user-level interactivity constant and adding concurrency.
EVS configuration: for video-heavy workloads, the EVS layer’s frame-compression ratio is the most important throughput knob (see
architecture.md§Video pipeline for the why). Lower compression → higher quality, lower throughput; tune per workload.MoE expert dispatch: the 30B-A3B router is sensitive to batch composition. For peak throughput, group requests with similar modality mix.
Deployment from this recipe’s training output#
The training pipeline in this repo produces ${build_cache_dir}/containers/omni3-sft.sqsh, a squashfs container that pyxis mounts directly. To use it for inference:
# Verify the squashfs is in place
ls -lh ${build_cache_dir}/containers/omni3-sft.sqsh
# Submit an inference job (example with vLLM)
srun --container-image=${build_cache_dir}/containers/omni3-sft.sqsh \
--container-mounts=/lustre:/lustre,/path/to/checkpoint:/checkpoint \
bash -lc 'vllm serve /checkpoint --tensor-parallel-size 1 --gpu-memory-utilization 0.9'
For benchmark evaluation against a deployed checkpoint, run nemotron omni3 model eval (a dedicated nemotron omni3 eval stage will land in a follow-up release).
References#
Release blog — canonical source for the throughput claims, provider list, and deployment paths
architecture.md— the why behind the inference characteristics