Deployment options
Use this page to compare the ways you can run NeMo Retriever, including when to use NVIDIA-hosted NIMs versus self-hosting on your own infrastructure.
Compare deployment options
Use the sections below to pick documentation and deployment options that match your goal.
I want to run locally or embed the library
- Pre-Requisites & Support Matrix
- Use the Python API or Use the CLI: install and run the nemo_retriever package in your environment
I want a Kubernetes / Helm deployment
- Pre-Requisites & Support Matrix
- NeMo Retriever Helm chart (supported): Deploy (Helm chart), with sources in nemo_retriever/helm on GitHub
- Published Library Helm charts (supported): cluster install and upgrade procedures are covered in the NeMo Retriever Library; use them alongside the NeMo Retriever chart README for your release
- Environment variables and Troubleshoot as needed
Core NIMs for the default extraction pipeline (26.05): page_elements, table_structure, ocr, and vlm_embed (llama-nemotron-embed-vl-1b-v2:1.12.0). These four are auto-wired into the retriever service. Nemotron Parse, Nemotron 3 Nano Omni, the VL reranker, and Parakeet ASR are optional and not auto-wired. For a minimal GPU footprint, disable optional keys you do not need (see Recommended minimal install (26.05)). See Pre-Requisites & Support Matrix — Default Helm NIMs.
Docker Compose (unsupported, developer-only): Docker Compose for local development — not a substitute for Helm or the published Library charts.
For audio and video extraction in Kubernetes, set service.installFfmpeg=true
so the service container installs ffmpeg and ffprobe at startup. This
runtime install requires network egress to the package repositories, a
writable root filesystem, and a security policy that allows the image's
scoped sudo use. If your cluster blocks package installation at startup (for
example, in air-gapped environments), use a custom service image that already
contains ffmpeg and ffprobe, then set service.image.repository and
service.image.tag to point at it.
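The two audio/video paths above map to different Helm value overrides. The following is a minimal sketch of how those overrides could be assembled as `--set` arguments. The value keys come from this page. The release name, chart path, image repository, and image tag are hypothetical placeholders, not names taken from the chart.

```python
from typing import Optional


def ffmpeg_helm_args(custom_image: Optional[tuple[str, str]] = None) -> list[str]:
    """Return `--set` arguments that enable audio/video support.

    custom_image is a (repository, tag) pair for a service image that already
    ships ffmpeg and ffprobe. None means: rely on the startup install.
    """
    if custom_image is None:
        # Connected cluster: the service container installs ffmpeg at startup.
        values = {"service.installFfmpeg": "true"}
    else:
        # Restricted cluster: use the prebuilt image instead of a runtime install.
        repository, tag = custom_image
        values = {
            "service.image.repository": repository,
            "service.image.tag": tag,
        }
    args: list[str] = []
    for key, value in values.items():
        args += ["--set", f"{key}={value}"]
    return args


# Hypothetical release name, chart path, and image coordinates.
base = ["helm", "upgrade", "--install", "retriever", "./nemo-retriever-chart"]
print(" ".join(base + ffmpeg_helm_args()))
print(" ".join(base + ffmpeg_helm_args(
    ("registry.example.internal/retriever-service", "26.05-ffmpeg"))))
```

Keeping the two cases in one helper makes it harder to combine the startup install with a custom image by accident.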
I want examples and notebooks
I need API details and keys
- Get your API key
- API reference — PDF pre-splitting if applicable
I am tuning performance or cost
When to use NVIDIA-hosted NIMs
NVIDIA-hosted NIMs run inference on NVIDIA-managed infrastructure. You call models with API keys (refer to Get your API key) without operating GPU nodes yourself.
Consider hosted NIMs when:
- You want the fastest path to try models and iterate without installing drivers, containers, or the NIM Operator on your own clusters.
- Latency to NVIDIA endpoints works for your region and use case.
- Your compliance and data policies allow document or query content in the hosted service (confirm with your security review).
Also refer to: NVIDIA NIM catalog
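As a rough illustration of calling a hosted model with an API key, the sketch below builds an authenticated JSON request without sending it. The endpoint path, the payload shape, the `<model-id>` placeholder, and the `NVIDIA_API_KEY` environment variable name are assumptions made for illustration; confirm the real values in the API reference before use.

```python
import json
import os
import urllib.request


def build_hosted_request(url: str, body: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST to a hosted endpoint."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Assumed endpoint, payload, and environment variable name (see lead-in).
request = build_hosted_request(
    "https://integrate.api.nvidia.com/v1/embeddings",
    {"model": "<model-id>", "input": ["What is in this document?"]},
    os.environ.get("NVIDIA_API_KEY", ""),
)
# Send with urllib.request.urlopen(request) once the values are confirmed.
```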
When to self-host NIMs
Self-hosted NIMs run on your GPUs or air-gapped hardware, typically with Kubernetes and the NIM Operator.
Consider self-hosting when:
- You need an air gap, strict data residency, or customer data must not leave your network.
- You run at large scale where dedicated capacity can cost less than hosted API usage.
- You must meet latency or locality requirements that hosted regions cannot satisfy.
GPU sharing. The NIM Operator supports time-slicing and MIG so multiple NIM workloads can share GPUs. A NIM used with NeMo Retriever Library does not always need a full dedicated GPU when the operator and GPU profile are set correctly. For scheduling and GPU partitioning, refer to the NIM Operator documentation.
Air-gapped and disconnected deployment
The default document extraction pipeline (page elements, table structure, OCR, and VL embed) runs disconnected when you mirror images and models into a private registry and configure the NIM Operator for air-gapped environments.
On a staging host with internet access, pull from NGC, retag to your private registry, stage chart archives, then install in the enclave with registry overrides. Procedures, the 26.05 image inventory, and Helm value patterns are in Helm — Air-gapped deployment.
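The pull, retag, and push steps on the staging host can be sketched as a small command generator. This is an illustration only: the `nvcr.io/nim/nvidia/` path prefix and the private registry host are assumptions, and the real image list comes from the 26.05 image inventory in the Helm air-gapped deployment guide. The sketch also assumes every image reference starts with a registry host.

```python
def mirror_commands(images: list[str], private_registry: str) -> list[str]:
    """Build docker pull/tag/push commands that retag images into a private registry."""
    commands: list[str] = []
    for source in images:
        # Drop the source registry host; keep the repository path and tag.
        _, _, path = source.partition("/")
        target = f"{private_registry}/{path}"
        commands += [
            f"docker pull {source}",
            f"docker tag {source} {target}",
            f"docker push {target}",
        ]
    return commands


# Hypothetical source path and registry host; the image name and tag are from this page.
for line in mirror_commands(
    ["nvcr.io/nim/nvidia/llama-nemotron-embed-vl-1b-v2:1.12.0"],
    "registry.example.internal",
):
    print(line)
```

Preserving the repository path under the private registry keeps the Helm registry overrides simple, because only the host portion of each image reference changes.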
Audio and video extraction
Audio and video need ffmpeg and ffprobe on PATH. The bundled image omits them. Do not use service.installFfmpeg=true in an air gap (startup install needs package-repo egress). Build a custom service image on a connected staging host, mirror it, and set service.image.repository / service.image.tag. Skip this step if you do not use audio/video.
For offline image captioning, deploy the in-cluster Nemotron 3 Nano Omni NIM and point your pipeline caption endpoint at the in-cluster HTTP URL instead of integrate.api.nvidia.com or other hosted APIs.
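Before installing in the enclave, it can help to check that no configured endpoint still points at a hosted API. The sketch below is one way to do that, assuming the endpoints are available as plain URLs. The in-cluster service name is hypothetical, and the set of hosted hosts contains only the one named on this page.

```python
from urllib.parse import urlparse

# Hosted API hosts to reject in an air gap; extend as needed for your setup.
HOSTED_HOSTS = {"integrate.api.nvidia.com"}


def targets_hosted_api(url: str) -> bool:
    """Return True when the URL's host is a known hosted API host."""
    host = urlparse(url).hostname or ""
    return host in HOSTED_HOSTS


# Hypothetical in-cluster caption endpoint.
caption_url = "http://nano-omni.nim.svc.cluster.local:8000/v1"
if targets_hosted_api(caption_url):
    raise SystemExit(f"{caption_url} is a hosted endpoint; use the in-cluster URL")
```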
Related
- Deploy (Helm chart) (nemo_retriever/helm on GitHub): air-gapped deployment
- NeMo Retriever Library: prerequisites / deployment (supported Helm handoff)
- Pre-Requisites & Support Matrix
- Audio and video
- Docker Compose (unsupported): docker.md — local developer tooling only