NVIDIA NIM Operator
About the Operator
NVIDIA NIM Operator enables cluster administrators to operate the software components and services that are necessary to run LLM, embedding, and other models with NVIDIA NIM microservices in Kubernetes.
The Operator manages the life cycle of the following microservices and the models they use:
NVIDIA NIM for LLMs
NeMo Retriever Text Embedding NIM
NeMo Retriever Text Reranking NIM
NVIDIA provides a sample multi-turn RAG pipeline. The pipeline deploys a chat bot web application and a chain server. The chain server communicates with the NIM microservices and a vector database.
Benefits of Using the Operator
Using the NIM Operator simplifies the operation and lifecycle management of NIM microservices at scale and at the cluster level. Custom resources simplify the deployment and lifecycle management of multiple AI inference pipelines, such as retrieval-augmented generation (RAG) pipelines and deployments of multiple LLMs. Additionally, the NIM Operator supports caching models to reduce initial inference latency and to enable auto-scaling.
The Operator uses the following custom resources:
nimcaches.apps.nvidia.com
This custom resource enables downloading models from NVIDIA NGC and persisting them on network storage. One advantage of caching a model is that when multiple instances of the same NIM microservice start, they all use the single cached model. Caching is optional, however: without caching, each NIM microservice instance downloads its own copy of the model when it starts.
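The following manifest is a minimal sketch of a NIM cache resource that downloads a model from NGC and stores it on a shared persistent volume claim. The values, such as the model puller image, the secret names, and the storage class and size, are illustrative placeholders, and field names can vary between Operator versions; refer to the NIM cache API reference for the exact schema.

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      # Image that pulls the model profiles from NVIDIA NGC (illustrative tag).
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
  storage:
    # Persist the model on shared network storage so that multiple
    # NIM service replicas can mount the same cache.
    pvc:
      create: true
      storageClass: nfs-client
      size: "50Gi"
      volumeAccessMode: ReadWriteMany
```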
nimservices.apps.nvidia.com
This custom resource represents a NIM microservice. Adding or updating a NIM service resource creates a Kubernetes deployment for the microservice in a namespace.
The custom resource supports using a model from an existing NIM cache resource or a persistent volume claim.
The custom resource also supports creating a horizontal pod autoscaler, ingress, and service monitor to simplify cluster administration.
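The following manifest is a minimal sketch of a NIM service resource that serves the model from the NIM cache resource shown earlier. The repository, tag, secret names, and resource requests are placeholder assumptions; consult the NIM service API reference for the fields that your Operator version supports.

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: "1.0.0"
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    # Serve the model from the existing NIM cache resource instead of
    # downloading a copy at startup.
    nimCache:
      name: meta-llama3-8b-instruct
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
```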
nimpipelines.apps.nvidia.com
This custom resource represents a group of NIM service custom resources.
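The following manifest is a minimal sketch of a NIM pipeline resource that groups two NIM service specifications, one for an LLM and one for a text embedding model. The service names, images, and field layout are illustrative assumptions and can differ between Operator versions.

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMPipeline
metadata:
  name: rag-pipeline
  namespace: nim-service
spec:
  # Each entry embeds a NIM service specification; setting enabled to
  # false removes that service without deleting the pipeline.
  services:
    - name: meta-llama3-8b-instruct
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/meta/llama3-8b-instruct
          tag: "1.0.0"
          pullSecrets:
            - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: meta-llama3-8b-instruct
        replicas: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
    - name: nv-embedqa-e5-v5
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/nv-embedqa-e5-v5
          tag: "1.0.0"
          pullSecrets:
            - ngc-secret
        authSecret: ngc-api-secret
        replicas: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
```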
Limitations
The Operator has the following limitations:
If all the GPUs in your cluster are allocated and you change a custom resource in a way that requires starting a new pod that needs access to a GPU device, the new pod becomes stuck in the Pending state because Kubernetes cannot schedule it to a node with an allocatable GPU resource.
Licenses
The following table identifies the licenses for the software components related to the Operator.
| Component | Artifact Type | Artifact Licenses | Source Code License |
|---|---|---|---|
| NVIDIA NIM Operator | Helm Chart | | |
| NVIDIA NIM Operator | Image | | |
| NVIDIA NIM for LLMs | Container | | None |
| NVIDIA NeMo Retriever Text Embedding NIM | Container | | None |
| NVIDIA NeMo Retriever Text Reranking NIM | Container | | None |
The Operator source code is accessible to early-access participants only.
Third Party Software
The Chain Server that you can deploy with the sample pipeline uses third party software.
You can download the Third Party Licenses.