NVIDIA NIM Operator#

About the Operator#

The NVIDIA NIM Operator enables Kubernetes cluster administrators to operate the software components and services necessary to deploy NVIDIA NIMs and NVIDIA NeMo microservices in Kubernetes.

NIM microservices deliver AI foundation models as accelerated inference microservices that are portable across data center, workstation, and cloud environments. The NIM Operator supports deploying NIM microservices in domains such as reasoning, retrieval, speech, and biology.

NeMo microservices allow you to customize, evaluate, and add guardrails to models.

The Operator can manage the lifecycle of the following microservices and the models they use:

  • NVIDIA NIM models, such as:

    • Reasoning LLMs

    • Retrieval — embedding, reranking, and other functions

    • Speech

    • Biology

  • NeMo core microservices:

    • NeMo Customizer

    • NeMo Evaluator

    • NeMo Guardrails

  • NeMo platform component microservices:

    • NeMo Data Store

    • NeMo Entity Store

Benefits of Using the Operator#

NIM and NeMo microservices are not typically deployed on their own. Instead, they are deployed and used together, along with other third-party dependencies, to complete AI workflows. For instance, multi-turn conversational AI in a RAG pipeline uses the LLM, embedding, and reranking NIM microservices. The deployment and lifecycle management of these microservices and their dependencies for production generative AI pipelines can lead to additional toil for machine learning engineers, LLM engineers, and Kubernetes cluster administrators.

The NIM Operator simplifies the operation and lifecycle management of NIM and NeMo microservices at scale, across the cluster, by providing custom resources that let you define your deployment requirements: cache models on your cluster, autoscale NIM microservices, and apply the Kubernetes operator pattern to the lifecycle management of your AI inference pipelines.

Model Caching#

One key benefit of the NIM Operator is its ability to pre-cache models and datasets. Models, their many available profiles, and training datasets are large and can take a long time to download, which adds startup latency to your inference services. In a cache custom resource, you select the models your AI workflows need by specifying NIM profiles and tags, or let the Operator auto-detect the best profile based on the GPUs available in the Kubernetes cluster. You can pre-cache models on any available node, whether CPU-only or GPU-accelerated, so you avoid long bootstrapping times and your models are ready when you need them.

Caching models also enables deployments in air-gapped environments by providing a place to store your models within your cluster.
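
As a sketch of how caching is expressed, the following NIMCache manifest pulls a model from NVIDIA NGC and stores it on a persistent volume claim. The model name, secret names, engine, and storage values are illustrative placeholders, and the field layout is an assumption based on the apps.nvidia.com API group; verify it against the NIMCache API reference for your Operator version.

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      # NIM container image used to download the model profiles from NGC.
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
      pullSecret: ngc-secret      # image pull secret for nvcr.io
      authSecret: ngc-api-secret  # secret holding the NGC API key
      model:
        engine: tensorrt_llm      # cache only TensorRT-LLM profiles
        tensorParallelism: "1"
  storage:
    pvc:
      create: true                # let the Operator create the PVC
      storageClass: ""            # empty string selects the cluster default
      size: 50Gi
      volumeAccessMode: ReadWriteMany
```

A ReadWriteMany volume lets multiple NIM service replicas mount the same cached model, which is what makes the shared-cache pattern described under NIM Operator Custom Resources possible.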

Logging and Monitoring#

You can collect and visualize latency and throughput metrics to optimize model performance at the NIM Operator level.
This includes metrics such as how many caches and how many NeMo services are deployed across different namespaces. Combining these metrics with the logs that the NIM Operator aggregates from model services makes it easier to assess the health of the services deployed on your cluster, debug issues, and perform regular audits of deployed services.
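
As described under NIM Operator Custom Resources below, a NIM service resource can also create a Prometheus service monitor for you. A minimal sketch of what enabling that might look like, assuming a metrics stanza along these lines (the field names and label are illustrative assumptions, not confirmed API):

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  # ...image, storage, and resource fields omitted for brevity...
  metrics:
    enabled: true
    serviceMonitor:
      # Label that lets an existing Prometheus Operator instance
      # discover and scrape this monitor.
      additionalLabels:
        release: kube-prometheus-stack
```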

Autoscaling#

The NIM Operator can also help to solve another common problem with inference services: autoscaling. You can scale the service handling requests to your cached models based on different metrics, such as NVIDIA DCGM GPU metrics or NIM-specific metrics.
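
Concretely, the NIM service resource can create a horizontal pod autoscaler for you (see NIM Operator Custom Resources below). A minimal sketch, assuming a scale stanza that embeds the standard HPA v2 fields; the metric name and target are hypothetical placeholders for whatever GPU or NIM metric your metrics pipeline exposes:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  # ...image, storage, and resource fields omitted for brevity...
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 4
      metrics:
        - type: Pods
          pods:
            metric:
              # Hypothetical per-pod NIM metric exposed through a
              # Prometheus adapter.
              name: gpu_cache_usage_perc
            target:
              type: AverageValue
              averageValue: "0.75"
```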

NIM Operator Custom Resources#

The Operator uses the following custom resources:

  • nimcaches.apps.nvidia.com

    This custom resource enables downloading models from NVIDIA NGC and persisting them on network storage. When multiple instances of the same NIM microservice start, they can share a single cached model, which improves startup performance. Caching is optional. Without caching, each NIM microservice instance downloads its own copy of the model when it starts.

  • nimservices.apps.nvidia.com

    This custom resource represents a NIM microservice. Creating or updating a NIM service resource creates or updates a Kubernetes deployment for the microservice in a namespace.

    The custom resource supports using a model from an existing NIM cache resource or a persistent volume claim, as shown in the sketch after this list.

    The custom resource also supports creating a horizontal pod autoscaler, ingress, and service monitor to simplify cluster administration.

  • nimpipelines.apps.nvidia.com

    This custom resource represents a group of NIM service custom resources.

  • nimbuilds.apps.nvidia.com

    This custom resource generates optimized TensorRT-LLM (TRT-LLM) engine builds based on predefined model profiles within a given LLM NIM. Because TRT-LLM builds are compute and memory intensive, this custom resource allows users to prebuild required engines on desired GPUs. This approach improves startup times and reduces overall resource usage during final NIM deployments and autoscaling, making deployments more predictable.

  • nemodatastores.apps.nvidia.com, nemoentitystores.apps.nvidia.com, nemocustomizers.apps.nvidia.com, nemoevaluators.apps.nvidia.com, nemoguardrails.apps.nvidia.com

    These custom resources represent NeMo platform components that provide a flexible foundation for building AI workflows on your Kubernetes cluster, on-premises or in the cloud.
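
To illustrate how these resources fit together, the following NIMService sketch deploys a NIM microservice that mounts the model from the NIMCache example shown earlier, rather than downloading its own copy. As before, the image tag, secret names, and GPU count are placeholders, and the field layout is an assumption to verify against the NIMService API reference.

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: "1.0.3"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      # Mount the model cached by the NIMCache resource instead of
      # downloading a copy on every pod start.
      name: meta-llama3-8b-instruct
      profile: ""   # empty string uses the auto-detected profile
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
```

A NIM pipeline resource would then group several such services, for example an LLM plus embedding and reranking services for a RAG workflow.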

Sample Applications#

NVIDIA provides sample applications and tutorials to help you explore the NIM Operator and supported workflows.

Licenses#

The following table identifies the licenses for the software components related to the Operator.

| Component | Artifact Type | Artifact License | Source Code License |
|-----------|---------------|------------------|---------------------|
| NVIDIA NIM Operator | Helm Chart | NVIDIA AI Enterprise Software License Agreement | Apache 2 |
| NVIDIA NIM Operator | Container | NVIDIA AI Enterprise Software License Agreement | Apache 2 |
| NVIDIA NIM | Container | NVIDIA AI Enterprise Software License Agreement | None |
| NVIDIA NeMo Retriever Text Embedding NIM | Container | NVIDIA AI Enterprise Software License Agreement | None |
| NVIDIA NeMo Data Store | Container | NVIDIA AI Enterprise Software License Agreement | None |
| NVIDIA NeMo Entity Store | Container | NVIDIA AI Enterprise Software License Agreement | None |
| NVIDIA NeMo Guardrails | Container | NVIDIA AI Enterprise Software License Agreement | None |
| NVIDIA NeMo Evaluator | Container | NVIDIA AI Enterprise Software License Agreement | None |
| NVIDIA NeMo Customizer | Container | NVIDIA AI Enterprise Software License Agreement | None |

Third-Party Software#

The Chain Server that you can deploy with the sample pipeline uses third-party software. You can download the Third-Party Licenses.