Managing NIM Services#

About NIM Services#

A NIM service is a Kubernetes custom resource, nimservices.apps.nvidia.com. The NIMService spec is the same for non-LLM NIM, LLM-specific NIM, and Multi-LLM NIM deployments, except for the image repository name, as shown in the following examples. You create and delete NIM service resources to manage NVIDIA NIM microservices.

Refer to the following sample manifests for examples of different deployment scenarios. For full information about the available fields, see the NIM Service Configuration Reference.

# NIMService for Non-LLM
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nv-rerankqa-mistral-4b-v3
spec:
  image:
    repository: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3
    tag: 1.0.2
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: nv-rerankqa-mistral-4b-v3
      profile: ''
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

# NIMService for LLM-specific
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-1-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: "1.3.3"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: llama-3-1-8b-instruct
      profile: ''
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

# NIMService for Multi-LLM NGC NIMCache
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: ngc-nim-service-multi
spec:
  image:
    repository: nvcr.io/nim/nvidia/llm-nim
    tag: "1.11.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: ngc-nim-cache-multi
      profile: tensorrt_llm
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

NIM Service Configuration Reference#

Refer to the following table for information about the available NIM Service fields:

Field

Description

Default Value

spec.affinity

Defines rules for pod scheduling based on node or pod characteristics.

Supports the following subfields:

  • nodeAffinity: Specifies node selection constraints.

  • podAffinity: Ensures pods are scheduled on nodes with specific other pods.

  • podAntiAffinity: Ensures pods are not scheduled on nodes with specific other pods.

Refer to the Kubernetes documentation on affinity and anti-affinity for detailed configuration options.
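
For example, a minimal sketch that schedules pods only on nodes with a specific GPU model; the label key and value are illustrative and depend on the labels present in your cluster:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.product      # illustrative node label
              operator: In
              values:
                - NVIDIA-H100-80GB-HBM3        # illustrative value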

None

spec.annotations

Specifies annotations to set on the Kubernetes resources that are created for a NIMService.

None

spec.authSecret (required)

Specifies the name of a generic secret that contains NGC_API_KEY. Learn more about image pull secrets.

None

spec.args

Specifies the arguments to pass to the command in NIM Service.

None

spec.command

Specifies the command to run in the NIM Service.

None

spec.draResources

DRAResources is the list of DRA resource claims to be used for the NIMService deployment or leader worker set.

Refer to the NIM Operator DRA guide for more information.
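
For example, a minimal sketch that references a pre-created ResourceClaimTemplate; the template name and request name are assumptions for illustration:

draResources:
  - resourceClaimTemplateName: single-gpu-claim-template   # assumed, pre-created template
    requests:
      - gpu                                                # assumed request name in the template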

None

spec.draResources.claimCreationSpec

ClaimCreationSpec is the spec to auto-generate a DRA resource claim template.

None

spec.draResources.claimCreationSpec
.devices.attributeSelectors

Defines the criteria which must be satisfied by the device attributes of a device.

None

spec.draResources.claimCreationSpec
.devices.attributeSelectors.key

Specifies the name of the device attribute. This is either a qualified name or a simple name. If it is a simple name, then it is assumed to be prefixed with the DRA driver name.

For example, “gpu.nvidia.com/productName” is equivalent to “productName” if the driver name is “gpu.nvidia.com”. Otherwise, they are treated as two different attributes.

None

spec.draResources.claimCreationSpec
.devices.attributeSelectors.op

Specifies the operator to use for comparing the device attribute value. Supported operators are:

  • Equal: The device attribute value must be equal to the value specified in the selector.

  • NotEqual: The device attribute value must not be equal to the value specified in the selector.

  • GreaterThan: The device attribute value must be greater than the value specified in the selector.

  • GreaterThanOrEqual: The device attribute value must be greater than or equal to the value specified in the selector.

  • LessThan: The device attribute value must be less than the value specified in the selector.

  • LessThanOrEqual: The device attribute value must be less than or equal to the value specified in the selector.

Equal

spec.draResources.claimCreationSpec
.devices.attributeSelectors.value

Specifies the value to compare against the device attribute.

None

spec.draResources.claimCreationSpec
.devices.capacitySelectors

Defines the criteria that must be satisfied by the device capacity of a device.

None

spec.draResources.claimCreationSpec
.devices.capacitySelectors.key

Specifies the name of the resource. This is either a qualified name or a simple name. If it is a simple name, then it is assumed to be prefixed with the DRA driver name.

For example, “gpu.nvidia.com/memory” is equivalent to “memory” if the driver name is “gpu.nvidia.com”. Otherwise, they are treated as two different attributes.

None

spec.draResources.claimCreationSpec
.devices.capacitySelectors.op

Specifies the operator to use for comparing against the device capacity. Supported operators are:

  • Equal: The resource quantity value must be equal to the value specified in the selector.

  • NotEqual: The resource quantity value must not be equal to the value specified in the selector.

  • GreaterThan: The resource quantity value must be greater than the value specified in the selector.

  • GreaterThanOrEqual: The resource quantity value must be greater than or equal to the value specified in the selector.

  • LessThan: The resource quantity value must be less than the value specified in the selector.

  • LessThanOrEqual: The resource quantity value must be less than or equal to the value specified in the selector.

Equal

spec.draResources.claimCreationSpec
.devices.capacitySelectors.value

Specifies the resource quantity to compare against.

None

spec.draResources.claimCreationSpec
.devices.celExpressions

Specifies a list of CEL expressions that must be satisfied by the DRA device.

None

spec.draResources.claimCreationSpec
.devices.count

Specifies the number of devices to request.

1

spec.draResources.claimCreationSpec
.devices.deviceClassName

Specifies the DeviceClass to inherit configuration and selectors from.

gpu.nvidia.com

spec.draResources.claimCreationSpec
.devices.driverName

Specifies the name of the DRA driver providing the capacity information.

Must be a DNS subdomain.

gpu.nvidia.com

spec.draResources.claimCreationSpec
.devices.name

Specifies the name of the device request to use in the generated claim spec.

Must be a valid DNS_LABEL.

None

spec.draResources.requests

Requests is the list of requests in the referenced ResourceClaim/ResourceClaimTemplate to be made available to the model container of the NIMService pods.

If empty, everything from the claim is made available, otherwise only the result of this subset of requests.

None

spec.draResources.resourceClaimName

ResourceClaimName is the name of a ResourceClaim object in the same namespace as the NIMService.

Exactly one of ResourceClaimName and ResourceClaimTemplateName must be set.

None

spec.draResources.resourceClaimTemplateName

ResourceClaimTemplateName is the name of a ResourceClaimTemplate object in the same namespace as the pods for this NIMService.

The template will be used to create a new ResourceClaim, which will be bound to the pods created for this NIMService.

Exactly one of ResourceClaimName and ResourceClaimTemplateName must be set.

None

spec.env

Specifies environment variables to set in the NIM microservice container.
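
For example, a minimal sketch; the variable names and values are illustrative:

env:
  - name: NIM_LOG_LEVEL        # illustrative variable
    value: "INFO"
  - name: OMP_NUM_THREADS
    value: "8"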

None

spec.initContainers

List of init containers to run before the main NIM container in the pod. Each item specifies name, image, and optionally command, args, env, and workingDir. Refer to the Kubernetes documentation on init containers for more details.
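
For example, a minimal sketch of an init container that waits for another service before the NIM container starts; the container name, image, and target service are illustrative:

initContainers:
  - name: wait-for-upstream            # illustrative
    image: busybox:1.36
    command: ["sh", "-c"]
    args: ["until nslookup my-upstream-service; do sleep 2; done"]   # placeholder service name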

None

spec.sidecarContainers

List of sidecar containers to run alongside the main NIM container in the pod. Each item specifies name, image, and optionally command, args, env, and workingDir. Refer to the Kubernetes documentation on sidecar containers for more details.
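
For example, a minimal sketch of a sidecar that runs alongside the NIM container; the container name and image are illustrative:

sidecarContainers:
  - name: log-forwarder                # illustrative
    image: fluent/fluent-bit:3.0       # illustrative image
    env:
      - name: LOG_LEVEL
        value: "info"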

None

spec.expose.ingress.enabled (Deprecated)

Deprecated: use spec.expose.router to configure ingress instead.

When set to true, the Operator creates a Kubernetes Ingress resource for the NIM microservice. Specify the ingress specification in the spec.expose.ingress.spec field.

If you have an ingress controller, values like the following sample configure an ingress for the v1/chat/completions endpoint.

ingress:
  enabled: true
  spec:
    ingressClassName: nginx
    rules:
      - host: demo.nvidia.example.com
        http:
          paths:
          - backend:
              service:
                name: meta-llama3-8b-instruct
                port:
                  number: 8000
            path: /v1/chat/completions
            pathType: Prefix

false

spec.expose.router

Configures routing for the NIM service using either Ingress or Gateway API resources.

Use this field to expose the NIM service through an ingress controller or gateway. The router automatically generates hostnames in the format <nimServiceName>.<namespace>.<hostDomainName>.

Refer to the router subfields below for configuration options.

None

spec.expose.router.annotations

Specifies annotations to add to the router resources (Ingress, HTTPRoute, or GRPCRoute).

Common use cases include configuring ingress-specific settings such as SSL redirects, rate limiting, or gateway metadata.

Example for NGINX Ingress:

annotations:
  nginx.ingress.kubernetes.io/ssl-redirect: "true"
  nginx.ingress.kubernetes.io/force-ssl-redirect: "true"

None

spec.expose.router.gateway

Configures Gateway API routing for the NIM service. When specified, the Operator creates HTTPRoute and/or GRPCRoute resources that reference the specified gateway.

This field cannot be used at the same time as the spec.expose.router.ingress field.

Before using this field, Gateway API controller must be installed and a Gateway resource created.

router:
  gateway:
    namespace: nim-service
    name: istio-gateway
    httpRoutesEnabled: true
  hostDomainName: demo.nvidia.example.com

Refer to the gateway subfields below for configuration options.

None

spec.expose.router.gateway.grpcRoutesEnabled

When set to true, the Operator creates a GRPCRoute resource for the NIM service.

This field requires spec.expose.service.grpcPort to be specified, and the Gateway must support GRPCRoute resources.

Only use this field for NIMs running Triton gRPC Inference Server (non-LLM NIMs).

expose:
  service:
    type: ClusterIP
    port: 9000
    grpcPort: 50051
  router:
    gateway:
      namespace: nemo
      name: istio-gateway
      grpcRoutesEnabled: true
    hostDomainName: demo.nvidia.example.com

false

spec.expose.router.gateway.httpRoutesEnabled

When set to true, the Operator creates an HTTPRoute resource for the NIM service.

The HTTPRoute matches requests with path prefix / and routes them to the NIM service. Your gateway must support HTTPRoute resources.

true

spec.expose.router.gateway.name

Specifies the name of the Gateway resource to attach the routes to. Required when using spec.expose.router.gateway.

The referenced Gateway must exist in the namespace specified by spec.expose.router.gateway.namespace.

None

spec.expose.router.gateway.namespace

Specifies the namespace where the Gateway resource is located. Required when using spec.expose.router.gateway.

The Gateway can be in a different namespace than the NIMService resource.

None

spec.expose.router.hostDomainName

Specifies the base domain name of the hostname matched by the router. Required when using spec.expose.router.

The Operator constructs the full hostname as: <nimServiceName>.<namespace>.<hostDomainName>

If hostDomainName: "example.com", NIMService name is llama-3, and namespace is nim-service, the generated hostname will be llama-3.nim-service.example.com.

None

spec.expose.router.ingress

Configures Ingress-based routing for the NIM service. When specified, the Operator creates a Kubernetes Ingress resource for the NIM microservice. Cannot be used with spec.expose.router.gateway.

router:
  ingress:
    ingressClass: nginx
  hostDomainName: demo.nvidia.example.com

Refer to the ingress subfields below for configuration options.

None

spec.expose.router.ingress.ingressClass

Specifies the ingress class name to use for the created Kubernetes Ingress resource. Required when using spec.expose.router.ingress.

None

spec.expose.router.ingress.tlsSecretName

Specifies the name of a Kubernetes Secret containing TLS certificate and key for HTTPS termination.

The secret must be of type kubernetes.io/tls and contain tls.crt and tls.key fields.

When specified, the Operator configures the Ingress to terminate TLS for the generated hostname.
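
For example, a minimal sketch that terminates TLS on the Ingress; the secret name is an assumption and must reference an existing kubernetes.io/tls secret:

router:
  ingress:
    ingressClass: nginx
    tlsSecretName: nim-tls-cert      # assumed pre-created TLS secret
  hostDomainName: demo.nvidia.example.com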

None

spec.expose.service.annotations

Specifies annotations to add to the Kubernetes Service resource that is created for the NIM microservice.

None

spec.expose.service.grpcPort

Specifies the Triton Inference Server gRPC service port number. Only use this port for a non-LLM NIM microservice running the Triton gRPC Inference Server.

Required when using Gateway API with spec.expose.router.gateway.grpcRoutesEnabled: true.

None

spec.expose.service.metricsPort

Specifies the Triton Inference Server metrics port number for a non-LLM NIM microservice. Only use this port for non-LLM NIM microservices that run a separate Triton Inference Server metrics endpoint.

None

spec.expose.service.name

Specifies a custom name for the Kubernetes Service resource. When not specified, the service name defaults to the NIMService resource name.

Use this field to override the default service naming convention.

NIMService resource name

spec.expose.service.port

Specifies the main API serving port number for the NIM microservice.

8000

spec.expose.service.type

Specifies the Kubernetes service type to create for the NIM microservice. Valid values include:

  • ClusterIP: Exposes the service on an internal cluster IP (default)

  • NodePort: Exposes the service on each node’s IP at a static port

  • LoadBalancer: Exposes the service externally using a cloud provider’s load balancer

Note: When set to LoadBalancer and spec.expose.router is also configured, a warning is issued because this creates two entry points for the service.

ClusterIP

spec.groupID

Specifies the group for the pods. This value is used to set the security context of the pod in the runAsGroup and fsGroup fields.

2000

spec.image

Specifies repository, tag, pull policy, and pull secret for the container image.

None

spec.inferencePlatform

Specifies the inference platform to use for deploying the NIM service. Valid values include standalone (default) and kserve. Use standalone to deploy with standard Kubernetes Deployment resources. If you are deploying through KServe, set this field to kserve.

For more information about KServe integration, refer to KServe Support on NIM Operator.

standalone

spec.labels

Specifies the user-supplied labels to add to the pod.

None

spec.livenessProbe

Specifies the liveness probe configuration for the NIM Service. Liveness probes determine when to restart a container.

By default, the Operator configures an HTTP GET probe to /v1/health/live with the following settings:

  • Initial delay: 15 seconds

  • Timeout: 1 second

  • Period: 10 seconds

  • Failure threshold: 3

To customize the probe, set spec.livenessProbe.probe with a Kubernetes probe specification. To disable the probe entirely, set spec.livenessProbe.enabled: false.

Refer to the Kubernetes documentation on probes for detailed configuration options.
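
For example, a minimal sketch that overrides the default probe timings; the values are illustrative:

livenessProbe:
  enabled: true
  probe:
    httpGet:
      path: /v1/health/live
      port: 8000
    initialDelaySeconds: 30
    periodSeconds: 20
    failureThreshold: 5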

Enabled by default with HTTP GET to /v1/health/live

spec.readinessProbe

Specifies the readiness probe configuration for the NIM Service. Readiness probes determine when a container is ready to accept traffic.

By default, the Operator configures an HTTP GET probe to /v1/health/ready with the following settings:

  • Initial delay: 15 seconds

  • Timeout: 1 second

  • Period: 10 seconds

  • Failure threshold: 3

To customize the probe, set spec.readinessProbe.probe with a Kubernetes probe specification. To disable the probe entirely, set spec.readinessProbe.enabled: false.

Refer to the Kubernetes documentation on probes for detailed configuration options.

Enabled by default with HTTP GET to /v1/health/ready

spec.startupProbe

Specifies the startup probe configuration for the NIM Service. Startup probes indicate when the application has started. All other probes are disabled until the startup probe succeeds.

By default, the Operator configures an HTTP GET probe to /v1/health/ready with the following settings:

  • Initial delay: 30 seconds

  • Timeout: 1 second

  • Period: 10 seconds

  • Failure threshold: 120 (allows up to 20 minutes for startup)

To customize the probe, set spec.startupProbe.probe with a Kubernetes probe specification. To disable the probe entirely, set spec.startupProbe.enabled: false.

Refer to the Kubernetes documentation on probes for detailed configuration options.

Enabled by default with HTTP GET to /v1/health/ready

spec.metrics.enabled

When set to true, the Operator configures a Prometheus service monitor for the service. Specify the service monitor specification in the spec.metrics.serviceMonitor field.

false

spec.metrics.serviceMonitor

Specifies the configuration for the Prometheus ServiceMonitor resource. The ServiceMonitor is a custom resource provided by the Prometheus Operator that specifies how to scrape metrics from the NIM service.

Refer to NVIDIA NIM Operator Observability for more information on configuring metrics collection.

None

spec.metrics.serviceMonitor.additionalLabels

Specifies additional labels to add to the ServiceMonitor resource. These labels are commonly used to help Prometheus discover the ServiceMonitor.

For example, when using the kube-prometheus-stack Helm chart, set release: kube-prometheus-stack to enable discovery:

serviceMonitor:
  additionalLabels:
    release: kube-prometheus-stack

None

spec.metrics.serviceMonitor.annotations

Specifies annotations to add to the ServiceMonitor resource.

None

spec.metrics.serviceMonitor.interval

Specifies the interval at which Prometheus scrapes metrics from the service. Use duration format such as 30s, 1m, or 5m.

Prometheus default (typically 30s)

spec.metrics.serviceMonitor.scrapeTimeout

Specifies the timeout duration for a scrape request. Use duration format such as 10s, 30s, or 1m. The timeout should be shorter than the scrape interval.
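
For example, a minimal sketch that sets both the scrape interval and the scrape timeout; the values are illustrative:

metrics:
  enabled: true
  serviceMonitor:
    additionalLabels:
      release: kube-prometheus-stack
    interval: 30s
    scrapeTimeout: 10s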

Prometheus default (typically 10s)

spec.multiNode

NimServiceMultiNodeConfig defines the configuration for multi-node NIMService.

Refer to Multi-Node NIM Deployment for more information.

None

spec.multiNode.backendType

BackendType specifies the backend type for the multi-node NIMService. Currently, only LWS (LeaderWorkerSet) is supported.

lws

spec.multiNode.parallelism (required)

Specifies the parallelism strategy for multi-node NIM deployments. This field is required when deploying a NIM across multiple nodes.

Configure the following subfields to define the parallelism topology:

  • pipeline: Pipeline parallelism size (number of pipeline stages)

  • tensor: Tensor parallelism size (number of GPUs per pipeline stage)

Refer to Multi-Node NIM Deployment for detailed configuration examples.
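
For example, a minimal sketch of the parallelism topology only; the values are illustrative and other multi-node fields are omitted:

multiNode:
  backendType: lws
  parallelism:
    pipeline: 2      # illustrative: two pipeline stages
    tensor: 8        # illustrative: eight GPUs per stage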

None

spec.multiNode.parallelism.pipeline

Specifies the pipeline parallelism size for the multi-node NIM deployment. Pipeline parallelism splits the model across multiple pipeline stages, where each stage processes different layers of the model.

This determines how many pipeline stages the model is split into.

None

spec.multiNode.parallelism.tensor

Specifies the tensor parallelism size for the multi-node NIM deployment. Tensor parallelism splits individual model layers across multiple GPUs within each pipeline stage.

This determines how many GPUs are used per pipeline stage.

None

spec.multiNode.mpi

Specifies the MPI configuration for a NIMService that uses LeaderWorkerSet.

None

spec.multiNode.mpi.mpiStartTimeout

Specifies the timeout in seconds for starting the MPI cluster. The Operator waits up to this duration for all nodes in the multi-node cluster to initialize and establish communication via MPI before failing the deployment.

300

spec.nodeSelector

Specifies node selector labels to schedule the service.
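
For example, a minimal sketch; the label is illustrative and depends on the labels present on your nodes:

nodeSelector:
  nvidia.com/gpu.present: "true"     # illustrative node label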

None

spec.proxy.certConfigMap

Specifies the name of the ConfigMap with CA certs for your proxy server.

None

spec.proxy.httpProxy

Specifies the address of a proxy server that should be used for outbound HTTP requests.

None

spec.proxy.httpsProxy

Specifies the address of a proxy server that should be used for outbound HTTPS requests.

None

spec.proxy.noProxy

Specifies a comma-separated list of domain names, IP addresses, or IP ranges for which proxying should be bypassed.
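
For example, a minimal sketch of a full proxy configuration; the addresses and ConfigMap name are illustrative:

proxy:
  httpProxy: http://proxy.example.com:3128       # illustrative address
  httpsProxy: https://proxy.example.com:3128     # illustrative address
  noProxy: localhost,127.0.0.1,.cluster.local
  certConfigMap: proxy-ca-certs                  # assumed ConfigMap with the proxy CA certificates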

None

spec.resources

Specifies the resource requirements for the pods.
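
For example, a minimal sketch that sets both requests and limits; the CPU and memory values are illustrative:

resources:
  requests:
    cpu: "4"
    memory: 16Gi
    nvidia.com/gpu: 1
  limits:
    cpu: "8"
    memory: 32Gi
    nvidia.com/gpu: 1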

None

spec.replicas

Specifies the desired number of pods in the replica set for the NIM microservice.

1

spec.runtimeClassName

Specifies the underlying container runtime class name to be used for running NIM with NVIDIA GPUs allocated. If not set, the default nvidia runtime class is assigned automatically. This runtime class is created by the NVIDIA GPU Operator.

None

spec.scale.enabled

When set to true, the Operator creates a Kubernetes horizontal pod autoscaler for the NIM microservice. Specify the HPA specification in the spec.scale.hpa field.

The spec.scale.hpa field supports the following subfields: minReplicas, maxReplicas, metrics, and behavior. These fields correspond to the same fields in a horizontal pod autoscaler resource specification. If you set spec.scale.enabled: true and define HPA values, do not also set spec.replicas.

false

spec.storage.nimCache

Specifies the name of the NIM cache that has the cached model profiles for the NIM microservice. Specify values for the name subfield and optionally, the profile subfield. This field has precedence over the spec.storage.pvc field. Refer to Displaying Cached Model Profiles for details on viewing available model profile names in a NIM cache. Supported profiles for Multi-LLM NIM are trtllm, sglang, and vllm.

None

spec.storage.hostPath

Specifies a host path volume for caching NIM microservices. Make sure the existing target directory has appropriate read/write permissions (typically chmod 777) before deployment to avoid caching failures. For OpenShift, the host folder needs an additional permission: sudo chcon -Rt container_file_t ${host_folder}. Permission-related errors are reported in the NIM Service status and pod events.

None

spec.storage.pvc

If you did not create a NIM cache resource to download and cache your model, you can specify this field to download model profiles. This field has the following subfields: annotations, create, name, size, storageClass, volumeAccessMode, and subPath.

To have the Operator create a PVC for the model profiles, specify pvc.create: true. Refer to Example: Create a PVC Instead of Using a NIM Cache.

None

spec.storage.emptyDir

Specifies ephemeral storage using an emptyDir volume for the NIM microservice. This enables fast temporary storage for testing, demos, or stateless workloads without requiring persistent volume claims.

Data is not retained after pod restarts. This storage type is intended for temporary workloads only and should not be used for production deployments requiring data persistence.

None

spec.storage.emptyDir.sizeLimit

Specifies the size limit for the emptyDir volume. If not specified, the emptyDir volume is created with no size limit.
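
For example, a minimal sketch; the size limit is illustrative:

storage:
  emptyDir:
    sizeLimit: 50Gi     # illustrative limit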

None

spec.storage.readOnly

When set to true, the Operator mounts the PVC from either the pvc or nimCache specification as read-only.

false

spec.storage.sharedMemorySizeLimit

Specifies the maximum size of the shared memory volume (emptyDir) used by NIM for fast model runtime read and write operations. If not specified, the NIM Operator creates an emptyDir with no size limit.

None

spec.schedulerName

Specifies the custom scheduler to use for NIM deployments. If no scheduler is specified, then your configured default scheduler is used. This could be the Kubernetes default-scheduler, or a custom default scheduler, for example if you have configured Run:ai as the default scheduler for the NIM Operator namespace.

Kubernetes default scheduler.

spec.tolerations

Specifies the tolerations for the pods.
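
For example, a minimal sketch that tolerates a GPU node taint; the taint key is illustrative:

tolerations:
  - key: nvidia.com/gpu        # illustrative taint key
    operator: Exists
    effect: NoSchedule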

None

spec.userID

Specifies the user ID for the pod. This value is used to set the security context of the pod in the runAsUser field.

1000

Prerequisites#

Procedure#

  1. Create a file, such as service.yaml, with contents like one of the following sample manifests:

    # NIMService for Non-LLM
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: nv-rerankqa-mistral-4b-v3
    spec:
      image:
        repository: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3
        tag: 1.0.2
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: nv-rerankqa-mistral-4b-v3
          profile: ''
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
    
    # NIMService for LLM-specific
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: llama-3-1-8b-instruct
    spec:
      image:
        repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
        tag: "1.3.3"
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: llama-3-1-8b-instruct
          profile: ''
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
    
    # NIMService for Multi-LLM NGC NIMCache
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: ngc-nim-service-multi
    spec:
      image:
        repository: nvcr.io/nim/nvidia/llm-nim
        tag: "1.11.0"
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: ngc-nim-cache-multi
          profile: tensorrt_llm
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
    
  2. Apply the manifest:

    $ kubectl apply -n nim-service -f service.yaml
    
  3. Optional: View information about the NIM services:

    $ kubectl describe nimservices.apps.nvidia.com -n nim-service
    

    Partial Output

    ...
    Conditions:
     Last Transition Time:  2024-08-12T19:09:43Z
     Message:               Deployment is ready
     Reason:                Ready
     Status:                True
     Type:                  Ready
     Last Transition Time:  2024-08-12T19:09:43Z
     Message:
     Reason:                Ready
     Status:                False
     Type:                  Failed
    State:                  Ready
    

Verification#

  1. Start a pod that has access to the curl command. Substitute any pod that has the command and meets your organization’s security requirements:

    $ kubectl run --rm -it -n default curl --image=curlimages/curl:latest -- ash
    

    After the pod starts, you are connected to the ash shell in the pod.

  2. Connect to the chat completions endpoint on the NIM for LLMs container.

The command connects to the service in the nim-service namespace, in this example meta-llama3-8b-instruct.nim-service, and specifies the model to use, meta/llama-3.1-8b-instruct. Replace these values if you use a different service name, namespace, or model. To find the model name, refer to Displaying Model Status.

    curl -X "POST" \
     'http://meta-llama3-8b-instruct.nim-service:8000/v1/chat/completions' \
      -H 'Accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
            "model": "meta/llama-3.1-8b-instruct",
            "messages": [
            {
              "content":"What should I do for a 4 day vacation at Cape Hatteras National Seashore?",
              "role": "user"
            }],
            "top_p": 1,
            "n": 1,
            "max_tokens": 1024,
            "stream": false,
            "frequency_penalty": 0.0,
            "stop": ["STOP"]
          }'
    
  3. Press Ctrl+D to exit and delete the pod.

Displaying Model Status#

  1. View the .status.model field of the custom resource.

    Replace meta-llama3-8b-instruct with your NIM service name.

    $ kubectl get nimservice.apps.nvidia.com -n nim-service \
        meta-llama3-8b-instruct  -o=jsonpath="{.status.model}" | jq .
    

    Example Output

      {
        "clusterEndpoint": "",
        "externalEndpoint": "",
        "name": "meta/llama-3.1-8b-instruct"
      }
    

Configuring Horizontal Pod Autoscaling#

Prerequisites#

  • Prometheus installed on your cluster. Refer to the Observability page for details on installing and configuring Prometheus for the NIM Operator.

Autoscaling NIM for LLMs#

NVIDIA NIM for LLMs provides several service metrics. Refer to Observability in the NVIDIA NIM for LLMs documentation for information about the metrics.

  1. Create a file (such as service-hpa.yaml) or update your NIMService manifest to include spec.metrics and spec.scale configuration. The following example shows the service-hpa.yaml file:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: meta-llama3-8b-instruct
    spec:
      image:
        repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
        tag: 1.3.3
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: meta-llama3-8b-instruct
          profile: ''
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
      metrics:
        enabled: true
        serviceMonitor:
          additionalLabels:
            release: kube-prometheus-stack
      scale:
        enabled: true
        hpa:
          maxReplicas: 2
          minReplicas: 1
          metrics:
          - type: Object
            object:
              metric:
                name: gpu_cache_usage_perc
              describedObject:
                apiVersion: v1
                kind: Service
                name: meta-llama3-8b-instruct
              target:
                type: Value
                value: "0.5"
    
  2. Apply the manifest:

    $ kubectl apply -n nim-service -f service-hpa.yaml
    
  3. Annotate the service resource related to NIM for LLMs:

    $ kubectl annotate -n nim-service svc meta-llama3-8b-instruct prometheus.io/scrape=true
    

    Prometheus might require several minutes to begin collecting metrics from the service.

  4. Optional: Confirm Prometheus collects the metrics.

    • If you have access to the Prometheus dashboard, search for a service metric such as gpu_cache_usage_perc.

    • You can query Prometheus Adapter:

      $ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/nim-service/services/*/gpu_cache_usage_perc" | jq .
      

      Example Output

      {
        "kind": "MetricValueList",
        "apiVersion": "custom.metrics.k8s.io/v1beta1",
        "metadata": {},
        "items": [
          {
            "describedObject": {
              "kind": "Service",
              "namespace": "nim-service",
              "name": "meta-llama3-8b-instruct",
              "apiVersion": "/v1"
            },
            "metricName": "gpu_cache_usage_perc",
            "timestamp": "2024-09-12T15:14:20Z",
            "value": "0",
            "selector": null
          }
        ]
      }
      
  5. Optional: Confirm the horizontal pod autoscaler resource is created:

    $ kubectl get hpa -n nim-service
    

    Example Output

    NAME                      REFERENCE                            TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
    meta-llama3-8b-instruct   Deployment/meta-llama3-8b-instruct   0/500m    1         2         1          40s
    

Sample Manifests#

Example: Create a PVC Instead of Using a NIM Cache#

As an alternative to creating NIM cache resources to download and cache NIM model profiles, you can have the Operator create a PVC, and the NIM service then downloads and runs a NIM model profile directly.

Create and apply a manifest like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  replicas: 1
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: 10Gi
      volumeAccessMode: ReadWriteMany
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

Example: Air-Gapped Environment#

For air-gapped environments, you must download the model profiles for the NIM microservices from a host that has internet access. You must manually create a PVC and then transfer the model profile files into the PVC.

Typically, the Operator determines the PVC name by dereferencing it from the NIM cache resource. When there is no NIM cache resource, such as in an air-gapped environment, you must specify the PVC name.

Create and apply a manifest like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  replicas: 1
  storage:
    pvc:
      name: <existing-pvc-name>
    readOnly: true
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

Example: Gateway API Routing#

You can route traffic to a NIM Service by using Gateway API. Specify your Gateway configuration in the NIMService spec.expose.router.gateway field.

Before using a Gateway to route traffic to a NIM Service:

  • Install a Gateway API controller on your cluster. For example, Istio.

  • Deploy a Gateway to use.

Use spec.expose.router.gateway.httpRoutesEnabled: true for HTTP traffic. The example below creates a NIM Cache and NIM Service for an LLM with Gateway API.

# NIM Cache for LLM specific NIM
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ""
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce

---
# NIM Service for LLM-specific NIM with Gateway API routing
# NOTE: A Gateway API controller (for example, Istio) must be deployed as a prerequisite
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
    router:
      gateway:
        namespace: nim-service
        name: istio-gateway
        httpRoutesEnabled: true
      hostDomainName: demo.nvidia.example.com

Use spec.expose.router.gateway.grpcRoutesEnabled: true for gRPC traffic. When using gRPC routes, you must also specify spec.expose.service.grpcPort, and your Gateway must support GRPCRoute resources. Only use this field for NIMs running Triton gRPC Inference Server (non-LLM NIMs).

The following example creates a NIM Cache and a NIM Service for Riva TTS.

---
# NIM Cache for Riva TTS
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: riva-tts
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/nvidia/riva-tts:1.3.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
  storage:
    pvc:
      create: true
      storageClass: ""
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce

---
# NIM Service for Riva TTS
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: riva-tts
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/nvidia/riva-tts
    tag: "1.3.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: riva-tts
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 9000
      grpcPort: 50051
    router:
      gateway:
        namespace: nemo
        name: istio-gateway
        grpcRoutesEnabled: true
      hostDomainName: demo.nvidia.example.com

Deleting a NIM Service#

To delete a NIM service, complete the following steps:

  1. View the NIM services custom resources:

    $ kubectl get nimservices.apps.nvidia.com -A
    

    Example Output

    NAMESPACE     NAME                        STATUS   AGE
    nim-service   meta-llama3-8b-instruct     Ready    2024-08-12T17:16:05Z
    
  2. Delete the custom resource:

    $ kubectl delete nimservice -n nim-service meta-llama3-8b-instruct
    

    If the Operator created the PVC when you created the NIM cache, the Operator deletes the PVC and the cached model profiles. To determine if the Operator created the PVC, run a command like the following example:

    $ kubectl get nimcaches.apps.nvidia.com -n nim-service \
       -o=jsonpath='{range .items[*]}{.metadata.name}: {.spec.storage.pvc.create}{"\n"}{end}'
    

    Example Output

    meta-llama3-8b-instruct: true