Elastic NIM
NVIDIA Elastic NIM is a managed AI inference service fabric that enables enterprises to seamlessly deploy and scale private generative AI model endpoints securely on distributed, accelerated cloud or data center infrastructure. The service is orchestrated and managed by NVIDIA, while proprietary data never leaves the secure tenancy of the enterprise's hybrid virtual private clouds (VPCs).
NVIDIA Elastic NIM inference services for Llama 3.x are now available for enterprises to securely scale across any accelerated computing infrastructure. This guide provides step-by-step instructions for creating and deploying the llama3-8b-instruct NIM as a function on NVIDIA Cloud Functions (NVCF).
About NIM
NVIDIA NIM is a set of easy-to-use microservices designed to accelerate the deployment of foundation models for generative AI applications across various computing environments. Enterprises have two options when deploying NVIDIA NIM in production: subscribe to the Elastic NIM service, which is deployed and managed by NVIDIA in their virtual private cloud or on DGX Cloud, or export, deploy, and self-manage a NIM themselves.
Leveraging NVIDIA NIMs and the NVIDIA Elastic NIM service ensures optimized performance at scale, the reliability of a managed service, and data privacy and proximity. Key enterprise features and benefits include:
Deployed in the customer VPC, NVIDIA Elastic NIM enables enterprises to comply with corporate governance and maintain the security of proprietary data.
Ensures NVIDIA-optimized performance on any accelerated infrastructure, based on NVIDIA's best practices for data-center-scale control plane and cluster management.
Unified orchestration across hybrid clouds ensures high availability and utilization across distributed accelerated compute clusters.
Ongoing network performance optimization for full-stack acceleration.
Burst to DGX Cloud on-demand.
NVIDIA Enterprise Support and service level agreements for production AI workflows.
Prerequisites
Setup
Access to a Kubernetes cluster with available GPU nodes.
Register the Kubernetes cluster with NVCF using the NVIDIA Cluster Agent and configure the cluster.
Important
Please refer to the Cluster Setup & Management documentation for additional prerequisites and step-by-step instructions.
Access to a client workstation or server.
Install the Docker or Podman client.
Install the Kubernetes client (kubectl) with access to the backend Kubernetes API.
Ensure the client workstation/server can download files from NVIDIA NGC.
NVIDIA NGC Account with access to the Enterprise Catalog and a private registry.
Note
Please refer to the NGC Private Registry User Guide for more details. The private registry will be used in a later step for storing llama3-8b-instruct NIM.
NGC Authentication
Important
An NGC Personal API Key is required to access NGC resources and a key can be generated here.
Create an NGC Personal API Key and ensure that the following are selected from the “Services Included” dropdown:
Cloud Functions
AI Foundation Models and Endpoints
NGC Catalog
Private Registry
Note
Personal keys allow you to configure an expiration date, revoke or delete the key using an action button, and rotate the key as needed. For more information about key types, please refer to the NGC User Guide.
Warning
Keep your key secret and in a safe place. Do not share it or store it in a place where others can see or copy it.
Download the NIM
Execute the following on the client workstation/server to download the llama3-8b-instruct NIM from the public registry in NGC.
Export the NGC API Key
Export the value of the API Key as the API_KEY environment variable so that subsequent commands can reference it:
# NGC Organization ID. The name of the org, not the display name
export ORG_NAME=<org_name>

# NGC Personal API Key. Starts with nvapi-
export API_KEY=<your_key_here>
NGC CLI Tool
This documentation uses the NGC CLI tool in a few of the steps. See the NGC CLI documentation for information on downloading and configuring the tool.
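After installation, the CLI can be configured interactively for your organization. A minimal example, assuming the key and org set up above:

# Configure the NGC CLI (prompts for the API key, org, team, and output format)
ngc config set

# Confirm which org and team the CLI is currently pointed at
ngc config current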
Docker Login to NGC
To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:
# with docker
$ docker login nvcr.io
Username: $oauthtoken
Password: $API_KEY
# with podman
$ podman login nvcr.io
Username: $oauthtoken
Password: $API_KEY
Note
Use $oauthtoken as the username and API_KEY as the password. The $oauthtoken username is a special name that indicates that you will authenticate with an API key and not a username and password.
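If you prefer a non-interactive login, for example in a script, the key can be supplied on stdin instead of typing it at the prompt. This is standard Docker/Podman behavior and is shown only as a convenience; note the single quotes so that $oauthtoken is passed literally:

# with docker
echo "$API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

# with podman
echo "$API_KEY" | podman login nvcr.io --username '$oauthtoken' --password-stdin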
List Available NIMs
NVIDIA regularly publishes new models that are available as downloadable NIMs to NVIDIA AI Enterprise customers.
Note
Use the NGC CLI to see a list of available NIMs.
Use the following command to list the available NIMs in CSV format.
ngc registry image list --format_type csv nvcr.io/nim/meta/\*
This should produce something like the following.
Name,Repository,Latest Tag,Image Size,Updated Date,Permission,Signed Tag?,Access Type,Associated Products
Llama3-70b-instruct,nim/meta/llama3-70b-instruct,1.0.0,5.96 GB,"Jun 01, 2024",unlocked,True,LISTED,"nv-ai-enterprise, nvidia-nim-da"
Llama3-8b-instruct,nim/meta/llama3-8b-instruct,1.0.0,5.96 GB,"Jun 01, 2024",unlocked,True,LISTED,"nv-ai-enterprise, nvidia-nim-da"
Note
You will use the Repository and Latest Tag fields when you call the docker pull command in an upcoming step.
Note
Once the NIM has been downloaded, you will upload it to the NGC private registry.
Download the NIM container image using either docker or podman.
# with docker
docker pull nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

# with podman
podman pull nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
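Optionally, confirm the image is now available locally before tagging it; the output will vary by environment:

# with docker
docker images nvcr.io/nim/meta/llama3-8b-instruct

# with podman
podman images nvcr.io/nim/meta/llama3-8b-instruct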
Note
If you would like to customize the NIM image prior to uploading to the private registry, create a Dockerfile using NIM as the base image. See the example in the nim-deploy repository as a reference.
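A minimal sketch of such a customization is shown below; the copied file and its destination are placeholders only, and the example in the nim-deploy repository remains the authoritative reference:

# Write a Dockerfile that uses the NIM as the base image (file names are placeholders)
cat > Dockerfile <<'EOF'
FROM nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
# Example only: add an organization-specific asset to the image
COPY custom-config.yaml /opt/custom-config.yaml
EOF

# Build the customized image; the tag and push steps below would then reference this image
docker build -t llama3-8b-instruct-custom:1.0.0 .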
Upload NIM to Private Registry
NVCF requires the NIM container image to exist in the private registry prior to creating the function.
Tag the image with the NGC org name that has NVCF enabled, naming the image nvcf-nim with the tag meta-llama3-8b-instruct.
# with docker
docker tag nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 nvcr.io/$ORG_NAME/nvcf-nim:meta-llama3-8b-instruct

# with podman
podman tag nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 nvcr.io/$ORG_NAME/nvcf-nim:meta-llama3-8b-instruct
Push the image to the private registry.
# with docker
docker push nvcr.io/$ORG_NAME/nvcf-nim:meta-llama3-8b-instruct

# with podman
podman push nvcr.io/$ORG_NAME/nvcf-nim:meta-llama3-8b-instruct
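To confirm the upload, the private registry can be listed with the NGC CLI; the wildcard pattern mirrors the earlier listing command and may need adjusting for your org layout:

# Verify the pushed image is visible in the private registry
ngc registry image list --format_type csv "nvcr.io/$ORG_NAME/nvcf-nim*"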
Create the Function
Now that the llama3-8b-instruct NIM has been downloaded and uploaded to the private registry, we will add the NVCF function logic and dependencies.
In the NGC NVCF console, create a function from a custom container
Define the following settings for the function:
Container: Choose the image from the drop-down
Tag: Choose from the available tags in the drop-down
Models: Leave blank
Inference Protocol: HTTP
Inference Endpoint: /v1/chat/completions
Health Path: /v1/health/ready
Environment Key: NGC_API_KEY
Environment Value: Enter the value of your NGC Personal API Key ($API_KEY)
Note
NIM will download the llama3-8b-instruct model from NGC during startup to the container’s ephemeral disk.
Click Create Function
Deploy the Function to a backend cluster by clicking Deploy Version
Within the Deploy Function dialog, fill in the following required fields:
Function Name: prepopulated function name
Function Version: prepopulated latest function version
Backend: A collection of one or more (though usually one) clusters to deploy on, for example a CSP such as Azure, OCI, or GCP, or an NVIDIA-specific cluster such as GFN.
GPU: prepopulated GPU model
Instance Type: Each GPU type can support one or more instance types, which are different configurations, such as the number of CPU cores, and the number of GPUs per node.
Max Concurrency: The number of simultaneous invocations your container can handle at any given time
Min Instances: The minimum number of instances your function should be deployed on
Max Instances: The maximum number of instances your function is allowed to autoscale to
Click Deploy Function.
Set the Function ID as an environment variable for convenience. This will be used for validating/testing the function.
$ export FUNCTION_ID=<your_function_id>

$ curl -X POST "https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/${FUNCTION_ID}" \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "Can you write me a happy song?"
      }
    ],
    "max_tokens": 32
  }'

# output
{"id":"cmpl-3ae8dd639f74451e98c2a2e2873441ec","object":"chat.completion","created":1721774173,"model":"meta/llama3-8b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"I'd be delighted to write a happy song for you!\n\nHere's a brand new, original song, just for you:\n\n**Title:** \"Sparkle in"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":19,"total_tokens":51,"completion_tokens":32}}
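The endpoint also accepts the standard OpenAI-style "stream": true parameter; a sketch of a streaming request is shown below, assuming the function streams over HTTP (tokens are then returned as server-sent events rather than a single JSON body):

$ curl -X POST "https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/${FUNCTION_ID}" \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Accept: text/event-stream" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [{"role": "user", "content": "Can you write me a happy song?"}],
    "max_tokens": 32,
    "stream": true
  }'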
Validating the Function Deployment
NVCF creates a pod for the function in the `nvcf-backend` namespace. The pod might take a few minutes to initialize depending on the size of the image and environment factors.
$ kubectl get all -n nvcf-backend
NAME READY STATUS RESTARTS AGE
pod/0-sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e 2/2 Running 0 10m
During initialization, pod logs are unavailable. Monitor the event log for status:
$ kubectl get events -n nvcf-backend
LAST SEEN TYPE REASON OBJECT MESSAGE
12m Normal Scheduled pod/0-sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e Successfully assigned nvcf-backend/0-sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e to aks-ncvfgpu-13288136-vmss000000
12m Normal Pulled pod/0-sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e Container image "nvcr.io/qtfpt1h0bieu/nvcf-core/nvcf_worker_init:0.24.10" already present on machine
12m Normal Created pod/0-sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e Created container init
12m Normal Started pod/0-sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e Started container init
12m Normal Pulled pod/0-sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e Container image "nvcr.io/0494738860185553/nvcf-nim:meta-llama3-8b-instruct" already present on machine
12m Normal Created pod/0-sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e Created container inference
12m Normal Started pod/0-sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e Started container inference
12m Normal Pulled pod/0-sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e Container image "nvcr.io/qtfpt1h0bieu/nvcf-core/nvcf_worker_utils:2.24.2" already present on machine
12m Normal Created pod/0-sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e Created container utils
12m Normal Started pod/0-sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e Started container utils
25m Normal Killing pod/0-sr-dca52008-f0ae-43ca-9f5a-ff9c4ca8c00d Stopping container inference
25m Normal Killing pod/0-sr-dca52008-f0ae-43ca-9f5a-ff9c4ca8c00d Stopping container utils
12m Normal InstanceStatusUpdate spotrequest/sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e Request accepted for processing
12m Normal InstanceCreation spotrequest/sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e Creating 1 requested instances
12m Normal InstanceCreation spotrequest/sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e Created Pod Instance 0-sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e
12m Normal InstanceStatusUpdate spotrequest/sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e 0-sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e is running
25m Normal InstanceTermination spotrequest/sr-d4c0d756-1a30-41bd-afdb-b8f440ac5ce4 Stopped instance nvcf-backend/0-sr-dca52008-f0ae-43ca-9f5a-ff9c4ca8c00d
25m Normal InstanceStatusUpdate spotrequest/sr-d4c0d756-1a30-41bd-afdb-b8f440ac5ce4 0-sr-dca52008-f0ae-43ca-9f5a-ff9c4ca8c00d is terminated
23m Normal InstanceTermination spotrequest/sr-d4c0d756-1a30-41bd-afdb-b8f440ac5ce4 All instances terminated, request will be cleaned-up
23m Normal InstanceStatusUpdate spotrequest/sr-dca52008-f0ae-43ca-9f5a-ff9c4ca8c00d 0-sr-dca52008-f0ae-43ca-9f5a-ff9c4ca8c00d is terminated
23m Normal InstanceTermination spotrequest/sr-dca52008-f0ae-43ca-9f5a-ff9c4ca8c00d All instances terminated, request will be cleaned-up
Once running, NIM will download the model during startup if it’s not present on the disk. For large models, this can take several minutes.
Example startup logs for llama3-8b-instruct
$ kubectl logs 0-sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e -n nvcf-backend
Defaulted container "inference" out of: inference, utils, init (init)
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.
2024-07-23 22:27:23,428 [INFO] PyTorch version 2.2.2 available.
2024-07-23 22:27:24,016 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-07-23 22:27:24,016 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-07-23 22:27:24,202 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 07-23 22:27:24.927 api_server.py:489] NIM LLM API version 1.0.0
INFO 07-23 22:27:24.929 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 07-23 22:27:24.929 ngc_profile.py:219] Detected 1 compatible profile(s).
INFO 07-23 22:27:24.929 ngc_injector.py:106] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0]
INFO 07-23 22:27:24.929 ngc_injector.py:141] Selected profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
INFO 07-23 22:27:25.388 ngc_injector.py:146] Profile metadata: llm_engine: vllm
INFO 07-23 22:27:25.388 ngc_injector.py:146] Profile metadata: precision: fp16
INFO 07-23 22:27:25.388 ngc_injector.py:146] Profile metadata: feat_lora: false
INFO 07-23 22:27:25.388 ngc_injector.py:146] Profile metadata: tp: 1
INFO 07-23 22:27:25.389 ngc_injector.py:166] Preparing model workspace. This step might download additional files to run the model.
INFO 07-23 22:28:14.764 ngc_injector.py:172] Model workspace is now ready. It took 49.375 seconds
INFO 07-23 22:28:14.767 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-k3_wpocb', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-k3_wpocb', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 07-23 22:28:15.0 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-23 22:28:15.16 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
INFO 07-23 22:28:16 selector.py:28] Using FlashAttention backend.
INFO 07-23 22:28:19 model_runner.py:173] Loading model weights took 14.9595 GB
INFO 07-23 22:28:20.644 gpu_executor.py:119] # GPU blocks: 27793, # CPU blocks: 2048
INFO 07-23 22:28:22 model_runner.py:973] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-23 22:28:22 model_runner.py:977] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-23 22:28:27 model_runner.py:1054] Graph capturing finished in 5 secs.
WARNING 07-23 22:28:28.194 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-23 22:28:28.205 serving_chat.py:347] Using default chat template:
{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>
'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>
' }}{% endif %}
WARNING 07-23 22:28:28.420 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-23 22:28:28.431 api_server.py:456] Serving endpoints:
0.0.0.0:8000/openapi.json
0.0.0.0:8000/docs
0.0.0.0:8000/docs/oauth2-redirect
0.0.0.0:8000/metrics
0.0.0.0:8000/v1/health/ready
0.0.0.0:8000/v1/health/live
0.0.0.0:8000/v1/models
0.0.0.0:8000/v1/version
0.0.0.0:8000/v1/chat/completions
0.0.0.0:8000/v1/completions
INFO 07-23 22:28:28.431 api_server.py:460] An example cURL request:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama3-8b-instruct",
"messages": [
{
"role":"user",
"content":"Hello! How are you?"
},
{
"role":"assistant",
"content":"Hi! I am quite well, how can I help you today?"
},
{
"role":"user",
"content":"Can you write me a song?"
}
],
"top_p": 1,
"n": 1,
"max_tokens": 15,
"stream": true,
"frequency_penalty": 1.0,
"stop": ["hello"]
}'
INFO 07-23 22:28:28.476 server.py:82] Started server process [32]
INFO 07-23 22:28:28.477 on.py:48] Waiting for application startup.
INFO 07-23 22:28:28.478 on.py:62] Application startup complete.
INFO 07-23 22:28:28.480 server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 07-23 22:28:28.574 httptools_impl.py:481] 127.0.0.1:40948 - "GET /v1/health/ready HTTP/1.1" 503
INFO 07-23 22:28:29.75 httptools_impl.py:481] 127.0.0.1:40962 - "GET /v1/health/ready HTTP/1.1" 200
INFO 07-23 22:28:38.478 metrics.py:334] Avg prompt throughput: 0.3 tokens/s, Avg generation throughput: 1.5 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 07-23 22:28:48.478 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
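As an additional check from the Kubernetes side, you can port-forward to the inference container and query the readiness endpoint directly; the pod name below is from the example above and will differ in your cluster:

# Forward the NIM HTTP port to the local machine
$ kubectl port-forward pod/0-sr-60f3f280-10a3-42bd-945a-97dd9fc1e67e 8000:8000 -n nvcf-backend

# In another terminal, query the readiness endpoint
$ curl http://localhost:8000/v1/health/ready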
Troubleshooting
Issue: Pod is stuck in “Initializing”
This process can take several minutes. If too much time has passed, ensure the NGC Personal API Key is valid and there are no issues in the event logs.
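If the pod remains stuck, the init container logs and recent namespace events usually show the cause; substitute your own pod name:

# Inspect the init container of the function pod
$ kubectl logs <pod_name> -n nvcf-backend -c init

# Review recent events in the namespace, most recent last
$ kubectl get events -n nvcf-backend --sort-by=.lastTimestamp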
Issue: Pods are stuck in “Pending”
Ensure GPU-enabled pods can be scheduled in the cluster. Check that a GPU is available and that no taints exist that would block the scheduler. See Troubleshooting the GPU Operator for more information.
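The scheduler records its reason for leaving a pod in Pending in the pod's events, and the node description shows whether GPUs are advertised and whether taints are present; substitute your own pod name:

# Show scheduling events for the pending pod
$ kubectl describe pod <pod_name> -n nvcf-backend

# Confirm nodes advertise nvidia.com/gpu and check for taints
$ kubectl describe nodes | grep -iE 'nvidia.com/gpu|taints'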
Issue: Inference requests to the API endpoint do not return any output
Ensure the NGC Personal API Key in the Authorization header is correct and has appropriate access. See Statuses and Errors for more tips.
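As a quick check that the key itself is accepted for Cloud Functions, you can list your functions with it; this assumes the standard NVCF list-functions endpoint, and a 401 or 403 response points to a key or permission problem:

# List functions visible to this key; an authorization error indicates a key or scope issue
$ curl -s -H "Authorization: Bearer ${API_KEY}" "https://api.nvcf.nvidia.com/v2/nvcf/functions"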