Run Inference on Deployed NIM#
The NeMo platform provides inference capabilities by proxying all deployed NIM microservices through the NeMo APIs. NIM Proxy exposes deployed NIM through a single NeMo host endpoint, simplifying model discovery and inference.
Types of Models Auto-detected by NIM Proxy#
- Fine-tuned models uploaded by NeMo Customizer.
- NIM deployed with the NeMo Deployment Management microservice.
- NIM deployed using Helm with the following label added to the spec object of the NIMService custom resource:

  spec:
    labels:
      app.nvidia.com/nim-type: inference
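If the NIMService custom resource already exists, one way to add the label is with a JSON merge patch. This is a minimal sketch rather than a documented workflow; the resource name llama-8b-nim, the namespace, and the nimservices.apps.nvidia.com resource name are assumptions for illustration:

# Assumed resource name, namespace, and CRD group; replace with the values from your cluster.
kubectl patch nimservices.apps.nvidia.com llama-8b-nim \
  --namespace nim-service \
  --type merge \
  --patch '{"spec": {"labels": {"app.nvidia.com/nim-type": "inference"}}}'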
This guide shows you how to discover deployed NIM microservices and run inference through NIM Proxy.
Prerequisites#
The following are the prerequisites:
- The NeMo NIM Proxy Helm chart is installed on your Kubernetes cluster. For more information about the Helm installation, see NeMo NIM Proxy Helm Chart Values Setup.
- The NIM Proxy microservice is running and reachable through a base URL that is independent of the NeMo platform host base URL. Store this base URL in the NIM_PROXY_BASE_URL environment variable, as shown after this list.
- At least one NIM of the types listed in Types of Models Auto-detected by NIM Proxy is deployed on your Kubernetes cluster.
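If the proxy is not exposed outside the cluster, one way to reach it during testing is to port-forward its Kubernetes service. This is a minimal sketch; the service name nemo-nim-proxy, the namespace, and the port are assumptions, so substitute the values from your installation:

# Assumed service name, namespace, and port; use the values from your NIM Proxy installation.
kubectl port-forward -n nemo svc/nemo-nim-proxy 8000:8000 &
export NIM_PROXY_BASE_URL="http://localhost:8000"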
Get List of Available Models#
Retrieve the list of available models through the following NIM Proxy API endpoint: ${NIM_PROXY_BASE_URL}/v1/models.
Note

The NIM Proxy API endpoint /v1/models is different from the /v1/models API endpoint of the NeMo Entity Store microservice.
export NIM_PROXY_BASE_URL=<nim-proxy-base-url>
curl -X GET "${NIM_PROXY_BASE_URL}/v1/models"
From the list, you can choose any model for inference.
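The response follows an OpenAI-compatible model list schema, so you can extract just the model IDs with jq. This is an illustrative sketch that assumes the response contains a data array of model objects:

# Print one model ID per line; assumes an OpenAI-compatible {"data": [{"id": ...}, ...]} response.
curl -s "${NIM_PROXY_BASE_URL}/v1/models" | jq -r '.data[].id'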
Run Inference#
To prompt the model and get inference responses, submit a POST request to one of the following APIs:

- ${NIM_PROXY_BASE_URL}/v1/completions for completion of a provided prompt.
- ${NIM_PROXY_BASE_URL}/v1/chat/completions for chat conversation, as shown in the following example.

curl -X POST "${NIM_PROXY_BASE_URL}/v1/chat/completions" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "What is the purpose of LLM token log probabilities? Answer with a single sentence."
      }
    ],
    "stream": false
  }'
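The /v1/completions endpoint accepts a raw prompt instead of a chat message list. The following is a minimal sketch that assumes the request body follows the OpenAI-compatible completions schema and that the deployed model supports the completions endpoint:

# Assumes an OpenAI-compatible /v1/completions request schema.
curl -X POST "${NIM_PROXY_BASE_URL}/v1/completions" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "meta/llama-3.1-8b-instruct",
    "prompt": "What is the purpose of LLM token log probabilities?",
    "max_tokens": 64,
    "stream": false
  }'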
For external models, add the authorization header with the API key that you used during the NIM deployment configuration:
curl -X POST "${NIM_PROXY_BASE_URL}/v1/chat/completions" \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $OPENAI_API_KEY" \
  --data '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "What is the purpose of LLM token log probabilities? Answer with 2 sentences."
      }
    ],
    "stream": false
  }'
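To stream tokens as they are generated instead of waiting for the full response, set stream to true. This is a sketch that assumes the endpoint returns server-sent events in the OpenAI-compatible streaming format; the -N flag disables curl output buffering so chunks appear as they arrive:

# "stream": true returns server-sent event chunks; -N disables curl output buffering.
curl -N -X POST "${NIM_PROXY_BASE_URL}/v1/chat/completions" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "What is the purpose of LLM token log probabilities? Answer with a single sentence."
      }
    ],
    "stream": true
  }'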