Run Inference on Deployed NIM#

The NeMo platform provides inference for all deployed NIM microservices by proxying them through the NeMo APIs. The NIM Proxy capability exposes deployed NIM microservices through a single NeMo host endpoint, simplifying model discovery and inference.

This guide shows you how to discover deployed NIM microservices and run inference through NIM Proxy.

Prerequisites#

The following are the prerequisites:

  • The NeMo NIM Proxy Helm chart is installed on your Kubernetes cluster. For more information about the Helm installation, see NeMo NIM Proxy Helm Chart Values Setup.

  • The NIM Proxy microservice is running and reachable through a base URL that is independent of the NeMo platform host base URL. Store this base URL in the NIM_PROXY_BASE_URL environment variable.

  • At least one NIM microservice of a type listed in Types of Models Auto-detected by NIM Proxy is deployed on your Kubernetes cluster.

Get List of Available Models#

Retrieve the list of available models from the NIM Proxy API endpoint ${NIM_PROXY_BASE_URL}/v1/models.

Note

The NIM Proxy API endpoint /v1/models is different from the NeMo Entity Store microservice’s /v1/models API endpoint.

export NIM_PROXY_BASE_URL=<nim-proxy-base-url>
curl -X GET "${NIM_PROXY_BASE_URL}/v1/models"

From the list, you can choose any model for inference.
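
If you have jq installed, you can filter the response to show only the model names. The following is a minimal sketch; it assumes the endpoint returns an OpenAI-compatible model list with a data array whose entries include an id field:

# Print only the model IDs from the NIM Proxy model list
curl -s -X GET "${NIM_PROXY_BASE_URL}/v1/models" | jq -r '.data[].id'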

Run Inference#

To prompt the model and get inference responses, submit a POST request to one of the following APIs:

  • ${NIM_PROXY_BASE_URL}/v1/completions for completion of a provided prompt (a sketch of a completions request appears after the chat examples below).

  • ${NIM_PROXY_BASE_URL}/v1/chat/completions for a chat conversation.

    curl -X POST "${NIM_PROXY_BASE_URL}/v1/chat/completions" \
       --header 'Content-Type: application/json' \
       --data '{
          "model": "meta/llama-3.1-8b-instruct",
          "messages": [
             {
                "role": "user",
                "content": "What is the purpose of LLM token log probabilities? Answer with a single sentence."
             }
          ],
          "stream": false
       }'
    

    For external models, add the authorization header with the API key that you used during the NIM deployment configuration:

    curl -X POST "${NIM_PROXY_BASE_URL}/v1/chat/completions" \
       --header 'Content-Type: application/json' \
       --header "Authorization: Bearer $OPENAI_API_KEY" \
       --data '{
             "model": "gpt-3.5-turbo",
             "messages": [
                {
                   "role": "user",
                   "content": "What is the purpose of LLM token log probabilities? Answer with 2 sentences."
                }
             ],
             "stream": false
          }'
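
The completions endpoint listed above accepts a plain prompt instead of a message list. The following is a minimal sketch of such a request; it assumes the meta/llama-3.1-8b-instruct model from the earlier example is deployed, so substitute any model name returned by the /v1/models endpoint:

# Completion of a provided prompt; max_tokens caps the length of the generated text
curl -X POST "${NIM_PROXY_BASE_URL}/v1/completions" \
   --header 'Content-Type: application/json' \
   --data '{
      "model": "meta/llama-3.1-8b-instruct",
      "prompt": "What is the purpose of LLM token log probabilities? Answer with a single sentence.",
      "max_tokens": 64,
      "stream": false
   }'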