Proxy Deployed NIM for LLMs#
The NeMo platform proxies all deployed NIM for LLMs through the NeMo inference APIs, exposing them behind a single NeMo host endpoint. This simplifies model discovery and inference.
NIM Proxy auto-detects the following types of models:

- Fine-tuned models uploaded by NeMo Customizer.
- NIM for LLMs deployed with the NeMo Deployment Management microservice.
- NIM for LLMs deployed using Helm with the following label added to the `spec` object of the `NIMService` custom resource:

  ```yaml
  spec:
    labels:
      app.nvidia.com/nim-type: inference
  ```
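If you deployed a NIM with Helm and need to add this label after the fact, you can patch the existing `NIMService` resource and verify the result. This is a minimal sketch; the resource name `my-nim` and the `default` namespace are placeholder assumptions, so adjust them to your installation:

```bash
# Merge the NIM Proxy discovery label into the NIMService spec
# ("my-nim" and "default" are placeholders; use your own values).
kubectl patch nimservice my-nim -n default --type merge \
  -p '{"spec":{"labels":{"app.nvidia.com/nim-type":"inference"}}}'

# Confirm that the label is now present in the spec.
kubectl get nimservice my-nim -n default -o jsonpath='{.spec.labels}'
```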
This guide shows you how to discover models and run inference through the NeMo NIM Proxy capability.
Prerequisites#
The following are the prerequisites:

- The NeMo NIM Proxy Helm chart is installed on your Kubernetes cluster. For more information about the Helm installation, see NeMo NIM Proxy Helm Chart Values Setup.
- The NIM Proxy microservice is running and reachable through the NeMo platform host base URL, stored in the `NEMO_BASE_URL` environment variable.
- At least one NIM is deployed.
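Before sending requests, you can optionally confirm that the NIM Proxy pod is up. A quick check, assuming the platform is installed in a namespace named `nemo` (adjust to your installation):

```bash
# Look for a pod whose name contains "nim-proxy" and confirm it is Running.
kubectl get pods -n nemo | grep nim-proxy
```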
Get List of Available Models#
Retrieve the list of available models using the following command:

```bash
export NEMO_BASE_URL=<nemo-base-url>

# Double quotes are required so the shell expands ${NEMO_BASE_URL}.
curl -X GET "${NEMO_BASE_URL}/v1/models"
```
From the list, you can choose any model for inference.
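If you only need the model names, you can filter the response with `jq`. This assumes the proxy returns the OpenAI-compatible model list schema, where entries appear under a top-level `data` array with an `id` field each:

```bash
# Print one model ID per line.
curl -s -X GET "${NEMO_BASE_URL}/v1/models" | jq -r '.data[].id'
```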
Run Inference#
To prompt the model and get inference responses, submit a POST request to one of the following APIs:

- `${NEMO_BASE_URL}/v1/completions` for completion of a provided prompt.
- `${NEMO_BASE_URL}/v1/chat/completions` for chat conversation.

For example, the following command sends a chat completion request:

```bash
curl -X POST "${NEMO_BASE_URL}/v1/chat/completions" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "What is the purpose of LLM token log probabilities? Answer with a single sentence."
      }
    ],
    "stream": false
  }'
```
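The `/v1/completions` endpoint takes a `prompt` string instead of a `messages` array. A minimal sketch, assuming the endpoint follows the OpenAI-compatible completions schema and the same model is deployed:

```bash
curl -X POST "${NEMO_BASE_URL}/v1/completions" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "meta/llama-3.1-8b-instruct",
    "prompt": "The purpose of LLM token log probabilities is",
    "max_tokens": 64,
    "stream": false
  }'
```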
For external models, add the authorization header with the API key that you used during the NIM deployment configuration:
```bash
# Double quotes around the header let the shell expand ${OPENAI_API_KEY}.
curl -X POST "${NEMO_BASE_URL}/v1/chat/completions" \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer ${OPENAI_API_KEY}" \
  --data '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "What is the purpose of LLM token log probabilities? Answer with 2 sentences."
      }
    ],
    "stream": false
  }'
```
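Both endpoints can also stream tokens as they are generated when you set `"stream"` to `true`, assuming the proxied NIM follows the OpenAI-compatible server-sent events behavior. A sketch:

```bash
# -N disables curl's output buffering so streamed events print as they arrive.
curl -N -X POST "${NEMO_BASE_URL}/v1/chat/completions" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {"role": "user", "content": "Explain token log probabilities in one sentence."}
    ],
    "stream": true
  }'
```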