Run Inference on Deployed NIM#
The NeMo platform provides inference capabilities by proxying all deployed NIM microservices through the NeMo APIs. NIM Proxy exposes deployed NIM through a single NeMo host endpoint, simplifying model discovery and inference.
Types of Models Auto-detected by NIM Proxy#
- Fine-tuned models uploaded by NeMo Customizer.
- NIM deployed with the NeMo Deployment Management microservice.
- NIM deployed using Helm with the following label added to the spec object of the NIMService custom resource:

  spec:
    labels:
      app.nvidia.com/nim-type: inference
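If the NIMService custom resource already exists, one way to add the label is with a JSON merge patch. This is a minimal sketch rather than a documented workflow; the resource name llama-8b-nim, the namespace, and the nimservices.apps.nvidia.com resource name are assumptions for illustration:

# Assumed resource name, namespace, and CRD group; replace with the values from your cluster.
kubectl patch nimservices.apps.nvidia.com llama-8b-nim \
  --namespace nim-service \
  --type merge \
  --patch '{"spec": {"labels": {"app.nvidia.com/nim-type": "inference"}}}'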
This guide shows you how to discover deployed NIM microservices and run inference through NIM Proxy.
Prerequisites#
The following are the prerequisites:
- The NeMo NIM Proxy Helm chart is installed on your Kubernetes cluster. For more information about the Helm installation, see NeMo NIM Proxy Helm Chart Values Setup.
- The NIM Proxy microservice is running and reachable through a base URL that is independent of the NeMo platform host base URL. Store this base URL in the NIM_PROXY_BASE_URL environment variable, as shown after this list.
- At least one NIM of the types listed in Types of Models Auto-detected by NIM Proxy is deployed on your Kubernetes cluster.
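If the proxy is not exposed outside the cluster, one way to reach it during testing is to port-forward its Kubernetes service. This is a minimal sketch; the service name nemo-nim-proxy, the namespace, and the port are assumptions, so substitute the values from your installation:

# Assumed service name, namespace, and port; use the values from your NIM Proxy installation.
kubectl port-forward -n nemo svc/nemo-nim-proxy 8000:8000 &
export NIM_PROXY_BASE_URL="http://localhost:8000"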
Get List of Available Models#
Retrieve the list of available models through the following NIM Proxy API endpoint: ${NIM_PROXY_BASE_URL}/v1/models.
Note

The NIM Proxy API endpoint /v1/models is different from the /v1/models API endpoint of the NeMo Entity Store microservice.
export NIM_PROXY_BASE_URL=<nim-proxy-base-url>
curl -X GET "${NIM_PROXY_BASE_URL}/v1/models"
From the list, you can choose any model for inference.
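The response follows an OpenAI-compatible model list schema, so you can extract just the model IDs with jq. This is an illustrative sketch that assumes the response contains a data array of model objects:

# Print one model ID per line; assumes an OpenAI-compatible {"data": [{"id": ...}, ...]} response.
curl -s "${NIM_PROXY_BASE_URL}/v1/models" | jq -r '.data[].id'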
Run Inference#
To prompt the model and get inference responses, submit a POST request to one of the following APIs:

- ${NIM_PROXY_BASE_URL}/v1/completions for completion of a provided prompt.
- ${NIM_PROXY_BASE_URL}/v1/chat/completions for chat conversation, as shown in the following example.

curl -X POST "${NIM_PROXY_BASE_URL}/v1/chat/completions" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "What is the purpose of LLM token log probabilities? Answer with a single sentence."
      }
    ],
    "stream": false
  }'
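The /v1/completions endpoint accepts a raw prompt instead of a chat message list. The following is a minimal sketch that assumes the request body follows the OpenAI-compatible completions schema and that the deployed model supports the completions endpoint:

# Assumes an OpenAI-compatible /v1/completions request schema.
curl -X POST "${NIM_PROXY_BASE_URL}/v1/completions" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "meta/llama-3.1-8b-instruct",
    "prompt": "What is the purpose of LLM token log probabilities?",
    "max_tokens": 64,
    "stream": false
  }'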
For external models, add the authorization header with the API key that you used during the NIM deployment configuration:
curl -X POST "${NIM_PROXY_BASE_URL}/v1/chat/completions" \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $OPENAI_API_KEY" \
  --data '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "What is the purpose of LLM token log probabilities? Answer with 2 sentences."
      }
    ],
    "stream": false
  }'
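To stream tokens as they are generated instead of waiting for the full response, set stream to true. This is a sketch that assumes the endpoint returns server-sent events in the OpenAI-compatible streaming format; the -N flag disables curl output buffering so chunks appear as they arrive:

# "stream": true returns server-sent event chunks; -N disables curl output buffering.
curl -N -X POST "${NIM_PROXY_BASE_URL}/v1/chat/completions" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "What is the purpose of LLM token log probabilities? Answer with a single sentence."
      }
    ],
    "stream": true
  }'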