Proxy Deployed NIM for LLMs#
The NeMo platform proxies all deployed NIM for LLMs through the NeMo inference APIs, exposing them behind a single NeMo host endpoint. This simplifies model discovery and inference.
NIM Proxy auto-detects the following types of models:

- Fine-tuned models uploaded by NeMo Customizer.
- NIM for LLMs deployed with the NeMo Deployment Management microservice.
- NIM for LLMs deployed using Helm with the following label added to the `spec` object of the `NIMService` custom resource:

  ```yaml
  spec:
    labels:
      app.nvidia.com/nim-type: inference
  ```
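If you deployed a NIM with Helm and need to add this label after the fact, you can patch the existing `NIMService` resource and verify the result. This is a minimal sketch; the resource name `my-nim` and the `default` namespace are placeholder assumptions, so adjust them to your installation:

```bash
# Merge the NIM Proxy discovery label into the NIMService spec
# ("my-nim" and "default" are placeholders; use your own values).
kubectl patch nimservice my-nim -n default --type merge \
  -p '{"spec":{"labels":{"app.nvidia.com/nim-type":"inference"}}}'

# Confirm that the label is now present in the spec.
kubectl get nimservice my-nim -n default -o jsonpath='{.spec.labels}'
```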
This guide shows you how to discover models and run inference through the NeMo NIM Proxy capability.
Prerequisites#
The following are the prerequisites:

- The NeMo NIM Proxy Helm chart is installed on your Kubernetes cluster. For more information about the Helm installation, see NeMo NIM Proxy Helm Chart Values Setup.
- The NIM Proxy microservice is running and reachable through the NeMo platform host base URL, stored in the `NEMO_BASE_URL` environment variable.
- At least one NIM is deployed.
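Before sending requests, you can optionally confirm that the NIM Proxy pod is up. A quick check, assuming the platform is installed in a namespace named `nemo` (adjust to your installation):

```bash
# Look for a pod whose name contains "nim-proxy" and confirm it is Running.
kubectl get pods -n nemo | grep nim-proxy
```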
Get List of Available Models#
Retrieve the list of available models using the following command:

```bash
export NEMO_BASE_URL=<nemo-base-url>

# Double quotes are required so the shell expands ${NEMO_BASE_URL}.
curl -X GET "${NEMO_BASE_URL}/v1/models"
```
From the list, you can choose any model for inference.
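If you only need the model names, you can filter the response with `jq`. This assumes the proxy returns the OpenAI-compatible model list schema, where entries appear under a top-level `data` array with an `id` field each:

```bash
# Print one model ID per line.
curl -s -X GET "${NEMO_BASE_URL}/v1/models" | jq -r '.data[].id'
```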
Run Inference#
To prompt the model and get inference responses, submit a POST request to one of the following APIs:

- `${NEMO_BASE_URL}/v1/completions` for completion of a provided prompt.
- `${NEMO_BASE_URL}/v1/chat/completions` for chat conversation.

For example, the following command sends a chat completion request:

```bash
curl -X POST "${NEMO_BASE_URL}/v1/chat/completions" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "What is the purpose of LLM token log probabilities? Answer with a single sentence."
      }
    ],
    "stream": false
  }'
```
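The `/v1/completions` endpoint takes a `prompt` string instead of a `messages` array. A minimal sketch, assuming the endpoint follows the OpenAI-compatible completions schema and the same model is deployed:

```bash
curl -X POST "${NEMO_BASE_URL}/v1/completions" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "meta/llama-3.1-8b-instruct",
    "prompt": "The purpose of LLM token log probabilities is",
    "max_tokens": 64,
    "stream": false
  }'
```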
For external models, add the authorization header with the API key that you used during the NIM deployment configuration:
```bash
# Double quotes around the header let the shell expand ${OPENAI_API_KEY}.
curl -X POST "${NEMO_BASE_URL}/v1/chat/completions" \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer ${OPENAI_API_KEY}" \
  --data '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "What is the purpose of LLM token log probabilities? Answer with 2 sentences."
      }
    ],
    "stream": false
  }'
```
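Both endpoints can also stream tokens as they are generated when you set `"stream"` to `true`, assuming the proxied NIM follows the OpenAI-compatible server-sent events behavior. A sketch:

```bash
# -N disables curl's output buffering so streamed events print as they arrive.
curl -N -X POST "${NEMO_BASE_URL}/v1/chat/completions" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {"role": "user", "content": "Explain token log probabilities in one sentence."}
    ],
    "stream": true
  }'
```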