# API Reference
NIM LLM exposes an OpenAI-compatible inference API backed by vLLM, along with NIM-specific management endpoints.
## Inference Endpoints

These endpoints are provided by the vLLM inference backend.

| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Multi-turn chat completions with message history. Supports streaming and tool calling. |
| `POST /v1/completions` | Single-turn text completions. |
| `POST /v1/responses` | Create a model response (OpenAI Responses API). |
| `GET /v1/responses/{response_id}` | Retrieve a previously created response. |
| `POST /v1/responses/{response_id}/cancel` | Cancel a streaming response. |
| `POST /v1/messages` | Anthropic-compatible messages endpoint. |
| `POST /v1/messages/count_tokens` | Count tokens for a messages request without running inference. |
| `GET /v1/models` | List models currently loaded and available for inference. |
| `POST /tokenize` | Tokenize input text into token IDs. |
| `POST /detokenize` | Convert token IDs back to text. |
Render endpoints return the formatted prompt without running inference: one renders the chat template for a chat completion request, and one renders the prompt template for a completion request.
For full request/response schemas and parameters, refer to the vLLM OpenAI-Compatible Server documentation or the interactive OpenAPI explorer at `/docs` on the running container. For Anthropic-compatible request syntax, refer to the Anthropic Messages API documentation.
## Management Endpoints

These endpoints are specific to the NIM container and are served by the NIM middleware layer or the nginx proxy.

| Endpoint | Description |
|---|---|
| `GET /v1/health/live` | Liveness probe. Returns 200 when the container is running (served by nginx; does not require the model to be loaded). |
| `GET /v1/health/ready` | Readiness probe. Returns 200 when the model is loaded and inference is available. |
| `GET /v1/metadata` | Deployment metadata, including the active profile, model info, and license. |
| `GET /v1/version` | NIM release version and OpenAPI spec version. |
| `GET /v1/license` | License metadata and the full license text. |
| `GET /v1/manifest` | Model manifest with available profiles and configurations. |
| `GET /v1/metrics` | Prometheus-compatible metrics (request latency, throughput, queue depth, GPU utilization). |
## Examples

The examples below use `${MODEL_NAME}` as a shell variable. To find the model ID for your deployment, query the models endpoint:

```shell
curl -s http://localhost:8000/v1/models
```

Then export it for use in subsequent commands:

```shell
export MODEL_NAME="meta/llama-3.1-8b-instruct"
```
The model ID matches the value of `NIM_SERVED_MODEL_NAME` when the variable is set explicitly. If the variable is not set, NIM derives the name automatically. For more information, refer to Environment Variables.
### Chat Completions

To query the Chat Completions API, run the following command:

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"What is GPU computing?\"}],
    \"max_tokens\": 256
  }"
```
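The response follows the OpenAI chat completion schema, with the generated text at `choices[0].message.content`. A minimal sketch of extracting it from a decoded response body (the sample payload is illustrative, trimmed to the fields used):

```python
import json

def extract_reply(response_json: str) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    body = json.loads(response_json)
    # The generated message lives in choices[0].message.content.
    return body["choices"][0]["message"]["content"]

# Illustrative response body, trimmed to the fields used above.
sample = json.dumps({
    "choices": [
        {"index": 0,
         "message": {"role": "assistant", "content": "GPU computing uses ..."}}
    ]
})
print(extract_reply(sample))  # -> GPU computing uses ...
```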
To stream the response back to the client, run the following command:

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Explain transformers briefly.\"}],
    \"max_tokens\": 256,
    \"stream\": true
  }"
```
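Streamed chat completions arrive as Server-Sent Events: each chunk is a `data: {...}` line, and the stream ends with `data: [DONE]`. A minimal sketch of accumulating the streamed text from raw SSE lines (the sample lines are illustrative):

```python
import json

def collect_stream(lines):
    """Accumulate delta text from OpenAI-style SSE chunk lines."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            text.append(delta["content"])
    return "".join(text)

# Illustrative chunk lines as they appear on the wire.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # -> Hello world
```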
### Completions

To query the Completions API, run the following command:

```shell
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"prompt\": \"Once upon a time\",
    \"max_tokens\": 64
  }"
```
### Responses (OpenAI Responses API)

To query a model using the OpenAI Responses API, run the following command:

```shell
curl -s http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"input\": \"Explain the theory of relativity in one sentence.\"
  }"
```
### Messages (Anthropic-compatible)

NIM exposes vLLM’s Anthropic-compatible server at `/v1/messages`. NIM routes these requests through the nginx proxy and does not rewrite the request or response body.

To send an Anthropic-compatible message request, run the following command:

```shell
curl -s http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Hello, how are you?\"}],
    \"max_tokens\": 64
  }"
```
To stream the response back to the client, run the following command:

```shell
curl -s http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Explain transformers briefly.\"}],
    \"max_tokens\": 256,
    \"stream\": true
  }"
```
Streaming responses use Server-Sent Events (SSE) with Anthropic event types such as `message_start`, `content_block_start`, `content_block_delta`, `content_block_stop`, `message_delta`, and `message_stop`.
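In the Anthropic streaming format, generated text arrives in `content_block_delta` events carrying a `text_delta` payload. A minimal sketch of joining the deltas from already-decoded events (the sample event sequence is illustrative):

```python
def collect_text(events):
    """Join text deltas from Anthropic-style streaming events."""
    parts = []
    for event in events:
        if event.get("type") != "content_block_delta":
            continue  # ignore lifecycle events (message_start, stops, ...)
        delta = event.get("delta", {})
        if delta.get("type") == "text_delta":
            parts.append(delta.get("text", ""))
    return "".join(parts)

# Illustrative decoded events in arrival order.
sample = [
    {"type": "message_start"},
    {"type": "content_block_start", "index": 0},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "Trans"}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "formers"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_stop"},
]
print(collect_text(sample))  # -> Transformers
```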
The Anthropic format uses a top-level `system` field instead of a system message:

```shell
curl -s http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"max_tokens\": 256,
    \"system\": \"You are a helpful coding assistant.\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Write a Python hello world.\"}]
  }"
```
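If you already have OpenAI-style messages, moving any system message into the top-level field is mechanical. A hypothetical helper (`to_anthropic` is not part of any SDK) sketching the conversion:

```python
def to_anthropic(messages):
    """Split OpenAI-style messages into (system, messages) for /v1/messages.

    Hypothetical helper: Anthropic requests carry the system prompt in a
    top-level "system" field rather than as a message with role "system".
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return "\n".join(system_parts) or None, rest

system, messages = to_anthropic([
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python hello world."},
])
print(system)    # -> You are a helpful coding assistant.
print(messages)  # -> [{'role': 'user', 'content': 'Write a Python hello world.'}]
```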
To count tokens in a request without running inference, run the following command:

```shell
curl -s http://localhost:8000/v1/messages/count_tokens \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Hello, how are you?\"}]
  }"
```
NIM does not validate the `x-api-key` header. When you use Anthropic SDK clients, set the API key to any non-empty string that satisfies the client.
The `/v1/messages` endpoint implements the core Anthropic Messages API format. Some Anthropic-specific features can depend on the model and vLLM version:

- Tool use (`tool_use` and `tool_result` content blocks) requires a model with tool-calling support.
- Extended thinking (`thinking` content blocks) depends on model and vLLM support.
- The `anthropic-version` header is accepted but not enforced.
- Batch and admin endpoints are not supported.
For Claude Code setup and model selection, refer to Use Claude Code with NIM.
### Anthropic Python SDK

To use the Anthropic Python SDK with NIM, point the client at the NIM endpoint and set the API key to any non-empty string:

```python
import os

import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:8000",
    api_key="not-used",
)

message = client.messages.create(
    model=os.environ["MODEL_NAME"],
    max_tokens=1024,
    messages=[{"role": "user", "content": "What is GPU computing?"}],
)
print(message.content)
```
To stream the response:

```python
with client.messages.stream(
    model=os.environ["MODEL_NAME"],
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain CUDA in one paragraph."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```
To count tokens without running inference:

```python
response = client.messages.count_tokens(
    model=os.environ["MODEL_NAME"],
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.input_tokens)
```
### Tokenize and Detokenize

To tokenize input text into token IDs, run the following command:

```shell
curl -s http://localhost:8000/tokenize \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"prompt\": \"Hello world\"
  }"
```

To convert token IDs back to text, run the following command:

```shell
curl -s http://localhost:8000/detokenize \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"tokens\": [9906, 1917]
  }"
```
### List Models

To list the available models, run the following command:

```shell
curl -s http://localhost:8000/v1/models
```
### Health Checks

To perform a liveness or readiness health check, run the following commands:

```shell
# Liveness (container running)
curl -s http://localhost:8000/v1/health/live

# Readiness (model loaded, ready for inference)
curl -s http://localhost:8000/v1/health/ready
```
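A deployment script typically polls the readiness probe until it returns 200 before routing traffic. A sketch with an injectable probe so the wait loop can be exercised without a running container (the `urllib` probe assumes the default port):

```python
import time
import urllib.request

def http_probe(url="http://localhost:8000/v1/health/ready"):
    """Return True when the readiness endpoint answers 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False  # connection refused, timeout, etc.

def wait_until_ready(probe, timeout=300.0, interval=5.0, sleep=time.sleep):
    """Poll `probe` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        sleep(interval)
    return False

# Example with a fake probe that succeeds on the third attempt:
attempts = iter([False, False, True])
print(wait_until_ready(lambda: next(attempts), timeout=60, sleep=lambda _: None))  # -> True
```

In production you would call `wait_until_ready(http_probe)`; injecting the probe and sleep keeps the loop testable.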
### Metadata and Version

To query the deployment metadata and version, run the following commands:

```shell
curl -s http://localhost:8000/v1/metadata
curl -s http://localhost:8000/v1/version
```
## NIM Management Endpoints

In addition to the OpenAI-compatible inference API provided by the vLLM backend, NIM exposes the following management endpoints.
### Health

#### GET /v1/health/live

Returns `200 OK` when the container is running.

```shell
curl -s http://localhost:8000/v1/health/live
```

#### GET /v1/health/ready

Returns `200 OK` when the model is loaded and ready to accept inference requests.

```shell
curl -s http://localhost:8000/v1/health/ready
```
### Observability

#### GET /v1/metrics

Prometheus-compatible metrics, including request latency, throughput, queue depth, and GPU utilization.

```shell
curl -s http://localhost:8000/v1/metrics
```
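The payload uses the Prometheus text exposition format: `# HELP`/`# TYPE` comment lines followed by `name{labels} value` samples. A minimal sketch of pulling sample values out of a scrape (the metric names in the sample are illustrative, not the exact names NIM emits):

```python
def parse_samples(text):
    """Map 'name{labels}' -> float value from Prometheus exposition text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

# Illustrative scrape output; actual NIM metric names may differ.
sample = """\
# HELP num_requests_running Number of requests currently running.
# TYPE num_requests_running gauge
num_requests_running 3.0
gpu_cache_usage_perc{gpu="0"} 0.42
"""
print(parse_samples(sample)["num_requests_running"])  # -> 3.0
```

For production monitoring, point a Prometheus scrape job at this endpoint rather than parsing by hand.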
See also: Logging and Observability for a Prometheus scrape configuration example.
### Metadata

#### GET /v1/version

Returns the NIM release version and OpenAPI specification version.

```shell
curl -s http://localhost:8000/v1/version
```

#### GET /v1/metadata

Returns deployment metadata including the active model profile ID and name.

```shell
curl -s http://localhost:8000/v1/metadata
```

#### GET /v1/manifest

Returns the full model manifest describing available profiles and their configurations.

```shell
curl -s http://localhost:8000/v1/manifest
```

#### GET /v1/license

Returns license information for the running NIM container.

```shell
curl -s http://localhost:8000/v1/license
```