# API Reference
NIM LLM exposes an OpenAI-compatible inference API backed by vLLM, along with NIM-specific management endpoints.
## Inference Endpoints

These endpoints are provided by the vLLM inference backend.

| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Multi-turn chat completions with message history. Supports streaming and tool calling. |
| `POST /v1/completions` | Single-turn text completions. |
| `POST /v1/responses` | Create a model response (OpenAI Responses API). |
| `GET /v1/responses/{response_id}` | Retrieve a previously created response. |
| `POST /v1/responses/{response_id}/cancel` | Cancel a streaming response. |
| `POST /v1/messages` | Anthropic-compatible messages endpoint. |
| `POST /v1/messages/count_tokens` | Count tokens for a messages request without running inference. |
| `GET /v1/models` | List models currently loaded and available for inference. |
| `POST /tokenize` | Tokenize input text into token IDs. |
| `POST /detokenize` | Convert token IDs back to text. |
Render endpoints return the formatted prompt without running inference: one renders the chat template for a chat completion request, and one renders the prompt template for a completion request.
For full request/response schemas and parameters, refer to the vLLM OpenAI-Compatible Server documentation or the interactive OpenAPI explorer at `/docs` on the running container. For Anthropic-compatible request syntax, refer to the Anthropic Messages API documentation.
## Management Endpoints

These endpoints are specific to the NIM container and are served by the NIM middleware layer or the nginx proxy.

| Endpoint | Description |
|---|---|
| `GET /v1/health/live` | Liveness probe. Returns 200 when the container is running (served by nginx; does not require the model to be loaded). |
| `GET /v1/health/ready` | Readiness probe. Returns 200 when the model is loaded and inference is available. |
| `GET /v1/metadata` | Deployment metadata, including the active profile, model info, and license. |
| `GET /v1/version` | NIM release version and OpenAPI spec version. |
| `GET /v1/license` | License metadata and the full license text. |
| `GET /v1/manifest` | Model manifest with available profiles and configurations. |
| `GET /v1/metrics` | Prometheus-compatible metrics (request latency, throughput, queue depth, GPU utilization). |
## Examples

The examples below use `${MODEL_NAME}` as a shell variable. To find the model ID for your deployment, query the models endpoint:

```shell
curl -s http://localhost:8000/v1/models
```

Then export it for use in subsequent commands:

```shell
export MODEL_NAME="meta/llama-3.1-8b-instruct"
```
The model ID matches the value of `NIM_SERVED_MODEL_NAME` when the variable is set explicitly. If the variable is not set, NIM derives the name automatically. For more information, refer to Environment Variables.
### Chat Completions

To query the Chat Completions API, run the following command:

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"What is GPU computing?\"}],
    \"max_tokens\": 256
  }"
```
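The response follows the OpenAI chat completion schema, with the generated text at `choices[0].message.content`. A minimal sketch of extracting it from a decoded response body (the sample payload is illustrative, trimmed to the fields used):

```python
import json

def extract_reply(response_json: str) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    body = json.loads(response_json)
    # The generated message lives in choices[0].message.content.
    return body["choices"][0]["message"]["content"]

# Illustrative response body, trimmed to the fields used above.
sample = json.dumps({
    "choices": [
        {"index": 0,
         "message": {"role": "assistant", "content": "GPU computing uses ..."}}
    ]
})
print(extract_reply(sample))  # -> GPU computing uses ...
```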
To stream the response back to the client, run the following command:

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Explain transformers briefly.\"}],
    \"max_tokens\": 256,
    \"stream\": true
  }"
```
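Streamed chat completions arrive as Server-Sent Events: each chunk is a `data: {...}` line, and the stream ends with `data: [DONE]`. A minimal sketch of accumulating the streamed text from raw SSE lines (the sample lines are illustrative):

```python
import json

def collect_stream(lines):
    """Accumulate delta text from OpenAI-style SSE chunk lines."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            text.append(delta["content"])
    return "".join(text)

# Illustrative chunk lines as they appear on the wire.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # -> Hello world
```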
### Completions

To query the Completions API, run the following command:

```shell
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"prompt\": \"Once upon a time\",
    \"max_tokens\": 64
  }"
```
### Responses (OpenAI Responses API)

To query a model using the OpenAI Responses API, run the following command:

```shell
curl -s http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"input\": \"Explain the theory of relativity in one sentence.\"
  }"
```
### Messages (Anthropic-compatible)

NIM exposes vLLM’s Anthropic-compatible server at `/v1/messages`. NIM routes these requests through the nginx proxy and does not rewrite the request or response body.

To send an Anthropic-compatible message request, run the following command:

```shell
curl -s http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Hello, how are you?\"}],
    \"max_tokens\": 64
  }"
```
To stream the response back to the client, run the following command:

```shell
curl -s http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Explain transformers briefly.\"}],
    \"max_tokens\": 256,
    \"stream\": true
  }"
```
Streaming responses use Server-Sent Events (SSE) with Anthropic event types such as `message_start`, `content_block_start`, `content_block_delta`, `content_block_stop`, `message_delta`, and `message_stop`.
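In the Anthropic streaming format, generated text arrives in `content_block_delta` events carrying a `text_delta` payload. A minimal sketch of joining the deltas from already-decoded events (the sample event sequence is illustrative):

```python
def collect_text(events):
    """Join text deltas from Anthropic-style streaming events."""
    parts = []
    for event in events:
        if event.get("type") != "content_block_delta":
            continue  # ignore lifecycle events (message_start, stops, ...)
        delta = event.get("delta", {})
        if delta.get("type") == "text_delta":
            parts.append(delta.get("text", ""))
    return "".join(parts)

# Illustrative decoded events in arrival order.
sample = [
    {"type": "message_start"},
    {"type": "content_block_start", "index": 0},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "Trans"}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "formers"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_stop"},
]
print(collect_text(sample))  # -> Transformers
```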
The Anthropic format uses a top-level `system` field instead of a system message:

```shell
curl -s http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"max_tokens\": 256,
    \"system\": \"You are a helpful coding assistant.\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Write a Python hello world.\"}]
  }"
```
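If you already have OpenAI-style messages, moving any system message into the top-level field is mechanical. A hypothetical helper (`to_anthropic` is not part of any SDK) sketching the conversion:

```python
def to_anthropic(messages):
    """Split OpenAI-style messages into (system, messages) for /v1/messages.

    Hypothetical helper: Anthropic requests carry the system prompt in a
    top-level "system" field rather than as a message with role "system".
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return "\n".join(system_parts) or None, rest

system, messages = to_anthropic([
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python hello world."},
])
print(system)    # -> You are a helpful coding assistant.
print(messages)  # -> [{'role': 'user', 'content': 'Write a Python hello world.'}]
```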
To count tokens in a request without running inference, run the following command:

```shell
curl -s http://localhost:8000/v1/messages/count_tokens \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Hello, how are you?\"}]
  }"
```
NIM does not validate the `x-api-key` header. When you use Anthropic SDK clients, set the API key to any non-empty string that satisfies the client.
The `/v1/messages` endpoint implements the core Anthropic Messages API format. Some Anthropic-specific features can depend on the model and vLLM version:

- Tool use (`tool_use` and `tool_result` content blocks) requires a model with tool-calling support.
- Extended thinking (`thinking` content blocks) depends on model and vLLM support.
- The `anthropic-version` header is accepted but not enforced.
- Batch and admin endpoints are not supported.
For Claude Code setup and model selection, refer to Use Claude Code with NIM.
### Anthropic Python SDK

To use the Anthropic Python SDK with NIM, point the client at the NIM endpoint and set the API key to any non-empty string:

```python
import os

import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:8000",
    api_key="not-used",
)

message = client.messages.create(
    model=os.environ["MODEL_NAME"],
    max_tokens=1024,
    messages=[{"role": "user", "content": "What is GPU computing?"}],
)
print(message.content)
```
To stream the response:

```python
with client.messages.stream(
    model=os.environ["MODEL_NAME"],
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain CUDA in one paragraph."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```
To count tokens without running inference:

```python
response = client.messages.count_tokens(
    model=os.environ["MODEL_NAME"],
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.input_tokens)
```
### Tokenize and Detokenize

To tokenize input text into token IDs, run the following command:

```shell
curl -s http://localhost:8000/tokenize \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"prompt\": \"Hello world\"
  }"
```

To convert token IDs back to text, run the following command:

```shell
curl -s http://localhost:8000/detokenize \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"tokens\": [9906, 1917]
  }"
```
### List Models

To list the available models, run the following command:

```shell
curl -s http://localhost:8000/v1/models
```
### Health Checks

To perform a liveness or readiness health check, run the following commands:

```shell
# Liveness (container running)
curl -s http://localhost:8000/v1/health/live

# Readiness (model loaded, ready for inference)
curl -s http://localhost:8000/v1/health/ready
```
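A deployment script typically polls the readiness probe until it returns 200 before routing traffic. A sketch with an injectable probe so the wait loop can be exercised without a running container (the `urllib` probe assumes the default port):

```python
import time
import urllib.request

def http_probe(url="http://localhost:8000/v1/health/ready"):
    """Return True when the readiness endpoint answers 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False  # connection refused, timeout, etc.

def wait_until_ready(probe, timeout=300.0, interval=5.0, sleep=time.sleep):
    """Poll `probe` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        sleep(interval)
    return False

# Example with a fake probe that succeeds on the third attempt:
attempts = iter([False, False, True])
print(wait_until_ready(lambda: next(attempts), timeout=60, sleep=lambda _: None))  # -> True
```

In production you would call `wait_until_ready(http_probe)`; injecting the probe and sleep keeps the loop testable.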
### Metadata and Version

To query the deployment metadata and version, run the following commands:

```shell
curl -s http://localhost:8000/v1/metadata
curl -s http://localhost:8000/v1/version
```
## NIM Management Endpoints

In addition to the OpenAI-compatible inference API provided by the vLLM backend, NIM exposes the following management endpoints.
### Health

#### GET /v1/health/live

Returns `200 OK` when the container is running.

```shell
curl -s http://localhost:8000/v1/health/live
```

#### GET /v1/health/ready

Returns `200 OK` when the model is loaded and ready to accept inference requests.

```shell
curl -s http://localhost:8000/v1/health/ready
```
### Observability

#### GET /v1/metrics

Prometheus-compatible metrics, including request latency, throughput, queue depth, and GPU utilization.

```shell
curl -s http://localhost:8000/v1/metrics
```
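The payload uses the Prometheus text exposition format: `# HELP`/`# TYPE` comment lines followed by `name{labels} value` samples. A minimal sketch of pulling sample values out of a scrape (the metric names in the sample are illustrative, not the exact names NIM emits):

```python
def parse_samples(text):
    """Map 'name{labels}' -> float value from Prometheus exposition text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

# Illustrative scrape output; actual NIM metric names may differ.
sample = """\
# HELP num_requests_running Number of requests currently running.
# TYPE num_requests_running gauge
num_requests_running 3.0
gpu_cache_usage_perc{gpu="0"} 0.42
"""
print(parse_samples(sample)["num_requests_running"])  # -> 3.0
```

For production monitoring, point a Prometheus scrape job at this endpoint rather than parsing by hand.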
See also: Logging and Observability for a Prometheus scrape configuration example.
### Metadata

#### GET /v1/version

Returns the NIM release version and OpenAPI specification version.

```shell
curl -s http://localhost:8000/v1/version
```

#### GET /v1/metadata

Returns deployment metadata including the active model profile ID and name.

```shell
curl -s http://localhost:8000/v1/metadata
```

#### GET /v1/manifest

Returns the full model manifest describing available profiles and their configurations.

```shell
curl -s http://localhost:8000/v1/manifest
```

#### GET /v1/license

Returns license information for the running NIM container.

```shell
curl -s http://localhost:8000/v1/license
```