API Reference#

NIM LLM exposes an OpenAI-compatible inference API backed by vLLM, along with NIM-specific management endpoints.

Inference Endpoints#

These endpoints are provided by the vLLM inference backend.

POST /v1/chat/completions
    Multi-turn chat completions with message history. Supports streaming and tool calling.

POST /v1/completions
    Single-turn text completions.

POST /v1/responses
    Create a model response (OpenAI Responses API).

GET /v1/responses/{response_id}
    Retrieve a previously created response.

POST /v1/responses/{response_id}/cancel
    Cancel a streaming response.

POST /v1/messages
    Anthropic-compatible messages endpoint.

POST /v1/messages/count_tokens
    Count tokens for a messages request without running inference.

GET /v1/models
    List models currently loaded and available for inference.

POST /tokenize
    Tokenize input text into token IDs.

POST /detokenize
    Convert token IDs back to text.

Render endpoints return the formatted prompt without running inference:

POST /v1/chat/completions/render
    Render the chat template for a chat completion request.

POST /v1/completions/render
    Render the prompt template for a completion request.

For full request/response schemas and parameters, refer to the vLLM OpenAI-Compatible Server documentation or the interactive OpenAPI explorer at /docs on the running container.

For Anthropic-compatible request syntax, refer to the Anthropic Messages API documentation.

Management Endpoints#

These endpoints are specific to the NIM container and are served by the NIM middleware layer or the nginx proxy.

GET /v1/health/live
    Liveness probe. Returns 200 when the container is running (served by nginx; does not require the model to be loaded).

GET /v1/health/ready
    Readiness probe. Returns 200 when the model is loaded and inference is available.

GET /v1/metadata
    Deployment metadata including the active profile, model information, and license.

GET /v1/version
    NIM release version and OpenAPI spec version.

GET /v1/license
    License metadata and full license text.

GET /v1/manifest
    Model manifest with available profiles and configurations.

GET /v1/metrics
    Prometheus-compatible metrics (request latency, throughput, queue depth, GPU utilization).

Examples#

The examples below use ${MODEL_NAME} as a shell variable. To find the model ID for your deployment, query the models endpoint:

curl -s http://localhost:8000/v1/models

Then export it for use in subsequent commands:

export MODEL_NAME="meta/llama-3.1-8b-instruct"

The model ID matches the value of NIM_SERVED_MODEL_NAME when the variable is set explicitly. If the variable is not set, NIM derives the name automatically. For more information, refer to Environment Variables.
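
If you prefer to pull the model ID out of the JSON programmatically, the response follows the OpenAI model-list shape ({"object": "list", "data": [{"id": ...}, ...]}). The sketch below parses a sample body; the sample values are illustrative, and your deployment returns its own model ID:

```python
import json

def model_ids(payload: dict) -> list[str]:
    # Collect the "id" field of each entry in the OpenAI-style model list.
    return [m["id"] for m in payload.get("data", [])]

# Illustrative response body from GET /v1/models.
sample = json.loads(
    '{"object": "list", "data": [{"id": "meta/llama-3.1-8b-instruct", "object": "model"}]}'
)
print(model_ids(sample))  # ['meta/llama-3.1-8b-instruct']
```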

Chat Completions#

To query the Chat Completions API, run the following command:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"What is GPU computing?\"}],
    \"max_tokens\": 256
  }"
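
The response body follows the OpenAI chat completion schema, where the assistant's reply is at choices[0].message.content. A minimal sketch of extracting it from a parsed response (the sample body is illustrative; a real response also carries model, usage, and other fields):

```python
def reply_text(response: dict) -> str:
    # The generated text lives in the first choice's message content.
    return response["choices"][0]["message"]["content"]

# Illustrative chat completion response body.
sample = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "GPU computing uses graphics processors for general-purpose work."},
            "finish_reason": "stop",
        }
    ],
}
print(reply_text(sample))
```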

To stream the response back to the client, run the following command:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Explain transformers briefly.\"}],
    \"max_tokens\": 256,
    \"stream\": true
  }"

Completions#

To query the Completions API, run the following command:

curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"prompt\": \"Once upon a time\",
    \"max_tokens\": 64
  }"

Responses (OpenAI Responses API)#

To query a model using the OpenAI Responses API, run the following command:

curl -s http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"input\": \"Explain the theory of relativity in one sentence.\"
  }"
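
In the Responses API schema, generated text appears in output items of type "message" as content parts of type "output_text". A sketch of collecting that text from a parsed body, assuming that shape (the sample body is illustrative):

```python
def response_text(body: dict) -> str:
    # Join the text of every output_text content part across message output items.
    parts = []
    for item in body.get("output", []):
        if item.get("type") == "message":
            for part in item.get("content", []):
                if part.get("type") == "output_text":
                    parts.append(part.get("text", ""))
    return "".join(parts)

# Illustrative Responses API body.
sample = {
    "output": [
        {"type": "message", "content": [{"type": "output_text", "text": "Gravity curves spacetime."}]}
    ]
}
print(response_text(sample))
```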

Messages (Anthropic-compatible)#

NIM exposes vLLM’s Anthropic-compatible server at /v1/messages. NIM routes these requests through the nginx proxy and does not rewrite the request or response body.

To send an Anthropic-compatible message request, run the following command:

curl -s http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Hello, how are you?\"}],
    \"max_tokens\": 64
  }"

To stream the response back to the client, run the following command:

curl -s http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Explain transformers briefly.\"}],
    \"max_tokens\": 256,
    \"stream\": true
  }"

Streaming responses use Server-Sent Events (SSE) with Anthropic event types such as message_start, content_block_start, content_block_delta, content_block_stop, message_delta, and message_stop.
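
A sketch of reassembling the streamed text on the client side, assuming text deltas arrive as content_block_delta events carrying a text_delta payload (the standard Anthropic event shape). The sample event payloads below stand in for the data: fields of the SSE frames:

```python
import json

def collect_text(events: list[dict]) -> str:
    # Accumulate assistant text from content_block_delta events with text deltas;
    # other event types (message_start, message_stop, ...) carry no text.
    chunks = []
    for event in events:
        if event.get("type") == "content_block_delta":
            delta = event.get("delta", {})
            if delta.get("type") == "text_delta":
                chunks.append(delta.get("text", ""))
    return "".join(chunks)

# Illustrative event stream.
raw = [
    '{"type": "message_start", "message": {"role": "assistant"}}',
    '{"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}}',
    '{"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": ", world"}}',
    '{"type": "message_stop"}',
]
print(collect_text([json.loads(line) for line in raw]))  # Hello, world
```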

The Anthropic format uses a top-level system field instead of a system message:

curl -s http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"max_tokens\": 256,
    \"system\": \"You are a helpful coding assistant.\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Write a Python hello world.\"}]
  }"

To count tokens in a request without running inference, run the following command:

curl -s http://localhost:8000/v1/messages/count_tokens \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Hello, how are you?\"}]
  }"

NIM does not validate the x-api-key header. When you use Anthropic SDK clients, set the API key to any non-empty string that satisfies the client.

The /v1/messages endpoint implements the core Anthropic Messages API format. Support for some Anthropic-specific features depends on the model and vLLM version:

  • Tool use (tool_use and tool_result content blocks) requires a model with tool-calling support.

  • Extended thinking (thinking content blocks) depends on model and vLLM support.

  • The anthropic-version header is accepted but not enforced.

  • Batch and admin endpoints are not supported.

For Claude Code setup and model selection, refer to Use Claude Code with NIM.

Anthropic Python SDK#

To use the Anthropic Python SDK with NIM, point the client at the NIM endpoint and set the API key to any non-empty string:

import os
import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:8000",
    api_key="not-used",
)

message = client.messages.create(
    model=os.environ["MODEL_NAME"],
    max_tokens=1024,
    messages=[{"role": "user", "content": "What is GPU computing?"}],
)

print(message.content)

To stream the response:

with client.messages.stream(
    model=os.environ["MODEL_NAME"],
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain CUDA in one paragraph."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

To count tokens without running inference:

response = client.messages.count_tokens(
    model=os.environ["MODEL_NAME"],
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)

print(response.input_tokens)

Tokenize and Detokenize#

To tokenize input text into token IDs, run the following command:

curl -s http://localhost:8000/tokenize \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"prompt\": \"Hello world\"
  }"

To convert token IDs back to text, run the following command:

curl -s http://localhost:8000/detokenize \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"tokens\": [9906, 1917]
  }"

List Models#

To list the available models, run the following command:

curl -s http://localhost:8000/v1/models

Health Checks#

To perform a liveness or readiness health check, run the following commands:

# Liveness (container running)
curl -s http://localhost:8000/v1/health/live

# Readiness (model loaded, ready for inference)
curl -s http://localhost:8000/v1/health/ready
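
Deployment scripts often poll the readiness endpoint before sending traffic. A minimal sketch, assuming only that /v1/health/ready returns 200 once the model is loaded; the wait_until_ready name, host, and timeouts are illustrative:

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url: str, timeout: float = 300.0, interval: float = 5.0) -> bool:
    # Poll GET {base_url}/v1/health/ready until it returns 200 or the deadline passes.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/health/ready", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up or not ready yet; retry after a pause
        time.sleep(interval)
    return False

# Example: wait_until_ready("http://localhost:8000") blocks until the model is loaded.
```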

Metadata and Version#

To query the deployment metadata and version, run the following commands:

curl -s http://localhost:8000/v1/metadata
curl -s http://localhost:8000/v1/version

NIM Management Endpoints#

In addition to the OpenAI-compatible inference API provided by the vLLM backend, NIM exposes the following management endpoints.

Health#
GET /v1/health/live#

Returns 200 OK when the container is running.

curl -s http://localhost:8000/v1/health/live

GET /v1/health/ready#

Returns 200 OK when the model is loaded and ready to accept inference requests.

curl -s http://localhost:8000/v1/health/ready

Observability#
GET /v1/metrics#

Prometheus-compatible metrics including request latency, throughput, queue depth, and GPU utilization.

curl -s http://localhost:8000/v1/metrics

See also

Refer to Logging and Observability for a Prometheus scrape configuration example.

Metadata#
GET /v1/version#

Returns the NIM release version and OpenAPI specification version.

curl -s http://localhost:8000/v1/version

GET /v1/metadata#

Returns deployment metadata including the active model profile ID and name.

curl -s http://localhost:8000/v1/metadata

GET /v1/manifest#

Returns the full model manifest describing available profiles and their configurations.

curl -s http://localhost:8000/v1/manifest

GET /v1/license#

Returns license information for the running NIM container.

curl -s http://localhost:8000/v1/license