API Reference#

NIM VLM exposes an OpenAI-compatible inference API backed by vLLM, along with NIM-specific management endpoints.

Inference Endpoints#

These endpoints are provided by the vLLM inference backend.

Endpoint	Description
`POST /v1/chat/completions`	Multi-turn chat completions with message history. Supports streaming and tool calling.
`GET /v1/models`	List models currently loaded and available for inference.
`POST /tokenize`	Tokenize input text into token IDs.
`POST /detokenize`	Convert token IDs back to text.

Render endpoints return the formatted prompt without running inference:

Endpoint	Description
`POST /v1/chat/completions/render`	Render the chat template for a chat completion request.

For full request/response schemas and parameters, refer to the vLLM OpenAI-Compatible Server documentation or the interactive OpenAPI explorer at /docs on the running container.

Management Endpoints#

These endpoints are specific to the NIM container and are served by the NIM middleware layer or the nginx proxy.

Endpoint	Description
`GET /v1/health/live`	Liveness probe. Returns 200 when the container is running (served by nginx; does not require model to be loaded).
`GET /v1/health/ready`	Readiness probe. Returns 200 when the model is loaded and inference is available.
`GET /v1/metadata`	Deployment metadata including active profile, model info, and license.
`GET /v1/version`	NIM release version and OpenAPI spec version.
`GET /v1/license`	License metadata and full license text.
`GET /v1/manifest`	Model manifest with available profiles and configurations.
`GET /v1/metrics`	Prometheus-compatible metrics (request latency, throughput, queue depth, GPU utilization).

Examples#

The examples below use ${MODEL_NAME} as a shell variable. To find the model ID for your deployment, query the models endpoint:

curl -s http://localhost:8000/v1/models

Then export it for use in subsequent commands:

export MODEL_NAME="nvidia/nemotron-3-content-safety"

The model ID matches the value of NIM_SERVED_MODEL_NAME when the variable is set explicitly. If the variable is not set, NIM derives the name automatically. For more information, refer to Environment Variables.

Chat Completions#

To query the Chat Completions API, run the following command:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"What is GPU computing?\"}],
    \"max_tokens\": 256
  }"

To stream the response back to the client, run the following command:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Explain transformers briefly.\"}],
    \"max_tokens\": 256,
    \"stream\": true
  }"

Tokenize and Detokenize#

To tokenize input text into token IDs, run the following command:

curl -s http://localhost:8000/tokenize \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"prompt\": \"Hello world\"
  }"

To convert token IDs back to text, run the following command:

curl -s http://localhost:8000/detokenize \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL_NAME}\",
    \"tokens\": [9906, 1917]
  }"

List Models#

To list the available models, run the following command:

curl -s http://localhost:8000/v1/models

Health Checks#

To perform a liveness or readiness health check, run the following commands:

# Liveness (container running)
curl -s http://localhost:8000/v1/health/live

# Readiness (model loaded, ready for inference)
curl -s http://localhost:8000/v1/health/ready

Metadata and Version#

To query the deployment metadata and version, run the following commands:

curl -s http://localhost:8000/v1/metadata
curl -s http://localhost:8000/v1/version