API Reference#

NIM for LLMs exposes an OpenAI-compatible inference API backed by vLLM, along with NIM-specific management endpoints.

Inference API#

NIM supports the standard OpenAI API endpoints provided by vLLM:

| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Multi-turn chat completions with message history. Supports streaming and tool calling. |
| `POST /v1/completions` | Single-turn text completions. |
| `GET /v1/models` | List the models currently loaded and available for inference. |
| `POST /v1/embeddings` | Generate vector embeddings. Available only when serving an embedding model. |
| `POST /tokenize` | Tokenize input text into token IDs. |
| `POST /detokenize` | Convert token IDs back to text. |

For full request/response schemas and parameters, refer to the vLLM OpenAI-Compatible Server documentation.
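As a minimal illustration of the request shape, the body for `POST /v1/chat/completions` can be assembled with Python's standard library alone. The base URL and model name below are examples; use whatever `GET /v1/models` reports for your deployment.

```python
import json
import urllib.request

def build_chat_request(base_url, model, messages, **params):
    """Build a urllib Request for POST /v1/chat/completions.

    Any extra keyword arguments (max_tokens, temperature, ...) are
    passed through into the JSON body unchanged.
    """
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Example values; adjust to match your running container.
req = build_chat_request(
    "http://localhost:8000",
    "meta/llama-3.1-8b-instruct",
    [{"role": "user", "content": "What is GPU computing?"}],
    max_tokens=256,
)
# urllib.request.urlopen(req) would send it to a live NIM container.
```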

Examples#

Send a chat completion request:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "What is GPU computing?"}],
    "max_tokens": 256
  }'

Stream a chat completion:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Explain transformers briefly."}],
    "stream": true
  }'
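With `"stream": true`, the server responds with Server-Sent Events: one JSON chunk per `data:` line, terminated by a `data: [DONE]` sentinel. A sketch of accumulating the streamed text on the client side, using illustrative chunk payloads rather than a captured response:

```python
import json

def collect_stream(lines):
    """Accumulate assistant text from OpenAI-style SSE chat chunks.

    `lines` is an iterable of decoded SSE lines; each data line carries
    a JSON chunk whose choices[0].delta may include a "content" piece.
    """
    text = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            text.append(delta["content"])
    return "".join(text)

# Illustrative chunks, not captured from a real response:
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Transformers "}}]}',
    'data: {"choices": [{"delta": {"content": "use attention."}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # Transformers use attention.
```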

List available models:

curl -s http://localhost:8000/v1/models

NIM Management Endpoints#

In addition to the OpenAI-compatible inference API provided by the vLLM backend, NIM exposes the following management endpoints.

Health#
GET /v1/health/live#

Returns 200 OK when the container is running.

curl -s http://localhost:8000/v1/health/live
GET /v1/health/ready#

Returns 200 OK when the model is loaded and ready to accept inference requests.

curl -s http://localhost:8000/v1/health/ready
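Deployment scripts commonly block on this endpoint until it returns 200 before routing traffic. A polling sketch with an injectable probe function (so the retry logic is shown without a live container; in real use the probe would issue a GET to `/v1/health/ready` and return the status code):

```python
import time

def wait_until_ready(probe, timeout=300.0, interval=2.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` (returns an HTTP status code) until it reports 200.

    Returns True if the endpoint became ready within `timeout` seconds,
    False otherwise. `clock` and `sleep` are injectable for testing.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if probe() == 200:
            return True
        sleep(interval)
    return False

# Simulated container that becomes ready on the third poll:
statuses = iter([503, 503, 200])
ready = wait_until_ready(lambda: next(statuses),
                         timeout=10, interval=0, sleep=lambda _: None)
print(ready)  # True
```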
Observability#
GET /v1/metrics#

Prometheus-compatible metrics including request latency, throughput, queue depth, and GPU utilization.

curl -s http://localhost:8000/v1/metrics

See also

See Logging and Observability for a Prometheus scrape configuration example.
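The response body is Prometheus text exposition format: one `name{labels} value` sample per line, with `# HELP` and `# TYPE` comment lines interleaved. A small parser sketch for ad-hoc inspection (the metric names in the sample are illustrative, not the exact names NIM emits):

```python
def parse_metrics(text):
    """Parse Prometheus text-format samples into {metric: float}.

    Comment lines (# HELP / # TYPE) are skipped; any labels are kept
    as part of the key, which suffices for quick inspection.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is always the last space-separated token.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

# Illustrative scrape output; real NIM metric names may differ.
sample = """\
# HELP request_success_total Successful requests.
# TYPE request_success_total counter
request_success_total{model="llama"} 42
gpu_cache_usage_perc 0.17
"""
metrics = parse_metrics(sample)
print(metrics['request_success_total{model="llama"}'])  # 42.0
```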

Metadata#
GET /v1/version#

Returns the NIM release version and OpenAPI specification version.

curl -s http://localhost:8000/v1/version
GET /v1/metadata#

Returns deployment metadata including the active model profile ID and name.

curl -s http://localhost:8000/v1/metadata
GET /v1/manifest#

Returns the full model manifest describing available profiles and their configurations.

curl -s http://localhost:8000/v1/manifest
GET /v1/license#

Returns license information for the running NIM container.

curl -s http://localhost:8000/v1/license