API Reference#

NIM for LLMs exposes an OpenAI-compatible inference API backed by vLLM, along with NIM-specific management endpoints.

Inference API#

NIM supports the standard OpenAI API endpoints provided by vLLM:

| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Multi-turn chat completions with message history. Supports streaming and tool calling. |
| `POST /v1/completions` | Single-turn text completions. |
| `GET /v1/models` | List the models currently loaded and available for inference. |
| `POST /v1/embeddings` | Generate vector embeddings. Available only when serving an embedding model. |
| `POST /tokenize` | Tokenize input text into token IDs. |
| `POST /detokenize` | Convert token IDs back to text. |

For full request/response schemas and parameters, refer to the vLLM OpenAI-Compatible Server documentation.
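As a minimal illustration of the request shape, the body for `POST /v1/chat/completions` can be assembled with Python's standard library alone. The base URL and model name below are examples; use whatever `GET /v1/models` reports for your deployment.

```python
import json
import urllib.request

def build_chat_request(base_url, model, messages, **params):
    """Build a urllib Request for POST /v1/chat/completions.

    Any extra keyword arguments (max_tokens, temperature, ...) are
    passed through into the JSON body unchanged.
    """
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Example values; adjust to match your running container.
req = build_chat_request(
    "http://localhost:8000",
    "meta/llama-3.1-8b-instruct",
    [{"role": "user", "content": "What is GPU computing?"}],
    max_tokens=256,
)
# urllib.request.urlopen(req) would send it to a live NIM container.
```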

Examples#

Send a chat completion request:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "What is GPU computing?"}],
    "max_tokens": 256
  }'

Stream a chat completion:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Explain transformers briefly."}],
    "stream": true
  }'
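With `"stream": true`, the server responds with Server-Sent Events: one JSON chunk per `data:` line, terminated by a `data: [DONE]` sentinel. A sketch of accumulating the streamed text on the client side, using illustrative chunk payloads rather than a captured response:

```python
import json

def collect_stream(lines):
    """Accumulate assistant text from OpenAI-style SSE chat chunks.

    `lines` is an iterable of decoded SSE lines; each data line carries
    a JSON chunk whose choices[0].delta may include a "content" piece.
    """
    text = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            text.append(delta["content"])
    return "".join(text)

# Illustrative chunks, not captured from a real response:
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Transformers "}}]}',
    'data: {"choices": [{"delta": {"content": "use attention."}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # Transformers use attention.
```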

List available models:

curl -s http://localhost:8000/v1/models

NIM Management Endpoints#

In addition to the OpenAI-compatible inference API provided by the vLLM backend, NIM exposes the following management endpoints.

Health#
GET /v1/health/live#

Returns 200 OK when the container is running.

curl -s http://localhost:8000/v1/health/live
GET /v1/health/ready#

Returns 200 OK when the model is loaded and ready to accept inference requests.

curl -s http://localhost:8000/v1/health/ready
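Deployment scripts commonly block on this endpoint until it returns 200 before routing traffic. A polling sketch with an injectable probe function (so the retry logic is shown without a live container; in real use the probe would issue a GET to `/v1/health/ready` and return the status code):

```python
import time

def wait_until_ready(probe, timeout=300.0, interval=2.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` (returns an HTTP status code) until it reports 200.

    Returns True if the endpoint became ready within `timeout` seconds,
    False otherwise. `clock` and `sleep` are injectable for testing.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if probe() == 200:
            return True
        sleep(interval)
    return False

# Simulated container that becomes ready on the third poll:
statuses = iter([503, 503, 200])
ready = wait_until_ready(lambda: next(statuses),
                         timeout=10, interval=0, sleep=lambda _: None)
print(ready)  # True
```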
Observability#
GET /v1/metrics#

Prometheus-compatible metrics including request latency, throughput, queue depth, and GPU utilization.

curl -s http://localhost:8000/v1/metrics

See also

See Logging and Observability for a Prometheus scrape configuration example.
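The response body is Prometheus text exposition format: one `name{labels} value` sample per line, with `# HELP` and `# TYPE` comment lines interleaved. A small parser sketch for ad-hoc inspection (the metric names in the sample are illustrative, not the exact names NIM emits):

```python
def parse_metrics(text):
    """Parse Prometheus text-format samples into {metric: float}.

    Comment lines (# HELP / # TYPE) are skipped; any labels are kept
    as part of the key, which suffices for quick inspection.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is always the last space-separated token.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

# Illustrative scrape output; real NIM metric names may differ.
sample = """\
# HELP request_success_total Successful requests.
# TYPE request_success_total counter
request_success_total{model="llama"} 42
gpu_cache_usage_perc 0.17
"""
metrics = parse_metrics(sample)
print(metrics['request_success_total{model="llama"}'])  # 42.0
```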

Metadata#
GET /v1/version#

Returns the NIM release version and OpenAPI specification version.

curl -s http://localhost:8000/v1/version
GET /v1/metadata#

Returns deployment metadata including the active model profile ID and name.

curl -s http://localhost:8000/v1/metadata
GET /v1/manifest#

Returns the full model manifest describing available profiles and their configurations.

curl -s http://localhost:8000/v1/manifest
GET /v1/license#

Returns license information for the running NIM container.

curl -s http://localhost:8000/v1/license