API Reference#
NIM for LLMs exposes an OpenAI-compatible inference API backed by vLLM, along with NIM-specific management endpoints.
Inference API#
NIM supports the standard OpenAI API endpoints provided by vLLM:
| Endpoint | Description |
|---|---|
| `/v1/chat/completions` | Multi-turn chat completions with message history. Supports streaming and tool calling. |
| `/v1/completions` | Single-turn text completions. |
| `/v1/models` | List models currently loaded and available for inference. |
| `/v1/embeddings` | Generate vector embeddings. Only available when serving an embedding model. |
| `/tokenize` | Tokenize input text into token IDs. |
| `/detokenize` | Convert token IDs back to text. |
For full request/response schemas and parameters, refer to the vLLM OpenAI-Compatible Server documentation.
Examples#
Send a chat completion request:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta/llama-3.1-8b-instruct",
"messages": [{"role": "user", "content": "What is GPU computing?"}],
"max_tokens": 256
}'
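The assistant's reply is returned at `choices[0].message.content` in the response JSON. A quick way to pull out just that field with jq (the response body below is a trimmed, hand-written sample, not captured NIM output; jq is assumed to be installed):

```shell
# Extract the assistant's reply text from a chat completion response.
# The variable holds a trimmed sample body for illustration only.
response='{"choices":[{"message":{"role":"assistant","content":"GPU computing offloads parallel work to the GPU."}}]}'
echo "$response" | jq -r '.choices[0].message.content'
```

In practice you would pipe the curl command above directly into the same jq filter.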
Stream a chat completion:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta/llama-3.1-8b-instruct",
"messages": [{"role": "user", "content": "Explain transformers briefly."}],
"stream": true
}'
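With `"stream": true`, chunks arrive as server-sent-events lines of the form `data: {json}`, terminated by a `data: [DONE]` sentinel, with the incremental text in each chunk's `choices[0].delta.content`. A minimal sketch of reassembling the text, shown on a canned two-chunk sample rather than live output (jq is assumed to be installed):

```shell
# Reassemble streamed text: strip the "data: " SSE prefix, drop the
# [DONE] sentinel, and join the delta.content fragments with jq.
# The three printf lines are a canned sample stream, not live output.
printf '%s\n' \
  'data: {"choices":[{"delta":{"content":"Hello"}}]}' \
  'data: {"choices":[{"delta":{"content":" world"}}]}' \
  'data: [DONE]' \
  | sed -n 's/^data: //p' \
  | grep -v '^\[DONE\]$' \
  | jq -rj '.choices[0].delta.content // empty'
```

A live stream can be processed the same way by piping `curl -sN` into the `sed | grep | jq` stages.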
List available models:
curl -s http://localhost:8000/v1/models
NIM Management Endpoints#
In addition to the OpenAI-compatible inference API provided by the vLLM backend, NIM exposes the following management endpoints.
Health#
GET /v1/health/live#
Returns 200 OK when the container is running.
curl -s http://localhost:8000/v1/health/live
GET /v1/health/ready#
Returns 200 OK when the model is loaded and ready to accept inference requests.
curl -s http://localhost:8000/v1/health/ready
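In automation it is common to block on the readiness probe before sending traffic. A minimal polling sketch, where the URL and retry count are illustrative defaults rather than NIM settings:

```shell
# Poll /v1/health/ready until it returns 200, or give up after N attempts.
wait_for_ready() {
  url="${1:-http://localhost:8000}"
  tries="${2:-60}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl exit non-zero on HTTP errors, -s silences progress output.
    if curl -sf "$url/v1/health/ready" > /dev/null; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timed out waiting for $url" >&2
  return 1
}
```

For example, `wait_for_ready http://localhost:8000 120` waits up to two minutes before giving up.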
Observability#
GET /v1/metrics#
Prometheus-compatible metrics including request latency, throughput, queue depth, and GPU utilization.
curl -s http://localhost:8000/v1/metrics
See also
Logging and Observability for a Prometheus scrape configuration example.
Metadata#
GET /v1/version#
Returns the NIM release version and OpenAPI specification version.
curl -s http://localhost:8000/v1/version
GET /v1/metadata#
Returns deployment metadata including the active model profile ID and name.
curl -s http://localhost:8000/v1/metadata
GET /v1/manifest#
Returns the full model manifest describing available profiles and their configurations.
curl -s http://localhost:8000/v1/manifest
GET /v1/license#
Returns license information for the running NIM container.
curl -s http://localhost:8000/v1/license