Architecture#
NIM for LLMs is an enterprise orchestration layer for vLLM. It packages vLLM into a production-ready container with curated model profiles, validated configurations, and enterprise features such as health management, observability, and security hardening.
The container runs the following two processes:
vLLM inference backend – the model serving engine that loads models, runs GPU inference, and exposes an OpenAI-compatible API.
Proxy – a thin proxy on the external port that provides immediate liveness, model-aware readiness, request routing, TLS termination, and CORS handling.
High-Level Architecture#
This architecture adheres to the following key design principles:
OpenAI-compatible API – Drop-in replacement for OpenAI endpoints, including streaming. Existing client code works without changes.
Production-ready health probes – Separate liveness and readiness endpoints allow orchestrators to distinguish between a running container and one that is ready to serve inference.
Secure by default – Only explicitly configured endpoints are exposed. All other paths return 404 Not Found. TLS termination and CORS are configurable at the proxy layer.
Fail-fast supervision – Both processes are monitored. If either exits, the container shuts down cleanly so the orchestrator can reschedule.
User Workflow#
The following diagram shows the typical lifecycle when deploying and using a NIM container, from launch through inference.
Container Startup#
When the container starts, it runs the following sequence:
Start the proxy – The proxy begins listening on NIM_SERVER_PORT (default 8000). Liveness checks (/v1/health/live) pass immediately.
Select a model profile – NIM detects the available GPU hardware and selects a model profile that matches. Override the selection with NIM_MODEL_PROFILE if needed.
Download the model – Model files are fetched to the local cache (NIM_CACHE_PATH). If the model is already cached, this step is skipped.
Launch vLLM – The inference backend starts on port 8001 (loopback only, not exposed outside the container). Configuration is merged from profile defaults, environment variables, and passthrough arguments.
Report readiness – After the model loads, /v1/health/ready begins returning 200 OK and the orchestrator routes traffic to the container.
Both processes, vLLM and the nginx proxy, are supervised. If either exits unexpectedly, the container shuts down so the orchestrator can reschedule it. On SIGTERM (for example, docker stop), NIM stops vLLM gracefully first and then shuts down the proxy.
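The readiness gating in the startup sequence above can be sketched as a simple polling loop. This is an illustrative orchestrator-side sketch, not NIM code; the probe function is injected so the logic can be exercised without a live container.

```python
import time

def wait_until_ready(probe, timeout_s=300.0, interval_s=1.0, sleep=time.sleep):
    """Poll `probe()` (which returns an HTTP status code for
    /v1/health/ready) until it yields 200 or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe() == 200:
            return True       # model loaded; safe to route traffic
        sleep(interval_s)     # back off before the next readiness check
    return False              # still not ready; orchestrator may restart
```

In a real deployment the probe would issue an HTTP GET against /v1/health/ready on the external port; liveness (/v1/health/live) needs no such loop because it passes as soon as the proxy is up.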
Inference#
NIM proxies vLLM’s OpenAI-compatible API. The following table summarizes how the main endpoints are routed:
| Route | Category | Description |
|---|---|---|
| /v1/chat/completions | Inference | Multi-turn chat completions with message history. |
| /v1/completions | Inference | Single-turn text completions. |
| /v1/embeddings | Inference | Vector embedding generation. |
| /v1/models | Management | List models available for inference. |
| /v1/health/live | Health | Liveness probe, served directly by the proxy. |
| /v1/health/ready | Health | Readiness probe, proxied to vLLM; confirms the model is loaded. |
| All other paths | – | Rejected with 404 Not Found. |
Refer to API Reference for the full list of supported endpoints, request and response schemas, and usage examples.
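As a minimal client-side sketch, the chat completions route can be called with nothing but the standard library. The base URL and model name below are assumptions; substitute the model id that GET /v1/models reports for your deployment.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # proxy on NIM_SERVER_PORT (default 8000)

def build_chat_request(model, messages, stream=False):
    """Assemble an OpenAI-style POST request for /v1/chat/completions."""
    body = json.dumps(
        {"model": model, "messages": messages, "stream": stream}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending the request requires a running container:
# req = build_chat_request("my-model", [{"role": "user", "content": "Hi"}])
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, existing OpenAI client libraries also work unchanged by pointing their base URL at the proxy port.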
Port Configuration#
NIM uses a two-port architecture. The external port (default 8000) is where
the proxy listens and is the only port that needs to be published with -p. The
vLLM backend listens on port 8001, bound to the loopback interface, and is
never exposed outside the container.
Override the external port with NIM_SERVER_PORT if port 8000 conflicts with
another service. Refer to Environment Variables for the complete list of
configurable variables.
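The two-port layout above can be summarized in a few lines. This is an illustrative sketch of how the ports are resolved, not NIM's actual implementation:

```python
import os

# vLLM backend: fixed port, loopback only, never published outside the container
BACKEND_ADDR = ("127.0.0.1", 8001)

def external_port(env=os.environ):
    """Resolve the proxy's listen port, honoring the NIM_SERVER_PORT override."""
    return int(env.get("NIM_SERVER_PORT", "8000"))
```

Only the external port needs publishing, for example `docker run -p 8000:8000 …` with the default, or `-p 9000:9000` together with `NIM_SERVER_PORT=9000` when 8000 is taken.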
Observability#
NIM provides the following three observability surfaces:
Health probes – /v1/health/live confirms the container is running (no backend dependency); /v1/health/ready confirms the model is loaded and serving inference.
Metrics – Prometheus-compatible metrics are exposed at /v1/metrics, covering request latency, throughput, and GPU utilization.
Logging and tracing – Configurable log levels, structured JSON Lines output, and distributed tracing header forwarding (X-Request-Id and Traceparent).
Refer to Logging and Observability for configuration details, examples, and Prometheus scrape configuration.
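For the tracing surface, a caller attaches the forwarded correlation headers so proxy and backend log lines can be tied to one request. The helper below is a hypothetical client-side sketch (only the header names come from this document):

```python
import uuid

def with_trace_headers(headers=None, request_id=None, traceparent=None):
    """Return a copy of `headers` carrying the correlation headers NIM forwards."""
    out = dict(headers or {})
    # X-Request-Id: per-request correlation id; generate one if absent
    out["X-Request-Id"] = request_id or str(uuid.uuid4())
    if traceparent:
        # W3C Trace Context header, e.g. "00-<trace-id>-<span-id>-01"
        out["Traceparent"] = traceparent
    return out
```

Passing these headers on every request lets the structured JSON Lines logs from both processes be joined on the same id in a log aggregator.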
Security#
NIM supports TLS termination (including mutual TLS) and configurable CORS policies at the proxy layer. Refer to Environment Variables for the full set of SSL/TLS and CORS variables.