Architecture#
NIM LLM is an enterprise orchestration layer for vLLM. It packages vLLM into a production-ready container with curated model profiles, validated configurations, and enterprise features such as health management, observability, and security hardening.
The container runs the following two processes:
vLLM inference backend: The model serving engine loads models, runs GPU inference, and exposes an OpenAI-compatible API.
Proxy: A thin proxy on the external port provides immediate liveness, model-aware readiness, request routing, TLS termination, and CORS handling.
High-Level Architecture#
The following diagram shows the main components in the NIM container and how requests flow between them:
This architecture adheres to the following key design principles:
OpenAI-compatible API: The proxy provides a drop-in replacement for OpenAI endpoints, including streaming. Existing client code works without changes.
Production-ready health probes: Separate liveness and readiness endpoints allow orchestrators to distinguish between a running container and one that is ready to serve inference. By default, these proxy health endpoints use
NIM_SERVER_PORT. If you setNIM_HEALTH_PORT, they move to that dedicated listener instead.Secure by default: Only explicitly configured endpoints are exposed. All other paths return
404 Not Found. TLS termination and CORS are configurable at the proxy layer.Fail-fast supervision: Both processes are monitored. If either exits, the container shuts down cleanly so the orchestrator can reschedule.
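The fail-fast supervision principle can be sketched as a small watchdog loop. This is an illustrative Python sketch, not NIM's actual entrypoint; the child commands are stand-ins for the vLLM backend and the proxy.

```python
import subprocess
import sys
import time

def supervise(cmds):
    """Start each command; if any child exits, stop the rest and exit.

    Illustrative sketch of fail-fast supervision -- not NIM's actual
    entrypoint. Returns the exit code of the first child to exit.
    """
    procs = [subprocess.Popen(cmd) for cmd in cmds]
    try:
        while True:
            for p in procs:
                code = p.poll()
                if code is not None:
                    # One child exited: shut everything down so the
                    # orchestrator can reschedule the container.
                    for other in procs:
                        if other.poll() is None:
                            other.terminate()
                    for other in procs:
                        other.wait()
                    return code
            time.sleep(0.1)
    except KeyboardInterrupt:
        for p in procs:
            p.terminate()
        raise

if __name__ == "__main__":
    # Stand-in children: the second exits immediately and triggers shutdown.
    rc = supervise([
        [sys.executable, "-c", "import time; time.sleep(30)"],
        [sys.executable, "-c", "raise SystemExit(0)"],
    ])
    print("first child exited with", rc)
```

Because the slower child is terminated as soon as the faster one exits, the container never lingers in a half-alive state where the proxy accepts requests that no backend can serve.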
User Workflow#
The following diagram shows the typical lifecycle when deploying and using a NIM container, from launch through inference.
Container Startup#
When the container starts, it runs the following sequence:
1. Start the proxy: The proxy begins listening on `NIM_SERVER_PORT` (default 8000). By default, `/v1/health/live` and `/v1/health/ready` are also served on this port. If you set `NIM_HEALTH_PORT`, those health endpoints move to that dedicated listener.
2. Select a model profile: NIM detects the available GPU hardware and selects a model profile that matches. Override the selection with `NIM_MODEL_PROFILE` if needed.
3. Download the model: Model files are fetched to the local cache (`NIM_CACHE_PATH`). If the model is already cached, this step is skipped.
4. Launch vLLM: The inference backend starts on port 8001 (loopback only, not exposed outside the container). Configuration is merged from profile defaults, environment variables, and passthrough arguments.
5. Report readiness: After the model loads, nginx checks the backend `/health` endpoint, and `/v1/health/ready` begins returning `200 OK`. The orchestrator then routes traffic to the container.
Both processes, vLLM and nginx, are supervised. If either exits unexpectedly, the container
shuts down so the orchestrator can reschedule it. On SIGTERM (for example,
docker stop), NIM stops vLLM gracefully first and then shuts down the proxy.
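Deployment scripts commonly wait for the readiness endpoint before sending traffic. The following is a minimal sketch, assuming the default port 8000; the probe is injected as a callable so the polling loop itself can be exercised without a running container.

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(probe, timeout=600.0, interval=2.0):
    """Poll `probe()` until it returns HTTP 200 or `timeout` elapses.

    `probe` is any callable returning an HTTP status code; injecting it
    keeps this sketch testable without a running container.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe() == 200:
                return True
        except (urllib.error.URLError, ConnectionError):
            pass  # backend not up yet; keep polling
        time.sleep(interval)
    return False

def http_probe(url="http://localhost:8000/v1/health/ready"):
    # Assumes the default NIM_SERVER_PORT of 8000; use NIM_HEALTH_PORT
    # instead if you configured a dedicated health listener.
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.status
```

Once `wait_until_ready(http_probe)` returns `True`, the model is loaded and inference requests will be served rather than queued behind model download and startup.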
Inference#
NIM proxies vLLM’s OpenAI-compatible API. The following table summarizes how the main endpoints are routed:
| Route | Category | Description |
|---|---|---|
| `/v1/chat/completions` | Inference | Multi-turn chat completions with message history. |
| `/v1/completions` | Inference | Single-turn text completions. |
| `/v1/embeddings` | Inference | Vector embedding generation. |
| `/v1/models` | Management | List models available for inference. |
| `/v1/health/live` | Health | Liveness probe, served directly by the proxy on `NIM_SERVER_PORT` (or `NIM_HEALTH_PORT` if set). |
| `/v1/health/ready` | Health | Readiness probe, served by the proxy and backed by the backend `/health` endpoint. |
| All other paths | – | Rejected with `404 Not Found`. |
Refer to API Reference for the full list of supported endpoints, request and response schemas, and usage examples.
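For example, a chat completion can be requested with any OpenAI-compatible client or with plain HTTP. The following standard-library sketch assumes the default port 8000; the model name is a placeholder (list real names via `GET /v1/models`), and the actual request is commented out so the snippet does not require a running container.

```python
import json
import urllib.request

# Default NIM_SERVER_PORT is 8000; adjust if you override it.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "my-model",  # placeholder; query /v1/models for real names
    "messages": [
        {"role": "user", "content": "Summarize the NIM architecture."},
    ],
    "max_tokens": 128,
    "stream": False,  # set True for server-sent-events streaming
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Against a running NIM container:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, existing OpenAI SDK clients can be pointed at the container simply by changing the base URL.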
Port Configuration#
NIM uses a two-port architecture by default. The external port (default 8000) is where the proxy listens for inference and management traffic. The vLLM backend listens on port 8001 and serves the native `/health` endpoint. If you set `NIM_HEALTH_PORT`, nginx exposes `/v1/health/live` and `/v1/health/ready` on that additional port.

Override the external port with `NIM_SERVER_PORT` if port 8000 conflicts with another service. Refer to Environment Variables for the complete list of configurable variables.
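The port-selection rule can be summarized as: inference and management traffic always use `NIM_SERVER_PORT`, and health probes follow `NIM_HEALTH_PORT` only when that variable is set. A small illustrative helper (not part of NIM) capturing that rule:

```python
def resolve_ports(env):
    """Return (api_port, health_port) for a given environment mapping.

    Illustrative helper, not part of NIM: pass os.environ (or any dict)
    to see where the proxy serves API traffic and health probes.
    """
    api_port = int(env.get("NIM_SERVER_PORT", "8000"))
    health_port = int(env.get("NIM_HEALTH_PORT", api_port))
    return api_port, health_port

print(resolve_ports({}))  # defaults: everything on 8000
print(resolve_ports({"NIM_SERVER_PORT": "9000", "NIM_HEALTH_PORT": "9100"}))
```

With no variables set, both API and health traffic share port 8000; setting `NIM_HEALTH_PORT` splits the health probes onto their own listener, which is useful when the main port sits behind an authenticating load balancer.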
Observability#
NIM provides the following three observability surfaces:
Health probes: `/v1/health/live` confirms that the container is running (no backend dependency). `/v1/health/ready` confirms that the model is loaded and serving inference.
Metrics: Prometheus-compatible metrics are exposed at `/v1/metrics`, covering request latency, throughput, and GPU utilization.
Logging and tracing: Configurable log levels, structured JSON Lines output, and distributed tracing header forwarding (`X-Request-Id` and `Traceparent`) are supported.
Refer to Logging and Observability for configuration details, examples, and Prometheus scrape configuration.
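Outside of Prometheus, a scrape of `/v1/metrics` can also be inspected ad hoc. The following sketch parses the Prometheus text exposition format; the sample metric names are hypothetical, and real scrapes are better handled with `prometheus_client` or a Prometheus server.

```python
def parse_prometheus(text):
    """Parse Prometheus text exposition into {metric_name: summed value}.

    Minimal illustrative parser; sums values across label sets so that
    labeled series collapse into one number per metric name.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE metadata
        name, _, value = line.rpartition(" ")
        name = name.split("{", 1)[0]  # drop any {label="..."} selector
        metrics[name] = metrics.get(name, 0.0) + float(value)
    return metrics

# Hypothetical sample of what a /v1/metrics scrape might contain;
# actual metric names vary by NIM version.
sample = """\
# HELP num_requests_running Number of requests currently running.
# TYPE num_requests_running gauge
num_requests_running 2
gpu_cache_usage_perc 0.41
"""
print(parse_prometheus(sample))
```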
Security#
NIM supports TLS termination (including mutual TLS) and configurable CORS policies at the proxy layer. Refer to Advanced Configuration for TLS and CORS behavior and examples, and to Environment Variables for the full set of SSL/TLS and CORS variables.
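From the client side, connecting to a TLS-terminated NIM looks like any other HTTPS call. A hedged sketch of a verifying client context using the standard library; the certificate paths are hypothetical placeholders, and the mTLS line applies only when the proxy requires client certificates.

```python
import ssl
import urllib.request

# A default context verifies the server certificate and hostname.
ctx = ssl.create_default_context()

# Hypothetical paths -- uncomment and adjust for your deployment:
# ctx.load_verify_locations("ca.pem")               # private CA for the proxy
# ctx.load_cert_chain("client.pem", "client.key")   # mutual TLS only

# Against a TLS-enabled NIM container:
# with urllib.request.urlopen("https://nim.example:8000/v1/models",
#                             context=ctx) as resp:
#     print(resp.status)
```

Keeping hostname checking and certificate verification enabled (the defaults above) is the safe baseline; disable them only for throwaway local testing.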