Architecture#

NIM for LLMs is an enterprise orchestration layer for vLLM. It packages vLLM into a production-ready container with curated model profiles, validated configurations, and enterprise features such as health management, observability, and security hardening.

The container runs the following two processes:

  • vLLM inference backend – the model serving engine that loads models, runs GPU inference, and exposes an OpenAI-compatible API.

  • Proxy – a thin proxy on the external port that provides immediate liveness, model-aware readiness, request routing, TLS termination, and CORS handling.

High-Level Architecture#

```mermaid
flowchart TB
    client([Client Application])
    subgraph container["NIM Container"]
        direction LR
        subgraph proxy["Proxy"]
            direction TB
            live["/v1/health/live"]
            ready["/v1/health/ready"]
            infer["/v1/chat/completions<br/>/v1/completions<br/>/v1/embeddings"]
            mgmt["/v1/models<br/>/v1/metrics"]
        end
        subgraph backend["vLLM Backend"]
            engine["Inference Engine"]
        end
        ready -- "proxy_pass" --> engine
        infer -- "proxy_pass" --> engine
        mgmt -- "proxy_pass" --> engine
    end
    client -- "NIM_SERVER_PORT" --> proxy
    style container fill:#f0f4ff,stroke:#4a6fa5,stroke-width:2px
    style proxy fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px
    style backend fill:#fff3e0,stroke:#e65100,stroke-width:1px
    style client fill:#e3f2fd,stroke:#1565c0,stroke-width:1px
```

This architecture adheres to the following key design principles:

  • OpenAI-compatible API – Drop-in replacement for OpenAI endpoints, including streaming. Existing client code works without changes.

  • Production-ready health probes – Separate liveness and readiness endpoints allow orchestrators to distinguish between a running container and one that is ready to serve inference.

  • Secure by default – Only explicitly configured endpoints are exposed. All other paths return 404 Not Found. TLS termination and CORS are configurable at the proxy layer.

  • Fail-fast supervision – Both processes are monitored. If either exits, the container shuts down cleanly so the orchestrator can reschedule.
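To illustrate the drop-in compatibility, the following sketch builds an OpenAI-style chat completion request with only the Python standard library. The base URL and model name are placeholders for your deployment, not values defined by NIM:

```python
import json
import urllib.request

# Hypothetical deployment values; substitute your own host, port, and model name.
BASE_URL = "http://localhost:8000"
MODEL = "my-model"

# An OpenAI-style chat completion request body. Existing OpenAI client code
# can point at BASE_URL + "/v1" and send the same payload without changes.
payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
}

def chat_completion(base_url=BASE_URL, body=payload):
    """POST the request to the proxy's OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Because the request and response schemas match OpenAI's, swapping providers is a matter of changing the base URL and model name.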

User Workflow#

The following diagram shows the typical lifecycle when deploying and using a NIM container, from launch through inference.

```mermaid
sequenceDiagram
    participant User
    participant Orch as Orchestrator
    participant NIM as NIM Container
    User->>Orch: Deploy container
    Orch->>NIM: Start
    Orch->>NIM: GET /v1/health/ready
    NIM-->>Orch: 503 (model loading)
    Note over NIM: Model loaded
    Orch->>NIM: GET /v1/health/ready
    NIM-->>Orch: 200 OK
    User->>NIM: POST /v1/chat/completions
    NIM-->>User: Response
```
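The orchestrator's polling loop in this workflow can be sketched as follows. The probe is injected as a callable that returns an HTTP status code; the function name, timeout, and interval are illustrative defaults, not part of NIM:

```python
import time

def wait_for_ready(probe, timeout=600.0, interval=5.0, sleep=time.sleep):
    """Poll a readiness probe until it returns 200 or the timeout expires.

    `probe` is a callable returning an HTTP status code, e.g. the result of
    GET /v1/health/ready, which returns 503 while the model is loading.
    """
    deadline = time.monotonic() + timeout
    while True:
        if probe() == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Kubernetes and similar orchestrators implement this loop for you via readiness probe configuration; the sketch shows the equivalent logic for custom tooling.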

Container Startup#

When the container starts, it runs the following sequence:

  1. Start the proxy – The proxy begins listening on NIM_SERVER_PORT (default 8000). Liveness checks (/v1/health/live) pass immediately.

  2. Select a model profile – NIM detects the available GPU hardware and selects a model profile that matches. Override the selection with NIM_MODEL_PROFILE if needed.

  3. Download the model – Model files are fetched to the local cache (NIM_CACHE_PATH). If the model is already cached, this step is skipped.

  4. Launch vLLM – The inference backend starts on port 8001 (loopback only, not exposed outside the container). Configuration is merged from profile defaults, environment variables, and passthrough arguments.

  5. Report readiness – After the model loads, /v1/health/ready begins returning 200 OK and the orchestrator routes traffic to the container.

Both processes, the vLLM backend and the nginx proxy, are supervised. If either exits unexpectedly, the container shuts down so the orchestrator can reschedule it. On SIGTERM (for example, docker stop), NIM stops vLLM gracefully first and then shuts down the proxy.
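A typical launch that ties the startup sequence to the variables above might look like the following. This is illustrative only: the image name and profile ID are placeholders, and the cache path shown is an assumed default.

```shell
# Illustrative launch; <nim-image> and <profile-id> are placeholders.
# NIM_SERVER_PORT: external proxy port (step 1).
# NIM_MODEL_PROFILE: optional override of automatic profile selection (step 2).
# NIM_CACHE_PATH: where model files are downloaded and cached (step 3).
docker run --gpus all \
  -e NIM_SERVER_PORT=8000 \
  -e NIM_MODEL_PROFILE=<profile-id> \
  -e NIM_CACHE_PATH=/opt/nim/.cache \
  -v nim-cache:/opt/nim/.cache \
  -p 8000:8000 \
  <nim-image>
```

Mounting a volume at the cache path lets step 3 skip the download on subsequent starts.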

Inference#

NIM proxies vLLM’s OpenAI-compatible API. The following table summarizes how the main endpoints are routed:

| Route | Category | Description |
|---|---|---|
| /v1/chat/completions | Inference | Multi-turn chat completions with message history. |
| /v1/completions | Inference | Single-turn text completions. |
| /v1/embeddings | Inference | Vector embedding generation. |
| /v1/models | Management | List models available for inference. |
| /v1/health/live | Health | Liveness probe, served directly by the proxy. |
| /v1/health/ready | Health | Readiness probe, proxied to vLLM; confirms the model is loaded. |
| All other paths | | Rejected with 404 Not Found. |

Refer to API Reference for the full list of supported endpoints, request and response schemas, and usage examples.
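The other inference routes follow the same OpenAI schema. As a hedged sketch, an embeddings request differs only in path and body shape; the host and model name below are placeholders:

```python
import json
import urllib.request

# Placeholder values for illustration; substitute your deployment's host and model.
payload = {
    "model": "my-embedding-model",
    "input": ["NIM proxies vLLM's OpenAI-compatible API."],
}

def embed(base_url="http://localhost:8000"):
    """POST to /v1/embeddings; any path outside the routed set returns 404."""
    req = urllib.request.Request(
        base_url + "/v1/embeddings",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```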

Port Configuration#

NIM uses a two-port architecture. The external port (default 8000) is where the proxy listens and is the only port that needs to be published with -p. The vLLM backend listens on port 8001, bound to the loopback interface, and is never exposed outside the container.

Override the external port with NIM_SERVER_PORT if port 8000 conflicts with another service. Refer to Environment Variables for the complete list of configurable variables.

Observability#

NIM provides the following three observability surfaces:

  • Health probes – /v1/health/live confirms the container is running (no backend dependency); /v1/health/ready confirms the model is loaded and serving inference.

  • Metrics – Prometheus-compatible metrics are exposed at /v1/metrics, covering request latency, throughput, and GPU utilization.

  • Logging and tracing – Configurable log levels, structured JSON Lines output, and distributed tracing header forwarding (X-Request-Id and Traceparent).
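A client can attach both correlation headers so the proxy forwards them. The helper below is illustrative, assuming the W3C trace-context format for Traceparent (version 00, a 32-hex-digit trace ID, a 16-hex-digit span ID, and sampling flags):

```python
import secrets
import uuid

def tracing_headers():
    """Build X-Request-Id and a W3C trace-context Traceparent header
    for request correlation across the proxy and backend."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return {
        "X-Request-Id": str(uuid.uuid4()),
        "Traceparent": f"00-{trace_id}-{span_id}-01",
    }
```

In practice these headers usually come from your tracing library (for example, an OpenTelemetry propagator) rather than being generated by hand.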

Refer to Logging and Observability for configuration details, examples, and Prometheus scrape configuration.

Security#

NIM supports TLS termination (including mutual TLS) and configurable CORS policies at the proxy layer. Refer to Environment Variables for the full set of SSL/TLS and CORS variables.