Architecture#

NIM LLM is an enterprise orchestration layer for vLLM. It packages vLLM into a production-ready container with curated model profiles, validated configurations, and enterprise features such as health management, observability, and security hardening.

The container runs the following two processes:

  • vLLM inference backend: The model serving engine loads models, runs GPU inference, and exposes an OpenAI-compatible API.

  • Proxy: A thin nginx-based proxy on the external port provides immediate liveness, model-aware readiness, request routing, TLS termination, and CORS handling.

High-Level Architecture#

The following diagram shows the main components in the NIM container and how requests flow between them:

```mermaid
flowchart TB
    client([Client Application])

    subgraph container["NIM Container"]
        direction LR

        subgraph proxy["Proxy"]
            direction TB
            live["/v1/health/live"]
            ready["/v1/health/ready"]
            infer["/v1/chat/completions<br/>/v1/completions<br/>/v1/embeddings"]
            mgmt["/v1/models<br/>/v1/metrics"]
        end

        subgraph backend["vLLM Backend"]
            engine["Inference Engine"]
        end

        ready -- "proxy_pass" --> engine
        infer -- "proxy_pass" --> engine
        mgmt -- "proxy_pass" --> engine
    end

    client -- "NIM_SERVER_PORT or NIM_HEALTH_PORT" --> proxy

    style container fill:#f0f4ff,stroke:#4a6fa5,stroke-width:2px
    style proxy fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px
    style backend fill:#fff3e0,stroke:#e65100,stroke-width:1px
    style client fill:#e3f2fd,stroke:#1565c0,stroke-width:1px
```

This architecture adheres to the following key design principles:

  • OpenAI-compatible API: The proxy provides a drop-in replacement for OpenAI endpoints, including streaming. Existing client code works without changes (see the example after this list).

  • Production-ready health probes: Separate liveness and readiness endpoints allow orchestrators to distinguish between a running container and one that is ready to serve inference. By default, these proxy health endpoints use NIM_SERVER_PORT. If you set NIM_HEALTH_PORT, they move to that dedicated listener instead.

  • Secure by default: Only explicitly configured endpoints are exposed. All other paths return 404 Not Found. TLS termination and CORS are configurable at the proxy layer.

  • Fail-fast supervision: Both processes are monitored. If either exits, the container shuts down cleanly so the orchestrator can reschedule.
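
To illustrate the drop-in compatibility noted in the first principle, the following sketch points the standard openai Python client at a local NIM deployment. The base URL (http://localhost:8000/v1, the default NIM_SERVER_PORT) and the placeholder API key are assumptions about your environment; the served model name is discovered from /v1/models rather than hard-coded.

```python
# Minimal sketch: the standard OpenAI Python client pointed at a local NIM container.
# Assumes the container is ready and listening on the default NIM_SERVER_PORT (8000).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # NIM's OpenAI-compatible endpoint
    api_key="not-used",                   # placeholder; adjust if your deployment enforces auth
)

# Discover the served model instead of hard-coding a model name.
model_id = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Summarize what a readiness probe is."}],
)
print(response.choices[0].message.content)
```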

User Workflow#

The following diagram shows the typical lifecycle when deploying and using a NIM container, from launch through inference.

```mermaid
sequenceDiagram
    participant User
    participant Orch as Orchestrator
    participant NIM as NIM Container

    User->>Orch: Deploy container
    Orch->>NIM: Start
    Orch->>NIM: GET /v1/health/ready
    NIM-->>Orch: 503 (model loading)
    Note over NIM: Model loaded
    Orch->>NIM: GET /v1/health/ready
    NIM-->>Orch: 200 OK
    User->>NIM: POST /v1/chat/completions
    NIM-->>User: Response
```
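
The same lifecycle can be scripted directly. The sketch below, which assumes the proxy is reachable at http://localhost:8000 (the default NIM_SERVER_PORT), polls the readiness endpoint the way an orchestrator would and then issues the first inference request.

```python
# Sketch of the workflow above: poll readiness, then send the first inference request.
import time
import requests

BASE = "http://localhost:8000"  # default NIM_SERVER_PORT

# Poll /v1/health/ready; the proxy returns 503 while the model is still loading.
for _ in range(120):  # up to ~20 minutes
    try:
        if requests.get(f"{BASE}/v1/health/ready", timeout=5).status_code == 200:
            break
    except requests.ConnectionError:
        pass  # the proxy may not be listening yet
    time.sleep(10)
else:
    raise RuntimeError("NIM did not become ready in time")

# Ready: discover the served model and send the first chat completion.
model_id = requests.get(f"{BASE}/v1/models", timeout=5).json()["data"][0]["id"]
resp = requests.post(
    f"{BASE}/v1/chat/completions",
    json={"model": model_id, "messages": [{"role": "user", "content": "Hello!"}]},
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```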

Container Startup#

When the container starts, it runs the following sequence:

  1. Start the proxy: The proxy begins listening on NIM_SERVER_PORT (default 8000). By default, /v1/health/live and /v1/health/ready are also served on this port. If you set NIM_HEALTH_PORT, those health endpoints move to that dedicated listener.

  2. Select a model profile: NIM detects the available GPU hardware and selects a model profile that matches. Override the selection with NIM_MODEL_PROFILE if needed.

  3. Download the model: Model files are fetched to the local cache (NIM_CACHE_PATH). If the model is already cached, this step is skipped.

  4. Launch vLLM: The inference backend starts on port 8001 (loopback only, not exposed outside the container). Configuration is merged from profile defaults, environment variables, and passthrough arguments.

  5. Report readiness: After the model loads, nginx checks the backend /health endpoint, and /v1/health/ready begins returning 200 OK. The orchestrator then routes traffic to the container.

Both processes, vLLM and nginx, are supervised. If either exits unexpectedly, the container shuts down so the orchestrator can reschedule it. On SIGTERM (for example, docker stop), NIM stops vLLM gracefully first and then shuts down the proxy.
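
The startup inputs described above (external port, profile selection, cache path) are supplied when the container is launched. The following sketch uses the Docker SDK for Python; the image name, host cache directory, and commented-out variables are placeholders for your deployment, and NGC_API_KEY is shown only as an example of a credential your deployment may require for model download.

```python
# Sketch: launching a NIM container with the Docker SDK for Python (docker-py).
import docker

client = docker.from_env()

container = client.containers.run(
    "nvcr.io/nim/your-model-image:latest",         # placeholder image name
    detach=True,
    device_requests=[                              # expose all GPUs to the container
        docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
    ],
    environment={
        "NIM_SERVER_PORT": "8000",                 # external proxy port (default)
        "NIM_CACHE_PATH": "/opt/nim/.cache",       # example cache location inside the container
        # "NIM_MODEL_PROFILE": "<profile-id>",     # optional: override automatic profile selection
        # "NGC_API_KEY": "<your-key>",             # example credential for model download, if needed
    },
    ports={"8000/tcp": 8000},                      # publish the proxy port on the host
    volumes={"/opt/nim/cache": {"bind": "/opt/nim/.cache", "mode": "rw"}},  # persist downloads across restarts
)
print(container.id)
```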

Inference#

NIM proxies vLLM’s OpenAI-compatible API. The following table summarizes how the main endpoints are routed:

| Route | Category | Description |
| --- | --- | --- |
| /v1/chat/completions | Inference | Multi-turn chat completions with message history. |
| /v1/completions | Inference | Single-turn text completions. |
| /v1/embeddings | Inference | Vector embedding generation. |
| /v1/models | Management | List models available for inference. |
| /v1/health/live | Health | Liveness probe, served directly by the proxy on NIM_SERVER_PORT by default or NIM_HEALTH_PORT when configured. |
| /v1/health/ready | Health | Readiness probe, served by the proxy and backed by the backend /health check. |
| All other paths | — | Rejected with 404 Not Found. |

Refer to API Reference for the full list of supported endpoints, request and response schemas, and usage examples.
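
As a quick illustration of the routing rules in the table, the sketch below (assuming the proxy at http://localhost:8000) calls a management route and then an unconfigured path to show the default 404 behavior.

```python
# Sketch: exercising a management route and the default-deny behavior from the table above.
import requests

BASE = "http://localhost:8000"

# Management: list the models this NIM instance serves.
models = requests.get(f"{BASE}/v1/models", timeout=5).json()
print([m["id"] for m in models["data"]])

# Any path that is not explicitly configured is rejected.
resp = requests.get(f"{BASE}/v1/unknown-endpoint", timeout=5)
print(resp.status_code)  # expected: 404
```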

Port Configuration#

NIM uses a two-port architecture by default. The external port (default 8000) is where the proxy listens for inference and management traffic. The vLLM backend listens on port 8001 and serves the native /health endpoint. If you set NIM_HEALTH_PORT, nginx exposes /v1/health/live and /v1/health/ready on that additional port.

Override the external port with NIM_SERVER_PORT if port 8000 conflicts with another service. Refer to Environment Variables for the complete list of configurable variables.
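
For example, a deployment launched with NIM_SERVER_PORT=9000 and NIM_HEALTH_PORT=9001 (example values, not defaults) splits traffic across two listeners; the sketch below probes each one.

```python
# Sketch: probing a deployment that sets NIM_SERVER_PORT=9000 and NIM_HEALTH_PORT=9001 (example values).
import requests

SERVER = "http://localhost:9000"   # proxy: inference and management routes
HEALTH = "http://localhost:9001"   # dedicated health listener (only when NIM_HEALTH_PORT is set)

print(requests.get(f"{HEALTH}/v1/health/live", timeout=5).status_code)   # 200 once the proxy is up
print(requests.get(f"{HEALTH}/v1/health/ready", timeout=5).status_code)  # 200 once the model is loaded
print(requests.get(f"{SERVER}/v1/models", timeout=5).status_code)        # 200 when serving
```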

Observability#

NIM provides the following three observability surfaces:

  • Health probes: /v1/health/live confirms that the container is running (no backend dependency). /v1/health/ready confirms that the model is loaded and serving inference.

  • Metrics: Prometheus-compatible metrics are exposed at /v1/metrics, covering request latency, throughput, and GPU utilization.

  • Logging and tracing: Configurable log levels, structured JSON Lines output, and distributed tracing header forwarding (X-Request-Id and Traceparent) are supported.

Refer to Logging and Observability for configuration details, examples, and Prometheus scrape configuration.
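
The sketch below is a minimal illustration of the metrics and tracing surfaces, assuming the proxy at http://localhost:8000: it tags an inference request with an X-Request-Id header for correlation and then reads the Prometheus exposition text from /v1/metrics.

```python
# Sketch: exercise the tracing-header and metrics surfaces (proxy assumed at localhost:8000).
import uuid
import requests

BASE = "http://localhost:8000"
request_id = str(uuid.uuid4())

# Forwarded tracing header: correlate this request with backend logs and traces.
model_id = requests.get(f"{BASE}/v1/models", timeout=5).json()["data"][0]["id"]
requests.post(
    f"{BASE}/v1/chat/completions",
    headers={"X-Request-Id": request_id},
    json={"model": model_id, "messages": [{"role": "user", "content": "ping"}]},
    timeout=60,
)

# Prometheus-compatible metrics in plain-text exposition format.
metrics = requests.get(f"{BASE}/v1/metrics", timeout=5).text
print([line for line in metrics.splitlines() if line.startswith("# HELP")][:5])
```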

Security#

NIM supports TLS termination (including mutual TLS) and configurable CORS policies at the proxy layer. Refer to Advanced Configuration for TLS and CORS behavior and examples, and to Environment Variables for the full set of SSL/TLS and CORS variables.
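
As an illustration of what a client call looks like once TLS is enabled at the proxy, the sketch below assumes server TLS (and optionally mutual TLS) has been configured as described in Advanced Configuration; the hostname and certificate paths are placeholders.

```python
# Sketch: calling a TLS-enabled NIM proxy with python-requests.
# Certificate paths are placeholders; the cert tuple applies only if the proxy requires client certificates.
import requests

BASE = "https://nim.example.internal:8000"   # placeholder hostname

resp = requests.get(
    f"{BASE}/v1/models",
    verify="/etc/ssl/certs/nim-ca.pem",                                   # CA bundle that signed the proxy certificate
    cert=("/etc/ssl/certs/client.crt", "/etc/ssl/private/client.key"),    # client cert + key for mutual TLS
    timeout=5,
)
print(resp.status_code)
```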