The vLLM backend in Dynamo integrates vLLM engines into Dynamo’s distributed runtime, enabling disaggregated serving, KV-aware routing, and request cancellation. Dynamo leverages vLLM’s native KV cache events, NIXL-based transfer mechanisms, and metric reporting.
Dynamo vLLM uses vLLM’s native argument parser — all vLLM engine arguments are passed through directly. Dynamo adds its own arguments for disaggregation mode, KV transfer, and prompt embeddings.
The vLLM backend accepts all upstream vLLM engine arguments plus Dynamo-specific arguments. The authoritative source is always the CLI's --help output.
The --help output is organized into the following groups:
- Dynamo arguments that use DYN_* env vars.
- Dynamo vLLM arguments that use DYN_VLLM_* env vars.
- vLLM engine arguments (--model, --tensor-parallel-size, --kv-transfer-config, --kv-events-config, --enable-prefix-caching, etc.). See the vLLM serve args documentation.
Dynamo supports vLLM prompt embeddings: pre-computed embeddings bypass tokenization in the Rust frontend and are decoded to tensors in the worker.
- Enable with --enable-prompt-embeds (disabled by default).
- Pass embeddings in the prompt_embeds field of the Completions API.
When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:
- Set PYTHONHASHSEED=0 for all vLLM processes when relying on Python’s built-in hashing for prefix caching.
See the high-level notes in Router Design on deterministic event IDs.
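To illustrate why the seed matters, the following sketch (not part of Dynamo) launches separate Python interpreters and compares the built-in hash of the same string. The helper name `hash_in_subprocess` and the sample prefix are made up for illustration. With PYTHONHASHSEED=0 the values agree across processes; with the variable unset, Python randomizes string hashes per process, so two workers would generally compute different hashes for the same prefix.

```python
import os
import subprocess
import sys


def hash_in_subprocess(text, seed=None):
    """Return hash(text) as computed by a fresh Python interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # ignore any seed inherited from the parent
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    out = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(out.stdout.strip())


if __name__ == "__main__":
    prefix = "system: you are a helpful assistant"
    # Same seed in both processes: the two values match.
    print("seed=0:", hash_in_subprocess(prefix, 0), hash_in_subprocess(prefix, 0))
    # No seed: each process picks its own random seed for str hashing.
    print("unset: ", hash_in_subprocess(prefix), hash_in_subprocess(prefix))
```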
vLLM workers use Dynamo’s graceful shutdown mechanism. When a SIGTERM or SIGINT is received:
- A grace period applies (DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS, default 5s).
All vLLM endpoints use graceful_shutdown=True, meaning they wait for in-flight requests to finish before exiting. An internal VllmEngineMonitor also checks engine health every 2 seconds and initiates shutdown if the engine becomes unresponsive.
For more details, see Graceful Shutdown.
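The general pattern of "catch the signal, then let in-flight work finish" can be sketched with plain asyncio. This is a minimal illustration, not Dynamo's implementation: the function names `grace_period_secs`, `drain`, and `serve` are hypothetical, and using the grace period as an upper bound on draining is an assumption made for the example. Only the environment variable name and its 5 second default come from the text above.

```python
import asyncio
import os
import signal


def grace_period_secs(environ=os.environ):
    """Read the grace period from the environment, defaulting to 5 seconds."""
    return float(environ.get("DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS", "5"))


async def drain(in_flight, grace):
    """Wait up to `grace` seconds for a set of asyncio Tasks, then cancel the rest.

    Returns the number of tasks that finished on their own.
    """
    if not in_flight:
        return 0
    done, pending = await asyncio.wait(in_flight, timeout=grace)
    for task in pending:
        task.cancel()
    if pending:
        # Let the cancelled tasks unwind before returning.
        await asyncio.gather(*pending, return_exceptions=True)
    return len(done)


async def serve(in_flight):
    """Block until SIGTERM or SIGINT arrives, then drain in-flight work."""
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)  # available on Unix event loops
    await stop.wait()
    return await drain(in_flight, grace_period_secs())
```

Separating `drain` from the signal wiring keeps the draining logic testable without sending real signals to the process.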
Each worker type has a specialized health check payload that validates the full inference pipeline.
Health checks are registered with the Dynamo runtime and called by the frontend or Kubernetes liveness probes. The payload can be overridden via the DYN_HEALTH_CHECK_PAYLOAD environment variable. See Health Checks for the broader health check architecture.
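As a rough sketch of what an environment-variable override looks like, the snippet below resolves a payload from DYN_HEALTH_CHECK_PAYLOAD and falls back to a default. It assumes the variable holds a JSON object; the function name `resolve_health_check_payload` and the contents of `DEFAULT_PAYLOAD` are hypothetical and do not reflect the payloads Dynamo actually ships. Check the Health Checks documentation for the accepted formats.

```python
import json
import os

# Hypothetical default; the real per-worker payloads are defined by Dynamo.
DEFAULT_PAYLOAD = {"token_ids": [1], "stop_conditions": {"max_tokens": 1}}


def resolve_health_check_payload(environ=os.environ, default=DEFAULT_PAYLOAD):
    """Return the override from the environment, or a copy of the default."""
    raw = environ.get("DYN_HEALTH_CHECK_PAYLOAD")
    if not raw:
        return dict(default)
    payload = json.loads(raw)  # raises json.JSONDecodeError on malformed input
    if not isinstance(payload, dict):
        raise ValueError("health check payload must be a JSON object")
    return payload
```

Failing loudly on malformed input is a deliberate choice in this sketch: a silently ignored override would leave the probe running a payload the operator did not intend.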
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources.
For more details, see the Request Cancellation Architecture documentation.
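The idea of a client disconnect stopping generation can be illustrated with generic asyncio task cancellation. This is a simplified stand-in, not Dynamo's mechanism: `worker_generate` and `handle_request` are hypothetical names, the disconnect is simulated with a timer, and a real worker would abort the engine request where the comment indicates.

```python
import asyncio


async def worker_generate(n_tokens, log):
    """Stand-in for a worker's token stream; records when it is cancelled."""
    try:
        for i in range(n_tokens):
            await asyncio.sleep(0.01)  # pretend to compute one token
            log.append(i)
    except asyncio.CancelledError:
        log.append("cancelled")  # a real worker would abort the engine request here
        raise


async def handle_request(disconnect_after, n_tokens=1000):
    """Run generation and cancel it once the simulated client disconnects."""
    log = []
    task = asyncio.create_task(worker_generate(n_tokens, log))
    await asyncio.sleep(disconnect_after)  # simulated client disconnect
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the generation task was cancelled
    return log


if __name__ == "__main__":
    events = asyncio.run(handle_request(0.05))
    print(f"generated {len(events) - 1} tokens before cancellation")
```

Generation stops after a handful of tokens instead of running all 1000 steps, which is the resource saving the cancellation path is meant to provide.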
Dynamo supports request migration to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the Request Migration Architecture documentation for configuration details.