Reference Guide | NVIDIA Dynamo Documentation

Overview

The vLLM backend in Dynamo integrates vLLM engines into Dynamo’s distributed runtime, enabling disaggregated serving, KV-aware routing, and request cancellation. Dynamo leverages vLLM’s native KV cache events, NIXL-based transfer mechanisms, and metric reporting.

Dynamo vLLM uses vLLM’s native argument parser — all vLLM engine arguments are passed through directly. Dynamo adds its own arguments for disaggregation mode, KV transfer, and prompt embeddings.

Argument Reference

The vLLM backend accepts all upstream vLLM engine arguments plus Dynamo-specific arguments. The authoritative source is always the CLI:

$ python -m dynamo.vllm --help

The --help output is organized into the following groups:

Dynamo Runtime Options — Namespace, discovery backend, request/event plane, endpoint types, tool/reasoning parsers, and custom chat templates. These are common across all Dynamo backends and use DYN_* env vars.
Dynamo vLLM Options — Disaggregation mode, tokenizer selection, sleep mode, multimodal flags, vLLM-Omni pipeline configuration, headless mode, and ModelExpress. These use DYN_VLLM_* env vars.
vLLM Engine Options — All native vLLM arguments (--model, --tensor-parallel-size, --kv-transfer-config, --kv-events-config, --enable-prefix-caching, etc.). See the vLLM serve args documentation.

Prompt Embeddings

Dynamo supports vLLM prompt embeddings — pre-computed embeddings bypass tokenization in the Rust frontend and are decoded to tensors in the worker.

Enable with --enable-prompt-embeds (disabled by default)
Embeddings are sent as base64-encoded PyTorch tensors via the prompt_embeds field in the Completions API
NATS must be configured with a maximum payload of 15 MB to carry large embeddings (default deployments already set this)
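To judge whether an embedding is likely to fit under the NATS payload limit, a rough size estimate can help. The sketch below is illustrative and not part of Dynamo: the helper names are hypothetical, "15 MB" is read as 15 MiB, and the estimate ignores tensor serialization headers and the rest of the request body, so treat it as a lower bound.

```python
# Assumption: the 15 MB limit is interpreted as 15 MiB.
NATS_MAX_PAYLOAD_BYTES = 15 * 1024 * 1024


def base64_len(raw_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def estimate_prompt_embeds_bytes(
    num_tokens: int, hidden_size: int, bytes_per_element: int = 2
) -> int:
    """Lower bound on the base64 size of a (num_tokens, hidden_size) tensor.

    bytes_per_element defaults to 2 (e.g. float16/bfloat16). Serialization
    headers and other request fields are not counted.
    """
    return base64_len(num_tokens * hidden_size * bytes_per_element)


def fits_in_nats(
    num_tokens: int,
    hidden_size: int,
    bytes_per_element: int = 2,
    limit: int = NATS_MAX_PAYLOAD_BYTES,
) -> bool:
    """True if the estimated payload stays within the configured limit."""
    return estimate_prompt_embeds_bytes(num_tokens, hidden_size, bytes_per_element) <= limit


if __name__ == "__main__":
    for tokens in (1024, 2048):
        size = estimate_prompt_embeds_bytes(tokens, 4096)
        print(tokens, size, fits_in_nats(tokens, 4096))
```

Base64 adds roughly one third to the raw tensor size, so the practical ceiling on embedding size is noticeably lower than the NATS limit itself.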

Hashing Consistency for KV Events

When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:

Set PYTHONHASHSEED=0 for all vLLM processes when relying on Python’s built-in hashing for prefix caching.
If your vLLM version supports it, configure a deterministic prefix-caching hash algorithm:

$ vllm serve ... --enable-prefix-caching --prefix-caching-hash-algo sha256

See the high-level notes in Router Design on deterministic event IDs.
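The reason for this requirement is that Python randomizes string and bytes hashes per process unless PYTHONHASHSEED is fixed, whereas a cryptographic digest such as SHA-256 depends only on its input. The sketch below demonstrates the difference with the standard library. It is not vLLM's block-hashing code; the function names and the way token IDs are serialized are assumptions made for illustration.

```python
import hashlib
import os
import subprocess
import sys


def builtin_hash_in_subprocess(text: str, seed: str) -> int:
    """Return hash(text) as computed by a fresh interpreter with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


def stable_block_hash(token_ids: list[int]) -> str:
    """Hypothetical content hash of a token block; identical in every process."""
    data = ",".join(str(t) for t in token_ids).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    key = "shared prefix"
    # Same seed in two separate processes: the hashes agree.
    print(builtin_hash_in_subprocess(key, "0") == builtin_hash_in_subprocess(key, "0"))
    # Different seeds stand in for unseeded processes: the hashes disagree.
    print(builtin_hash_in_subprocess(key, "1") == builtin_hash_in_subprocess(key, "2"))
    # SHA-256 needs no seed at all.
    print(stable_block_hash([1, 2, 3]))
```

If workers disagree on block hashes, the router would index the same prefix under different keys, which is the radix tree mismatch described above.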

Graceful Shutdown

vLLM workers use Dynamo’s graceful shutdown mechanism. When a SIGTERM or SIGINT is received:

Discovery unregister: The worker is removed from service discovery so no new requests are routed to it
Grace period: In-flight requests are allowed to complete (configurable via DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS, default 5s)
Resource cleanup: Engine resources and temporary files (Prometheus dirs, LoRA adapters) are released

All vLLM endpoints use graceful_shutdown=True, meaning they wait for in-flight requests to finish before exiting. An internal VllmEngineMonitor also checks engine health every 2 seconds and initiates shutdown if the engine becomes unresponsive.

For more details, see Graceful Shutdown.
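The grace-period step can be pictured as a bounded wait on in-flight work. The following is a minimal asyncio sketch, not Dynamo's implementation: the function names are hypothetical, and cancelling requests that outlive the grace period is an assumption about what happens after the timeout. Only the environment variable name and its 5-second default come from the text above.

```python
import asyncio
import os


def grace_period_secs(env=os.environ, default: float = 5.0) -> float:
    """Read the grace period from the environment, falling back to 5 seconds."""
    raw = env.get("DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS")
    return float(raw) if raw else default


async def drain_in_flight(tasks: set, grace: float):
    """Wait up to `grace` seconds for tasks to finish, then cancel the rest.

    Returns (done, cancelled) sets of tasks.
    """
    if not tasks:
        return set(), set()
    done, pending = await asyncio.wait(tasks, timeout=grace)
    for task in pending:
        task.cancel()  # assumption: leftover requests are cancelled
    # Let the cancellations propagate before resources are released.
    await asyncio.gather(*pending, return_exceptions=True)
    return done, pending


async def demo():
    fast = asyncio.create_task(asyncio.sleep(0.01, result="ok"))
    slow = asyncio.create_task(asyncio.sleep(10))
    done, cancelled = await drain_in_flight({fast, slow}, grace=0.2)
    return fast in done, slow.cancelled()


if __name__ == "__main__":
    print(grace_period_secs())
    print(asyncio.run(demo()))
```

In a real worker this drain would run after discovery unregistration and before resource cleanup, matching the order of the steps listed above.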

Health Checks

Each worker type has a specialized health check payload that validates the full inference pipeline:

Worker Type	Health Check Strategy
Decode / Aggregated	Short generation request (`max_tokens=1`) using the model’s BOS token
Prefill	Same payload structure as decode, adapted for prefill request format
vLLM-Omni	Short generation request via AsyncOmni with the model’s BOS token

Health checks are registered with the Dynamo runtime and invoked by the frontend or by Kubernetes liveness probes. The payload can be overridden via the DYN_HEALTH_CHECK_PAYLOAD environment variable. See Health Checks for the broader health check architecture.
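The default-plus-override behavior can be sketched as follows. This is an illustration only: the payload field names (token_ids, max_tokens) and the assumption that the override is a JSON string are hypothetical, and the real payload schema depends on the worker type and Dynamo version.

```python
import json
import os


def health_check_payload(bos_token_id: int, env=os.environ) -> dict:
    """Build a one-token generation request, unless an override is set.

    Field names are hypothetical; check the worker's actual request schema.
    """
    override = env.get("DYN_HEALTH_CHECK_PAYLOAD")
    if override:
        # Assumption: the override is supplied as a JSON object string.
        return json.loads(override)
    return {"token_ids": [bos_token_id], "max_tokens": 1}


if __name__ == "__main__":
    print(health_check_payload(1, env={}))
    custom = {"DYN_HEALTH_CHECK_PAYLOAD": '{"token_ids": [2], "max_tokens": 1}'}
    print(health_check_payload(1, env=custom))
```

Requesting a single token keeps the probe cheap while still exercising the full path from request handling through the engine.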

Request Cancellation

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources.

Deployment Mode	Prefill	Decode
Aggregated	✅	✅
Disaggregated	✅	✅

For more details, see the Request Cancellation Architecture documentation.
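As a rough mental model of how a cancellation reaches the engine, the sketch below uses plain asyncio: cancelling the task that streams tokens raises CancelledError inside the handler, which then aborts the engine-side request. FakeEngine and handle_request are hypothetical stand-ins, not Dynamo or vLLM classes, and the actual propagation path across workers is described in the architecture documentation.

```python
import asyncio


class FakeEngine:
    """Stand-in for an inference engine; records which requests were aborted."""

    def __init__(self):
        self.aborted = []

    async def generate(self, request_id: str, n_tokens: int):
        for i in range(n_tokens):
            await asyncio.sleep(0.01)  # simulate per-token compute
            yield f"tok{i}"

    def abort(self, request_id: str) -> None:
        self.aborted.append(request_id)


async def handle_request(engine, request_id: str, n_tokens: int, out: list) -> None:
    """Stream tokens into `out`; on cancellation, free the engine-side request."""
    try:
        async for token in engine.generate(request_id, n_tokens):
            out.append(token)
    except asyncio.CancelledError:
        engine.abort(request_id)
        raise  # re-raise so the caller sees the task as cancelled


async def demo():
    engine = FakeEngine()
    out = []
    task = asyncio.create_task(handle_request(engine, "req-1", 1000, out))
    await asyncio.sleep(0.05)  # the client disconnects mid-generation
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return engine.aborted, len(out)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Re-raising the exception after the abort keeps the task in a cancelled state, so upstream components can tell a cancelled request apart from one that completed.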

Request Migration

Dynamo supports request migration to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the Request Migration Architecture documentation for configuration details.