
Copyright © 2026, NVIDIA Corporation.


Reference Guide

Configuration, arguments, and operational details for the vLLM backend

Overview

The vLLM backend in Dynamo integrates vLLM engines into Dynamo’s distributed runtime, enabling disaggregated serving, KV-aware routing, and request cancellation. Dynamo leverages vLLM’s native KV cache events, NIXL-based transfer mechanisms, and metric reporting.

Dynamo vLLM uses vLLM’s native argument parser — all vLLM engine arguments are passed through directly. Dynamo adds its own arguments for disaggregation mode, KV transfer, and prompt embeddings.

Argument Reference

The vLLM backend accepts all upstream vLLM engine arguments plus Dynamo-specific arguments. The authoritative source is always the CLI:

$ python -m dynamo.vllm --help

The --help output is organized into the following groups:

  • Dynamo Runtime Options — Namespace, discovery backend, request/event plane, endpoint types, tool/reasoning parsers, and custom chat templates. These are common across all Dynamo backends and use DYN_* env vars.
  • Dynamo vLLM Options — Disaggregation mode, tokenizer selection, sleep mode, multimodal flags, vLLM-Omni pipeline configuration, headless mode, and ModelExpress. These use DYN_VLLM_* env vars.
  • vLLM Engine Options — All native vLLM arguments (--model, --tensor-parallel-size, --kv-transfer-config, --kv-events-config, --enable-prefix-caching, etc.). See the vLLM serve args documentation.

Prompt Embeddings

Dynamo supports vLLM prompt embeddings — pre-computed embeddings bypass tokenization in the Rust frontend and are decoded to tensors in the worker.

  • Enable with --enable-prompt-embeds (disabled by default)
  • Embeddings are sent as base64-encoded PyTorch tensors via the prompt_embeds field in the Completions API
  • NATS must be configured with a 15 MB max payload for large embeddings (already set in default deployments)
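
As an illustration of the points above, the following sketch builds a Completions request body that carries a base64-encoded tensor in the prompt_embeds field and roughly checks it against the NATS payload limit. The helper names are hypothetical. The empty prompt field, the use of torch.save for serialization, and the reading of 15 MB as 15 MiB are assumptions rather than confirmed Dynamo behavior.

```python
import base64
import io
import json

# Assumption: the 15 MB NATS limit from the text is counted in MiB.
NATS_MAX_PAYLOAD_BYTES = 15 * 1024 * 1024


def serialize_tensor(embeds) -> bytes:
    """Serialize a PyTorch tensor with torch.save (needs torch installed)."""
    import torch  # imported lazily so the other helpers work without torch

    buffer = io.BytesIO()
    torch.save(embeds, buffer)
    return buffer.getvalue()


def build_completion_body(model: str, tensor_bytes: bytes, max_tokens: int = 16) -> dict:
    """Build a Completions request body carrying base64-encoded embeddings."""
    return {
        "model": model,
        "prompt": "",  # assumption: left empty because the embeddings replace it
        "max_tokens": max_tokens,
        "prompt_embeds": base64.b64encode(tensor_bytes).decode("ascii"),
    }


def fits_in_nats(body: dict, limit: int = NATS_MAX_PAYLOAD_BYTES) -> bool:
    """Rough check: is the JSON-encoded body within the payload limit?"""
    return len(json.dumps(body).encode("utf-8")) <= limit
```

Base64 inflates data by about one third, so under these assumptions a 15 MiB limit leaves room for roughly 11 MiB of serialized tensor data per request.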

Hashing Consistency for KV Events

When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:

  • Set PYTHONHASHSEED=0 for all vLLM processes when relying on Python’s built-in hashing for prefix caching.
  • If your vLLM version supports it, configure a deterministic prefix caching algorithm:
$ vllm serve ... --enable-prefix-caching --prefix-caching-hash-algo sha256

See the high-level notes in Router Design on deterministic event IDs.
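
To make the two options concrete, here is a minimal sketch. The first helper runs Python's built-in hash() in a fresh interpreter with a fixed PYTHONHASHSEED, which is what makes string hashes repeat across processes. The second shows a chained SHA-256 block hash that is process-independent by construction. Both function names are hypothetical, and the chaining scheme is an illustration only, not vLLM's actual block-hash layout.

```python
import hashlib
import os
import struct
import subprocess
import sys


def builtin_hash_in_new_process(text: str, seed: str) -> int:
    """Return hash(text) computed by a fresh interpreter with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())


def chained_block_hashes(token_ids: list[int], block_size: int) -> list[str]:
    """Hash each full block of tokens together with its parent block's hash."""
    hashes = []
    parent = b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        packed = struct.pack(f"<{len(block)}I", *block)  # little-endian uint32 per token
        digest = hashlib.sha256(parent + packed).digest()
        hashes.append(digest.hex())
        parent = digest
    return hashes
```

Two sequences that share a leading block produce the same first hash in any process, which is the property a router needs when it matches prefixes reported by different workers.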

Graceful Shutdown

vLLM workers use Dynamo’s graceful shutdown mechanism. When a SIGTERM or SIGINT is received:

  1. Discovery unregister: The worker is removed from service discovery so no new requests are routed to it
  2. Grace period: In-flight requests are allowed to complete (configurable via DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS, default 5s)
  3. Resource cleanup: Engine resources and temporary files (Prometheus dirs, LoRA adapters) are released

All vLLM endpoints use graceful_shutdown=True, meaning they wait for in-flight requests to finish before exiting. An internal VllmEngineMonitor also checks engine health every 2 seconds and initiates shutdown if the engine becomes unresponsive.
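
The three phases above can be sketched with plain asyncio. This is an illustration of the ordering, not Dynamo's implementation: the function names are hypothetical, the grace period is assumed to parse as a number of seconds, and cancelling requests that outlive the grace period is an assumption.

```python
import asyncio
import os


def grace_period_secs(env=None) -> float:
    """Read the grace period from the environment, defaulting to 5 seconds."""
    env = os.environ if env is None else env
    return float(env.get("DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS", "5"))


async def graceful_shutdown(unregister, in_flight, cleanup, grace_secs: float) -> dict:
    """Run the three phases in order and report how the in-flight tasks ended."""
    await unregister()  # 1. leave discovery so no new requests arrive
    done, pending = set(), set()
    if in_flight:  # asyncio.wait rejects an empty collection
        # 2. grace period: wait for in-flight tasks, up to grace_secs
        done, pending = await asyncio.wait(in_flight, timeout=grace_secs)
    for task in pending:  # assumption: leftovers are cancelled after the grace period
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    await cleanup()  # 3. release engine resources and temporary files
    return {"completed": len(done), "cancelled": len(pending)}
```

In a real process, a coroutine like this would typically be scheduled from a SIGTERM or SIGINT handler, for example one registered with loop.add_signal_handler on Unix.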

For more details, see Graceful Shutdown.

Health Checks

Each worker type has a specialized health check payload that validates the full inference pipeline:

| Worker Type | Health Check Strategy |
| --- | --- |
| Decode / Aggregated | Short generation request (max_tokens=1) using the model’s BOS token |
| Prefill | Same payload structure as decode, adapted for prefill request format |
| vLLM-Omni | Short generation request via AsyncOmni with the model’s BOS token |

Health checks are registered with the Dynamo runtime and called by the frontend or Kubernetes liveness probes. The payload can be overridden via the DYN_HEALTH_CHECK_PAYLOAD environment variable. See Health Checks for the broader health check architecture.
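
The override logic can be pictured as follows. This is a sketch under stated assumptions: the function names and the payload field names (token_ids, stop_conditions) are hypothetical, and the environment variable is assumed to hold a JSON object. Check the Health Checks documentation for the actual payload schema.

```python
import json
import os


def default_health_check_payload(bos_token_id: int) -> dict:
    """One-token generation request built from the model's BOS token."""
    return {
        "token_ids": [bos_token_id],           # hypothetical field name
        "stop_conditions": {"max_tokens": 1},  # hypothetical field name
    }


def resolve_health_check_payload(bos_token_id: int, env=None) -> dict:
    """Use DYN_HEALTH_CHECK_PAYLOAD if set, otherwise fall back to the default."""
    env = os.environ if env is None else env
    override = env.get("DYN_HEALTH_CHECK_PAYLOAD")
    if override:
        return json.loads(override)  # assumption: the variable holds a JSON object
    return default_health_check_payload(bos_token_id)
```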

Request Cancellation

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources.

| Mode | Prefill | Decode |
| --- | --- | --- |
| Aggregated | ✅ | ✅ |
| Disaggregated | ✅ | ✅ |
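
The propagation described above can be imitated with asyncio task cancellation. The sketch below is a generic illustration with hypothetical names, not Dynamo's cancellation mechanism: a request runs a prefill stage and then a decode stage, and a simulated client disconnect cancels whichever stage is still running.

```python
import asyncio


async def simulate_client_disconnect() -> list[str]:
    """Cancel a request mid-decode and record what each stage observed."""
    log: list[str] = []
    decode_started = asyncio.Event()

    async def prefill() -> None:
        await asyncio.sleep(0.01)  # stand-in for prompt processing
        log.append("prefill:finished")

    async def decode() -> None:
        decode_started.set()
        try:
            while True:
                await asyncio.sleep(0.01)  # stand-in for generating one token
        except asyncio.CancelledError:
            log.append("decode:cancelled")  # the stage stops doing work here
            raise

    async def request() -> None:
        await prefill()
        await decode()

    task = asyncio.create_task(request())
    await decode_started.wait()  # the client disconnects while decode is running
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        log.append("request:cancelled")
    return log
```

Running asyncio.run(simulate_client_disconnect()) returns ["prefill:finished", "decode:cancelled", "request:cancelled"], showing that the stage in progress is the one that observes the cancellation.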

For more details, see the Request Cancellation Architecture documentation.

Request Migration

Dynamo supports request migration to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the Request Migration Architecture documentation for configuration details.

See Also

  • Examples: All deployment patterns with launch scripts
  • vLLM README: Quick start and feature overview
  • Observability: Metrics and monitoring setup
  • Router Guide: KV-aware routing configuration
  • Fault Tolerance: Request migration, cancellation, and graceful shutdown