Copyright © 2026, NVIDIA Corporation.

Reference Guide

Architecture, configuration, and operational details for the SGLang backend

Overview

The SGLang backend in Dynamo uses a modular architecture where main.py dispatches to specialized initialization modules based on the worker type. Each worker type has its own init module, request handler, health check, and registration logic.

Dynamo SGLang uses SGLang’s native argument parser, so all SGLang engine arguments (e.g., --model-path, --tp, --trust-remote-code) are passed through directly. Dynamo adds its own arguments for worker mode selection, tokenizer control, and disaggregation configuration.
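
As a rough illustration of layering wrapper flags over an engine's own flags, the sketch below uses the standard library's `argparse.parse_known_args` to peel off two of the Dynamo flags from this guide and leave the rest untouched. This is one common pattern and an assumption for illustration, not necessarily how Dynamo wires its parser; the `split_args` helper is hypothetical.

```python
import argparse


def split_args(argv):
    """Separate wrapper-specific flags from flags meant for the engine parser.

    Hypothetical helper: the flag names come from this guide, the parsing
    strategy is an illustrative assumption.
    """
    wrapper = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    wrapper.add_argument("--endpoint", default=None)
    wrapper.add_argument("--embedding-worker", action="store_true")
    # parse_known_args returns (namespace, leftovers); leftovers keep their order.
    own, passthrough = wrapper.parse_known_args(argv)
    return own, passthrough


own, engine_argv = split_args(
    ["--model-path", "Qwen/Qwen3-0.6B", "--embedding-worker", "--tp", "2"]
)
print(own.embedding_worker)  # True
print(engine_argv)           # ['--model-path', 'Qwen/Qwen3-0.6B', '--tp', '2']
```

The leftover list can then be handed to the engine's parser unchanged, which is what keeps engine flags usable without re-declaring them in the wrapper.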

Worker Types

| Worker Type | Description |
| --- | --- |
| Decode (default) | Standard LLM inference (aggregated or disaggregated decode) |
| Prefill | Disaggregated prefill phase (--disaggregation-mode prefill) |
| Embedding | Text embedding models (--embedding-worker) |
| Multimodal Processor | HTTP entry point for multimodal, OpenAI-to-SGLang conversion (--multimodal-processor) |
| Multimodal Encode | Vision encoder and embeddings generation (--multimodal-encode-worker) |
| Multimodal Worker | LLM inference with multimodal data (--multimodal-worker) |
| Multimodal Prefill | Prefill phase for multimodal disaggregation (--multimodal-worker --disaggregation-mode prefill) |
| Image Diffusion | Image generation via DiffGenerator (--image-diffusion-worker) |
| Video Generation | Text/image-to-video via DiffGenerator (--video-generation-worker) |
| LLM Diffusion | Diffusion language models like LLaDA (--dllm-algorithm <algo>) |

Argument Reference

Dynamo-Specific Arguments

These arguments are added by Dynamo on top of SGLang’s native arguments.

| Argument | Env Var | Default | Description |
| --- | --- | --- | --- |
| --endpoint | DYN_ENDPOINT | Auto-generated | Dynamo endpoint in dyn://namespace.component.endpoint format |
| --use-sglang-tokenizer | DYN_SGL_USE_TOKENIZER | false | [Deprecated] Use --dyn-chat-processor sglang on the frontend instead. See SGLang Chat Processor. |
| --dyn-tool-call-parser | DYN_TOOL_CALL_PARSER | None | Tool call parser (overrides SGLang’s --tool-call-parser) |
| --dyn-reasoning-parser | DYN_REASONING_PARSER | None | Reasoning parser for chain-of-thought models |
| --custom-jinja-template | DYN_CUSTOM_JINJA_TEMPLATE | None | Custom chat template path (incompatible with --use-sglang-tokenizer) |
| --embedding-worker | DYN_SGL_EMBEDDING_WORKER | false | Run as embedding worker (also sets SGLang’s --is-embedding) |
| --multimodal-processor | DYN_SGL_MULTIMODAL_PROCESSOR | false | Run as multimodal processor |
| --multimodal-encode-worker | DYN_SGL_MULTIMODAL_ENCODE_WORKER | false | Run as multimodal encode worker |
| --multimodal-worker | DYN_SGL_MULTIMODAL_WORKER | false | Run as multimodal LLM worker |
| --image-diffusion-worker | DYN_SGL_IMAGE_DIFFUSION_WORKER | false | Run as image diffusion worker |
| --video-generation-worker | DYN_SGL_VIDEO_GENERATION_WORKER | false | Run as video generation worker |
| --disagg-config | DYN_SGL_DISAGG_CONFIG | None | Path to YAML disaggregation config file |
| --disagg-config-key | DYN_SGL_DISAGG_CONFIG_KEY | None | Key to select from disaggregation config (e.g., prefill, decode) |
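
To make the dyn://namespace.component.endpoint format concrete, here is a small sketch that splits such a string into its three parts. The `parse_endpoint` helper and the example values are hypothetical, and the validation rules (exactly three non-empty, dot-separated parts) are an assumption read off the format string rather than Dynamo's actual parser.

```python
from typing import NamedTuple


class Endpoint(NamedTuple):
    namespace: str
    component: str
    endpoint: str


def parse_endpoint(uri: str) -> Endpoint:
    """Split 'dyn://namespace.component.endpoint' into its parts (illustrative)."""
    prefix = "dyn://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a {prefix} URI, got {uri!r}")
    parts = uri[len(prefix):].split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace.component.endpoint, got {uri!r}")
    return Endpoint(*parts)


# Example values are made up for the demonstration.
parsed = parse_endpoint("dyn://demo.backend.generate")
print(parsed.namespace, parsed.component, parsed.endpoint)  # demo backend generate
```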

--disagg-config and --disagg-config-key must be provided together. The selected section is written to a temp YAML file and passed to SGLang’s --config flag.
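
The select-and-write step described above might look roughly like the following. This is a sketch under stated assumptions: the `check_pair` and `write_section` helpers are hypothetical, the config contents are invented, and the section is serialized with `json.dump` (JSON is valid YAML 1.2) only to keep the example free of third-party dependencies; a real implementation would more likely use a YAML library.

```python
import json
import os
import tempfile
from typing import Optional


def check_pair(disagg_config: Optional[str], disagg_config_key: Optional[str]) -> None:
    """Both options must be set, or neither."""
    if (disagg_config is None) != (disagg_config_key is None):
        raise ValueError("--disagg-config and --disagg-config-key must be provided together")


def write_section(full_config: dict, key: str) -> str:
    """Write full_config[key] to a temporary .yaml file and return its path."""
    if key not in full_config:
        raise KeyError(f"no section {key!r} in disaggregation config")
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
        json.dump(full_config[key], f)  # JSON output is also valid YAML
        return f.name


# Hypothetical config contents, already loaded into a dict.
config = {
    "prefill": {"disaggregation-mode": "prefill"},
    "decode": {"disaggregation-mode": "decode"},
}
check_pair("disagg.yaml", "prefill")
section_path = write_section(config, "prefill")
with open(section_path) as f:
    written = f.read()
os.remove(section_path)
print(written)  # {"disaggregation-mode": "prefill"}
```

The returned path is what would be handed to the engine's --config flag; only the selected section reaches the engine, so prefill and decode workers can share one source file.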

Tokenizer Behavior

By default, Dynamo handles tokenization and detokenization through its Rust-based frontend, passing input_ids to SGLang. This enables all frontend endpoints (v1/chat/completions, v1/completions, v1/embeddings).

For SGLang-native preprocessing (tool calling, reasoning parsing, chat templates), use --dyn-chat-processor sglang on the frontend. See SGLang Chat Processor for architecture and usage.

--use-sglang-tokenizer is deprecated. Use --dyn-chat-processor sglang on the frontend instead, which provides the same SGLang-native processing with KV router support and the completions endpoint.

Request Cancellation

When a client disconnects, Dynamo automatically cancels the in-flight request across all workers, freeing compute resources. A background cancellation monitor detects disconnection and aborts the SGLang request.

| Mode | Prefill | Decode |
| --- | --- | --- |
| Aggregated | ✅ | ✅ |
| Disaggregated | ⚠️ | ✅ |

⚠️ Cancellation during remote prefill in disaggregated mode is not currently supported.
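
The background-monitor pattern mentioned in this section can be sketched with plain asyncio. Everything here is a stand-in: `generate` simulates a streaming engine request, and an `asyncio.Event` plays the role of the disconnect signal. It shows the general technique, not Dynamo's actual monitor.

```python
import asyncio


async def generate(tokens_out: list) -> None:
    """Stand-in for an engine request that streams tokens until finished."""
    for i in range(100):
        await asyncio.sleep(0.01)
        tokens_out.append(i)


async def cancel_on_disconnect(disconnected: asyncio.Event, request: asyncio.Task) -> None:
    """Background monitor: abort the request once the client goes away."""
    await disconnected.wait()
    request.cancel()


async def serve():
    tokens = []
    disconnected = asyncio.Event()
    request = asyncio.create_task(generate(tokens))
    monitor = asyncio.create_task(cancel_on_disconnect(disconnected, request))
    await asyncio.sleep(0.05)  # client stays connected briefly
    disconnected.set()         # client disconnects mid-generation
    try:
        await request
    except asyncio.CancelledError:
        pass                   # expected: the monitor aborted the request
    finally:
        monitor.cancel()       # no-op if the monitor already finished
    return request.cancelled(), len(tokens)


cancelled, produced = asyncio.run(serve())
print(cancelled, produced < 100)  # True True
```

Running the monitor as a separate task means the generation loop needs no disconnect checks of its own; cancellation arrives at its next await point.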

For details on the cancellation architecture, see Request Cancellation.

Graceful Shutdown

SGLang workers use Dynamo’s graceful shutdown mechanism. When a SIGTERM or SIGINT is received:

  1. Discovery unregister: The worker is removed from service discovery so no new requests are routed to it
  2. Grace period: In-flight requests are allowed to complete
  3. Deferred handlers: SGLang’s internal signal handlers (captured during startup via monkey-patching loop.add_signal_handler) are invoked after the grace period

This ensures zero dropped requests during rolling updates or scale-down events.
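
The three steps above can be sketched as follows. This is a simplified model, assuming a stock asyncio event loop whose `add_signal_handler` can be replaced on the instance; the `capture_signal_handlers` and `graceful_shutdown` helpers are hypothetical and are not Dynamo's code.

```python
import asyncio
import signal


def capture_signal_handlers(loop):
    """Patch loop.add_signal_handler so later registrations are recorded, not installed."""
    captured = {}

    def recording_add(sig, callback, *args):
        captured[sig] = (callback, args)

    loop.add_signal_handler = recording_add  # monkey-patch on this loop instance only
    return captured


async def graceful_shutdown(unregister, in_flight, captured, grace_seconds):
    unregister()                                  # 1. leave service discovery
    if in_flight:
        await asyncio.wait(in_flight, timeout=grace_seconds)  # 2. grace period
    for callback, args in captured.values():      # 3. deferred engine handlers
        callback(*args)


async def main():
    loop = asyncio.get_running_loop()
    captured = capture_signal_handlers(loop)
    events = []
    # The engine thinks it installed a SIGTERM handler; it was only recorded.
    loop.add_signal_handler(signal.SIGTERM, events.append, "engine-handler")

    async def request():
        await asyncio.sleep(0.01)
        events.append("request-done")

    in_flight = {asyncio.create_task(request())}
    await graceful_shutdown(lambda: events.append("unregistered"), in_flight, captured, 1.0)
    return events


events = asyncio.run(main())
print(events)  # ['unregistered', 'request-done', 'engine-handler']
```

Deferring the engine's own handlers is what keeps it from tearing itself down while requests are still draining.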

For more details, see Graceful Shutdown.

Health Checks

Each worker type has a specialized health check payload that validates the full inference pipeline:

| Worker Type | Health Check Strategy |
| --- | --- |
| Decode / Aggregated | Short generation request (max_new_tokens=1) |
| Prefill | Wrapped prefill-specific request structure |
| Image Diffusion | Minimal image generation request |
| Video Generation | Minimal video generation request |
| Embedding | Standard embedding request |

Health checks are registered with the Dynamo runtime and called by the frontend or Kubernetes liveness probes. See Health Checks for the broader health check architecture.

Metrics and KV Events

Prometheus Metrics

Enable metrics with --enable-metrics on the worker. Set DYN_SYSTEM_PORT to expose the /metrics endpoint:

$ DYN_SYSTEM_PORT=8081 python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --enable-metrics

Both SGLang engine metrics (sglang:* prefix) and Dynamo runtime metrics (dynamo_* prefix) are served from the same endpoint.
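
Because both families share one endpoint, a scraper may want to separate them by prefix. The sketch below groups metric names from Prometheus text-format output; the sample metric names are invented for illustration and the `group_metrics_by_prefix` helper is hypothetical.

```python
from collections import defaultdict


def group_metrics_by_prefix(exposition: str) -> dict:
    """Group sample names from Prometheus text output by their source prefix."""
    groups = defaultdict(list)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith("sglang:"):
            groups["sglang"].append(name)
        elif name.startswith("dynamo_"):
            groups["dynamo"].append(name)
        else:
            groups["other"].append(name)
    return dict(groups)


# Illustrative sample only; real metric names will differ.
sample = """\
# HELP sglang:example_gauge Illustrative engine metric
# TYPE sglang:example_gauge gauge
sglang:example_gauge{model_name="demo"} 3
dynamo_example_total 42
"""
grouped = group_metrics_by_prefix(sample)
print(grouped)  # {'sglang': ['sglang:example_gauge'], 'dynamo': ['dynamo_example_total']}
```

In practice the input text would come from an HTTP GET against the worker's /metrics endpoint on the port set through DYN_SYSTEM_PORT.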

For metric details, see SGLang Observability. For visualization setup, see Prometheus + Grafana.

KV Events

When configured with --kv-events-config, workers publish KV cache events (block creation/deletion) for the KV-aware router. Events are published via ZMQ from SGLang’s scheduler and relayed through Dynamo’s event plane.

For DP attention mode (--enable-dp-attention), the publisher handles multiple DP ranks per node, each with its own KV event stream.
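
To show why a router cares about per-rank event streams, the sketch below keeps a block index per DP rank and scores how much of a request's prefix is already cached on each rank. The event schema (`KvEvent` with `stored`/`removed` kinds and integer block hashes) is entirely hypothetical and chosen for illustration; it is not the wire format Dynamo or SGLang uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class KvEvent:
    """Hypothetical event shape: which rank, what happened, which blocks."""
    dp_rank: int
    kind: str            # "stored" or "removed"
    block_hashes: tuple


class BlockIndex:
    """Tracks which cache blocks each DP rank currently holds."""

    def __init__(self):
        self._blocks = defaultdict(set)  # dp_rank -> set of block hashes

    def apply(self, event: KvEvent) -> None:
        if event.kind == "stored":
            self._blocks[event.dp_rank].update(event.block_hashes)
        elif event.kind == "removed":
            self._blocks[event.dp_rank].difference_update(event.block_hashes)
        else:
            raise ValueError(f"unknown event kind: {event.kind!r}")

    def overlap(self, dp_rank: int, request_hashes) -> int:
        """Count leading request blocks already cached on dp_rank."""
        cached = self._blocks[dp_rank]
        count = 0
        for h in request_hashes:
            if h not in cached:
                break
            count += 1
        return count


index = BlockIndex()
index.apply(KvEvent(dp_rank=0, kind="stored", block_hashes=(1, 2, 3)))
index.apply(KvEvent(dp_rank=1, kind="stored", block_hashes=(1,)))
index.apply(KvEvent(dp_rank=0, kind="removed", block_hashes=(3,)))
print(index.overlap(0, [1, 2, 3, 4]), index.overlap(1, [1, 2]))  # 2 1
```

Keeping the streams separate per rank matters because a block cached on one rank is of no use to a request scheduled on another.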

Engine Routes

SGLang workers expose operational endpoints via Dynamo’s system server:

| Route | Description |
| --- | --- |
| /engine/start_profile | Start PyTorch profiling |
| /engine/stop_profile | Stop profiling and save traces |
| /engine/release_memory_occupation | Release GPU memory for maintenance |
| /engine/resume_memory_occupation | Resume GPU memory after release |
| /engine/update_weights_from_distributor | Update model weights from distributor |
| /engine/update_weights_from_disk | Update model weights from disk |
| /engine/update_weight_version | Update weight version metadata |
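
A client for these routes could be as small as the sketch below, which only builds the request object and does not send it. Several details are assumptions not stated in this guide: the HTTP method (POST), the empty JSON body, and the base URL (localhost with the system port from the metrics example). The `build_engine_request` helper is hypothetical.

```python
import urllib.request

# Route names as listed in the table above.
ENGINE_ROUTES = {
    "/engine/start_profile",
    "/engine/stop_profile",
    "/engine/release_memory_occupation",
    "/engine/resume_memory_occupation",
    "/engine/update_weights_from_distributor",
    "/engine/update_weights_from_disk",
    "/engine/update_weight_version",
}


def build_engine_request(base_url: str, route: str, method: str = "POST") -> urllib.request.Request:
    """Build (but do not send) a request for one engine route.

    The method and the empty JSON body are assumptions; check the deployed
    system server for the actual contract.
    """
    if route not in ENGINE_ROUTES:
        raise ValueError(f"unknown engine route: {route!r}")
    return urllib.request.Request(
        base_url.rstrip("/") + route,
        data=b"{}",
        method=method,
        headers={"Content-Type": "application/json"},
    )


req = build_engine_request("http://localhost:8081/", "/engine/start_profile")
print(req.get_method(), req.full_url)  # POST http://localhost:8081/engine/start_profile
```

Passing the result to `urllib.request.urlopen` would issue the call; validating the route name first avoids sending typos to a live worker.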

See Also

  • Examples: All deployment patterns
  • Disaggregation: P/D architecture and KV transfer
  • Diffusion: LLM, image, and video diffusion models
  • Router Guide: KV-aware routing configuration