The SGLang backend in Dynamo uses a modular architecture where main.py dispatches to specialized initialization modules based on the worker type. Each worker type has its own init module, request handler, health check, and registration logic.
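The dispatch step can be pictured as a lookup from worker type to an init function. The sketch below is illustrative only: the worker type names (`decode`, `prefill`) and all function names are hypothetical placeholders, not the identifiers used in Dynamo's main.py.

```python
import asyncio
from typing import Awaitable, Callable, Dict


# Hypothetical init functions. In the real backend each init module would also
# wire up the request handler, health check, and registration for its type.
async def init_decode_worker() -> str:
    return "decode worker ready"


async def init_prefill_worker() -> str:
    return "prefill worker ready"


# Hypothetical worker type names mapped to their init entry points.
INIT_BY_WORKER_TYPE: Dict[str, Callable[[], Awaitable[str]]] = {
    "decode": init_decode_worker,
    "prefill": init_prefill_worker,
}


async def dispatch(worker_type: str) -> str:
    """Look up the init function for a worker type and run it."""
    try:
        init = INIT_BY_WORKER_TYPE[worker_type]
    except KeyError:
        raise ValueError(f"unknown worker type: {worker_type!r}") from None
    return await init()
```

Keeping the mapping in one table means adding a worker type touches a single entry plus its own module, which is the main appeal of this kind of layout.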
Dynamo SGLang uses SGLang’s native argument parser, so all SGLang engine arguments (e.g., --model-path, --tp, --trust-remote-code) are passed through directly. Dynamo adds its own arguments for worker mode selection, tokenizer control, and disaggregation configuration.
These arguments are added by Dynamo on top of SGLang’s native arguments.
--disagg-config and --disagg-config-key must be provided together. The selected section is written to a temp YAML file and passed to SGLang’s --config flag.
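The pairing rule and the temp-file handoff can be sketched as follows. This is a minimal illustration, not Dynamo's implementation: the function names are hypothetical, the config is assumed to be already parsed into a dict, and the section is serialized as JSON (which YAML 1.2 parsers accept) only to avoid a third-party YAML dependency in the example.

```python
import argparse
import json
import tempfile


def parse_dynamo_args(argv):
    """Parse the two disaggregation flags; leave all other arguments untouched."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--disagg-config")
    parser.add_argument("--disagg-config-key")
    # Unknown arguments (the SGLang engine flags) are returned as-is.
    args, passthrough = parser.parse_known_args(argv)
    # The flags are only meaningful as a pair: reject one without the other.
    if (args.disagg_config is None) != (args.disagg_config_key is None):
        parser.error(
            "--disagg-config and --disagg-config-key must be provided together"
        )
    return args, passthrough


def write_selected_section(full_config, key):
    """Write full_config[key] to a temp file and return the file's path."""
    if key not in full_config:
        raise KeyError(f"section {key!r} not found in disagg config")
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
        # JSON output stands in for YAML here (assumption for this sketch).
        json.dump(full_config[key], f, indent=2)
    return f.name
```

The returned path is what would then be handed to SGLang's --config flag, alongside the passthrough arguments.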
By default, Dynamo handles tokenization and detokenization through its Rust-based frontend, passing input_ids to SGLang. This enables all frontend endpoints (v1/chat/completions, v1/completions, v1/embeddings).
For SGLang-native preprocessing (tool calling, reasoning parsing, chat templates), use --dyn-chat-processor sglang on the frontend. See SGLang Chat Processor for architecture and usage.
--use-sglang-tokenizer is deprecated. Use --dyn-chat-processor sglang on the frontend instead, which provides the same SGLang-native processing with KV router support and the completions endpoint.
When a client disconnects, Dynamo automatically cancels the in-flight request across all workers, freeing compute resources. A background cancellation monitor detects disconnection and aborts the SGLang request.
For details on the cancellation architecture, see Request Cancellation.
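The monitor pattern described above can be sketched with plain asyncio. This is a generic illustration under stated assumptions: `fake_engine_request` stands in for an SGLang request, and an `asyncio.Event` stands in for whatever signal Dynamo uses to report a client disconnect; neither is Dynamo's actual API.

```python
import asyncio


async def fake_engine_request(tokens, n_tokens):
    """Stand-in for an engine request: produces one token per tick."""
    for i in range(n_tokens):
        await asyncio.sleep(0.01)
        tokens.append(i)


async def run_with_cancellation(disconnected, n_tokens):
    """Run a request alongside a monitor that aborts it on disconnect.

    Returns (tokens_produced, aborted).
    """
    tokens = []
    request = asyncio.create_task(fake_engine_request(tokens, n_tokens))
    # Background monitor: completes as soon as the client disconnects.
    monitor = asyncio.create_task(disconnected.wait())
    await asyncio.wait({request, monitor}, return_when=asyncio.FIRST_COMPLETED)
    # If the request is still running, the monitor fired first: abort it.
    aborted = not request.done()
    for task in (request, monitor):
        if not task.done():
            task.cancel()
    # Let cancelled tasks unwind before returning.
    await asyncio.gather(request, monitor, return_exceptions=True)
    return tokens, aborted
```

Whichever task finishes first decides the outcome, and the other is cancelled, so neither the monitor nor the request outlives the call.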
SGLang workers use Dynamo’s graceful shutdown mechanism. When a SIGTERM or SIGINT is received:
loop.add_signal_handler) are invoked after the graceful period
This ensures zero dropped requests during rolling updates or scale-down events.
For more details, see Graceful Shutdown.
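As a rough picture of signal-driven draining, the sketch below uses the standard asyncio `loop.add_signal_handler` call to set a shutdown event and then waits for in-flight work within a grace period. It is a generic pattern, not Dynamo's mechanism: the function names and the grace-period parameter are assumptions, and unlike the behavior described above it cancels any request that outlives the grace period.

```python
import asyncio
import signal


def install_shutdown_event(loop):
    """Return an Event that is set when SIGTERM or SIGINT arrives.

    Call from inside the running loop; add_signal_handler is Unix-only.
    """
    shutdown = asyncio.Event()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, shutdown.set)
    return shutdown


async def drain(in_flight, grace_period_s):
    """Wait up to grace_period_s for in-flight tasks, then cancel stragglers.

    Returns the number of tasks that had to be cancelled.
    """
    if not in_flight:
        return 0
    _done, pending = await asyncio.wait(in_flight, timeout=grace_period_s)
    for task in pending:
        task.cancel()
    # Let cancelled tasks unwind so resources are released before exit.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(pending)
```

A worker would typically await the shutdown event, stop accepting new requests, and then call `drain` on its set of in-flight tasks.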
Each worker type has a specialized health check payload that validates the full inference pipeline:
Health checks are registered with the Dynamo runtime and called by the frontend or Kubernetes liveness probes. See Health Checks for the broader health check architecture.
Enable metrics with --enable-metrics on the worker. Set DYN_SYSTEM_PORT to expose the /metrics endpoint:
Both SGLang engine metrics (sglang:* prefix) and Dynamo runtime metrics (dynamo_* prefix) are served from the same endpoint.
For metric details, see SGLang Observability. For visualization setup, see Prometheus + Grafana.
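To check that both metric families are being served, the endpoint can be scraped and its samples grouped by prefix. This is a minimal sketch: the `localhost` host and the helper names are assumptions, and only the `sglang:` and `dynamo_` prefixes and the /metrics path come from the text above.

```python
import os
import urllib.request


def fetch_metrics(host="localhost"):
    """Fetch the raw metrics text from the worker's system port."""
    port = os.environ["DYN_SYSTEM_PORT"]  # must be set, as for the worker
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def group_by_prefix(metrics_text):
    """Split Prometheus-format sample lines into sglang / dynamo / other."""
    groups = {"sglang": [], "dynamo": [], "other": []}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blanks and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith("sglang:"):
            groups["sglang"].append(line)
        elif name.startswith("dynamo_"):
            groups["dynamo"].append(line)
        else:
            groups["other"].append(line)
    return groups
```

With a worker running, `group_by_prefix(fetch_metrics())` should return non-empty `sglang` and `dynamo` lists if both sources are exporting.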
When configured with --kv-events-config, workers publish KV cache events (block creation/deletion) for the KV-aware router. Events are published via ZMQ from SGLang’s scheduler and relayed through Dynamo’s event plane.
For DP attention mode (--enable-dp-attention), the publisher handles multiple DP ranks per node, each with its own KV event stream.
SGLang workers expose operational endpoints via Dynamo’s system server: