
Copyright © 2026, NVIDIA Corporation.

SGLang HiCache

This guide shows how to enable SGLang’s Hierarchical Cache (HiCache) inside Dynamo.

1) Start the SGLang worker with HiCache enabled

python -m dynamo.sglang \
  --model-path Qwen/Qwen3-0.6B \
  --host 0.0.0.0 --port 8000 \
  --page-size 64 \
  --enable-hierarchical-cache \
  --hicache-ratio 2 \
  --hicache-write-policy write_through \
  --hicache-storage-backend nixl \
  --log-level debug \
  --skip-tokenizer-init

  • --enable-hierarchical-cache: Enables the hierarchical KV cache, which offloads KV blocks from device memory to host memory.
  • --hicache-ratio: The ratio of the host KV cache memory pool size to the device pool size. Lower this value if your machine has limited CPU memory.
  • --hicache-write-policy: Write policy (e.g., write_through for synchronous host writes).
  • --hicache-storage-backend: Host storage backend for HiCache (e.g., nixl). NIXL selects the concrete store automatically; see PR #8488.
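
To make the --hicache-ratio description concrete, here is a minimal sizing sketch. It assumes only the definition above (host pool size divided by device pool size). The 10 GiB device pool and 15 GiB CPU budget are illustrative numbers, and the helper names are hypothetical, not part of SGLang or Dynamo. Actual allocation may differ, for example per GPU or with additional overhead.

```python
def host_pool_gib(device_pool_gib: float, hicache_ratio: float) -> float:
    """Host KV pool size implied by a given ratio (host pool / device pool)."""
    if device_pool_gib <= 0 or hicache_ratio <= 0:
        raise ValueError("pool size and ratio must be positive")
    return device_pool_gib * hicache_ratio


def max_ratio_for_budget(device_pool_gib: float, host_budget_gib: float) -> float:
    """Largest ratio whose host pool still fits in the given CPU memory budget."""
    if device_pool_gib <= 0 or host_budget_gib <= 0:
        raise ValueError("pool size and budget must be positive")
    return host_budget_gib / device_pool_gib


# Illustrative numbers only (not measured): a 10 GiB device KV pool.
print(host_pool_gib(10.0, 2.0))          # 20.0 GiB of host memory at ratio 2
print(max_ratio_for_budget(10.0, 15.0))  # 1.5 is the largest ratio for a 15 GiB budget
```

If the host pool implied by your ratio exceeds the CPU memory you can spare, pick a ratio at or below the second value.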

Then, start the frontend:

python -m dynamo.frontend --http-port 8000
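
Before sending requests, it can help to wait until the frontend is accepting connections. The sketch below polls the TCP port with the standard library only. The function name and default timeouts are hypothetical choices for illustration. An open port shows only that the HTTP server is listening, not that the worker has finished loading the model.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For the setup in this guide, the call would be `wait_for_port("localhost", 8000)`.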

2) Send a single request

curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
      }
    ],
    "stream": false,
    "max_tokens": 30
  }'
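
The same request can be sent from Python. This is a minimal sketch using only the standard library. It mirrors the URL, headers, and JSON body of the curl command above; the helper names and the 60-second timeout are hypothetical.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, max_tokens: int = 30) -> dict:
    """Build the same JSON body as the curl example (non-streaming)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, base_url: str = "http://localhost:8000",
                      timeout_s: float = 60.0) -> dict:
    """POST the payload to /v1/chat/completions and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the worker and frontend running, `send_chat_request(build_chat_payload("Qwen/Qwen3-0.6B", "Explain why Roger Federer is considered one of the greatest tennis players of all time"))` returns the parsed response. If the response follows the OpenAI chat completions shape (an assumption based on the endpoint path), the generated text would be under `choices[0]["message"]["content"]`.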

3) (Optional) Benchmarking

Run the perf script, with DYNAMO_ROOT set to the root of your Dynamo repository checkout:

bash -x $DYNAMO_ROOT/benchmarks/llm/perf.sh \
  --model Qwen/Qwen3-0.6B \
  --tensor-parallelism 1 \
  --data-parallelism 1 \
  --concurrency "2,4,8" \
  --input-sequence-length 2048 \
  --output-sequence-length 256
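
To get a rough sense of the KV cache pressure this sweep creates, the sketch below computes an upper bound on tokens in flight at each concurrency level. It assumes that "2,4,8" is a comma-separated list of concurrency levels run one after another, and that each request holds at most input plus output tokens. It ignores any prefix sharing between requests, and the helper names are hypothetical.

```python
from typing import List


def parse_concurrency(spec: str) -> List[int]:
    """Split a comma-separated concurrency list such as "2,4,8" into integers."""
    return [int(part) for part in spec.split(",") if part.strip()]


def tokens_in_flight(concurrency: int, input_len: int, output_len: int) -> int:
    """Upper bound on tokens held at once: each request peaks at input + output."""
    return concurrency * (input_len + output_len)


# Values taken from the perf.sh invocation above.
for level in parse_concurrency("2,4,8"):
    print(level, tokens_in_flight(level, 2048, 256))
# 2 4608
# 4 9216
# 8 18432
```

Comparing these totals with the device KV pool capacity gives a hint of when offloading to the host pool is likely to occur during the run.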