Enable SGLang Hierarchical Cache (HiCache)

This guide shows how to enable SGLang’s Hierarchical Cache (HiCache) inside Dynamo.

1) Start the SGLang worker with HiCache enabled

python -m dynamo.sglang \
  --model-path Qwen/Qwen3-0.6B \
  --host 0.0.0.0 --port 8000 \
  --page-size 64 \
  --enable-hierarchical-cache \
  --hicache-ratio 2 \
  --hicache-write-policy write_through \
  --hicache-storage-backend nixl \
  --log-level debug \
  --skip-tokenizer-init

The HiCache-related flags are:

  • --enable-hierarchical-cache: Enables the hierarchical KV cache and offloading.

  • --hicache-ratio: The ratio of the host KV cache memory pool size to the device pool size. Lower this value if your machine has limited CPU memory.

  • --hicache-write-policy: The write policy (e.g., write_through for synchronous host writes).

  • --hicache-storage-backend: The host storage backend for HiCache (e.g., nixl). NIXL selects the concrete store automatically; see PR #8488.
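
To pick a value for --hicache-ratio, it can help to estimate the host pool size the ratio implies. The following is a minimal sketch, assuming the host pool is simply the ratio multiplied by the device pool size, as the flag description states. The function names, the example sizes, and the 80% headroom factor are illustrative assumptions, not part of SGLang or Dynamo.

```python
def host_pool_gib(device_pool_gib: float, hicache_ratio: float) -> float:
    """Host KV pool size implied by --hicache-ratio (host = ratio * device)."""
    if device_pool_gib <= 0 or hicache_ratio <= 0:
        raise ValueError("device pool size and ratio must be positive")
    return device_pool_gib * hicache_ratio


def fits_in_host_memory(
    device_pool_gib: float,
    hicache_ratio: float,
    free_host_gib: float,
    headroom: float = 0.8,  # assumption: leave 20% of free RAM for everything else
) -> bool:
    """Return True if the implied host pool fits within the allowed share of free RAM."""
    return host_pool_gib(device_pool_gib, hicache_ratio) <= free_host_gib * headroom


# Hypothetical numbers: a 10 GiB device KV pool with --hicache-ratio 2
# implies a 20 GiB host pool.
print(host_pool_gib(10.0, 2))
print(fits_in_host_memory(10.0, 2, free_host_gib=32.0))
```

If the check fails on your machine, lowering --hicache-ratio is the adjustment the flag description suggests.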

Then, start the frontend:

python -m dynamo.frontend --http-port 8000
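
Before sending requests, it can be useful to wait until the frontend is accepting connections. The sketch below is a generic TCP readiness check using only the Python standard library. The helper name wait_for_port is hypothetical, and an open port only indicates that the HTTP listener is bound, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(
    host: str, port: int, timeout_s: float = 60.0, interval_s: float = 0.5
) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval_s)
    return False
```

For the setup in this guide, the call would be wait_for_port("localhost", 8000), matching the --http-port value passed to the frontend.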

2) Send a single request

curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
      }
    ],
    "stream": false,
    "max_tokens": 30
  }'
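
The same request can be sent from Python, which is convenient when scripting repeated requests. This is a minimal sketch using only the standard library. The helper names build_chat_payload and send_chat are illustrative, and the endpoint and payload fields mirror the curl command above.

```python
import json
import urllib.request


def build_chat_payload(
    prompt: str, model: str = "Qwen/Qwen3-0.6B", max_tokens: int = 30
) -> dict:
    """Build the same JSON body as the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }


def send_chat(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to the chat completions endpoint and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

With the worker and frontend running, send_chat(build_chat_payload("Explain why Roger Federer is considered one of the greatest tennis players of all time")) returns the parsed response. Assuming an OpenAI-style response body, the generated text would be under ["choices"][0]["message"]["content"].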

3) (Optional) Benchmarking

Run the perf script:

bash -x $DYNAMO_ROOT/benchmarks/llm/perf.sh \
  --model Qwen/Qwen3-0.6B \
  --tensor-parallelism 1 \
  --data-parallelism 1 \
  --concurrency "2,4,8" \
  --input-sequence-length 2048 \
  --output-sequence-length 256
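
To repeat the benchmark with different settings, for example several input sequence lengths, the command can be assembled programmatically. The sketch below only uses the flags shown in the command above. The helper name build_perf_cmd and the /opt/dynamo path are hypothetical.

```python
import os
import shlex


def build_perf_cmd(
    dynamo_root: str,
    model: str = "Qwen/Qwen3-0.6B",
    concurrency=(2, 4, 8),
    isl: int = 2048,
    osl: int = 256,
    tp: int = 1,
    dp: int = 1,
) -> list[str]:
    """Return the perf.sh invocation as an argv list (no shell quoting needed)."""
    script = os.path.join(dynamo_root, "benchmarks", "llm", "perf.sh")
    return [
        "bash", "-x", script,
        "--model", model,
        "--tensor-parallelism", str(tp),
        "--data-parallelism", str(dp),
        "--concurrency", ",".join(str(c) for c in concurrency),
        "--input-sequence-length", str(isl),
        "--output-sequence-length", str(osl),
    ]


# Print the commands for a small sweep over input sequence lengths.
for isl in (1024, 2048, 4096):
    print(shlex.join(build_perf_cmd("/opt/dynamo", isl=isl)))
```

Each argv list can be passed to subprocess.run(cmd, check=True) to execute the run. Printing the commands first makes it easy to confirm the parameters before starting a long benchmark.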