# Benchmark Dynamo with GenAI-Perf
NVIDIA Dynamo is a high-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments. This tutorial demonstrates how to use GenAI-Perf to benchmark the performance of Dynamo.
## Table of Contents

- Build Dynamo
- Benchmark Dynamo with GenAI-Perf
  - Option 1: Start the Server Using `dynamo-run`
  - Option 2: Start the Server Using `dynamo serve`
## Build Dynamo

Build Dynamo and install SGLang using the following commands:
```bash
# Clone the repository and install the build dependencies
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libssl-dev libclang-dev protobuf-compiler python3-dev cmake

# Build the Dynamo container image and launch it with the workspace mounted
./container/build.sh
./container/run.sh -it --mount-workspace

# Inside the container: build the Rust binaries and copy them into the SDK CLI
cargo build --release
mkdir -p /workspace/deploy/dynamo/sdk/src/dynamo/sdk/cli/bin
cp /workspace/target/release/http /workspace/deploy/dynamo/sdk/src/dynamo/sdk/cli/bin
cp /workspace/target/release/llmctl /workspace/deploy/dynamo/sdk/src/dynamo/sdk/cli/bin
cp /workspace/target/release/dynamo-run /workspace/deploy/dynamo/sdk/src/dynamo/sdk/cli/bin

# Install the Dynamo Python package, then set up a virtual environment with SGLang
uv pip install -e .
uv venv
source .venv/bin/activate
uv pip install pip
uv pip install sgl-kernel --force-reinstall --no-deps
uv pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```
## Benchmark Dynamo with GenAI-Perf
### Option 1: Start the Server Using `dynamo-run`

Start an OpenAI-compatible HTTP frontend backed by SGLang:

```bash
dynamo-run in=http out=sglang deepseek-ai/DeepSeek-R1-Distill-Llama-8B
```
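Before benchmarking, you can optionally verify that the server responds. A quick smoke test, assuming the frontend exposes an OpenAI-style `/v1/chat/completions` route on port 8080 (the same address GenAI-Perf targets below); the request body shown is illustrative:

```bash
# Send one short chat completion request to the local frontend
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 16
      }'
```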
#### Run GenAI-Perf
Run GenAI-Perf in another terminal:
```bash
genai-perf profile \
  -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --endpoint-type chat \
  --synthetic-input-tokens-mean 128 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --url localhost:8080 \
  --streaming \
  --request-count 10 \
  --warmup-request-count 2
```
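A single 10-request run measures one load point. To see how latency and throughput trade off under load, you can sweep concurrency; the sketch below assumes your GenAI-Perf version supports the `--concurrency` and `--artifact-dir` options (check `genai-perf profile --help`):

```bash
# Hypothetical sweep: rerun the benchmark at several concurrency levels,
# keeping each run's artifacts in a separate directory for comparison
for c in 1 2 4 8; do
  genai-perf profile \
    -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --endpoint-type chat \
    --synthetic-input-tokens-mean 128 \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean 100 \
    --output-tokens-stddev 0 \
    --url localhost:8080 \
    --streaming \
    --concurrency "$c" \
    --request-count 50 \
    --warmup-request-count 2 \
    --artifact-dir "artifacts/concurrency_$c"
done
```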
Example output:

```
                                     NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃                            Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│             Time To First Token (ms) │    39.13 │    21.23 │    41.49 │    41.48 │    41.32 │    41.29 │
│            Time To Second Token (ms) │    14.50 │    13.51 │    17.50 │    17.26 │    15.14 │    14.45 │
│                 Request Latency (ms) │ 1,799.53 │ 1,783.41 │ 1,802.41 │ 1,802.39 │ 1,802.18 │ 1,801.73 │
│             Inter Token Latency (ms) │    17.96 │    17.95 │    17.98 │    17.98 │    17.97 │    17.97 │
│     Output Token Throughput Per User │    55.67 │    55.61 │    55.71 │    55.70 │    55.70 │    55.69 │
│                    (tokens/sec/user) │          │          │          │          │          │          │
│      Output Sequence Length (tokens) │    99.00 │    99.00 │    99.00 │    99.00 │    99.00 │    99.00 │
│       Input Sequence Length (tokens) │   128.00 │   128.00 │   128.00 │   128.00 │   128.00 │   128.00 │
│ Output Token Throughput (tokens/sec) │    54.52 │      N/A │      N/A │      N/A │      N/A │      N/A │
│         Request Throughput (per sec) │     0.55 │      N/A │      N/A │      N/A │      N/A │      N/A │
│                Request Count (count) │    10.00 │      N/A │      N/A │      N/A │      N/A │      N/A │
└──────────────────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
```
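As a quick consistency check on the numbers above, aggregate Output Token Throughput should roughly equal Output Sequence Length × Request Throughput: 99 tokens × 0.55 requests/sec ≈ 54.5 tokens/sec, which matches the reported 54.52.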
### Option 2: Start the Server Using `dynamo serve`
Ensure you have NATS and etcd running before starting the server.
```bash
# Start NATS and etcd with the provided compose file
cd deploy
docker compose up -d

# Launch the aggregated serving graph defined in the LLM example
cd examples/llm
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
```
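Before benchmarking, it may be worth confirming that the supporting services and the frontend are up. A rough check; the `/v1/models` route and port 8000 are assumptions about the frontend's defaults, so adjust them to match your configuration:

```bash
# Verify the NATS and etcd containers are running
docker compose ps

# List the models registered with the OpenAI-compatible frontend (assumed port)
curl -s http://localhost:8000/v1/models
```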
#### Run GenAI-Perf
Run GenAI-Perf in another terminal:
```bash
genai-perf profile \
  -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --endpoint-type chat \
  --synthetic-input-tokens-mean 128 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 20 \
  --output-tokens-stddev 0 \
  --streaming \
  --request-count 10 \
  --warmup-request-count 2
```
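Input length strongly affects prefill cost and therefore time to first token, so repeating the run at several input sizes can be informative. A sketch using the flags from the command above (`--artifact-dir` is an assumption about your GenAI-Perf version):

```bash
# Hypothetical sweep over synthetic input lengths
for isl in 128 512 2048; do
  genai-perf profile \
    -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --endpoint-type chat \
    --synthetic-input-tokens-mean "$isl" \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean 20 \
    --output-tokens-stddev 0 \
    --streaming \
    --request-count 10 \
    --warmup-request-count 2 \
    --artifact-dir "artifacts/isl_$isl"
done
```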
Example output:

```
                               NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                            Statistic ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│             Time To First Token (ms) │ 140.18 │ 120.43 │ 264.11 │ 253.22 │ 155.25 │ 130.56 │
│            Time To Second Token (ms) │  19.32 │  17.84 │  20.58 │  20.58 │  20.54 │  20.43 │
│                 Request Latency (ms) │ 530.43 │ 510.31 │ 654.62 │ 643.89 │ 547.32 │ 521.66 │
│             Inter Token Latency (ms) │  20.54 │  20.27 │  20.82 │  20.81 │  20.66 │  20.62 │
│     Output Token Throughput Per User │  48.69 │  48.03 │  49.34 │  49.32 │  49.08 │  48.92 │
│                    (tokens/sec/user) │        │        │        │        │        │        │
│      Output Sequence Length (tokens) │  20.00 │  20.00 │  20.00 │  20.00 │  20.00 │  20.00 │
│       Input Sequence Length (tokens) │ 128.00 │ 128.00 │ 128.00 │ 128.00 │ 128.00 │ 128.00 │
│ Output Token Throughput (tokens/sec) │  37.69 │    N/A │    N/A │    N/A │    N/A │    N/A │
│         Request Throughput (per sec) │   1.88 │    N/A │    N/A │    N/A │    N/A │    N/A │
│                Request Count (count) │  10.00 │    N/A │    N/A │    N/A │    N/A │    N/A │
└──────────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
```
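Note that the two runs are not directly comparable: Option 1 requests 100 output tokens per response while this run requests only 20, which is why it shows much lower request latency and higher request throughput despite a higher time to first token.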