Profile Large Language Models with GenAI-Perf

This tutorial demonstrates how to use GenAI-Perf to measure the performance of inference endpoints that are widely used across the industry, such as the KServe inference protocol and the OpenAI API.

Table of Contents

- Profile GPT-2 running on Triton + TensorRT-LLM
- Profile GPT-2 running on Triton + vLLM
- Profile Zephyr-7B-Beta running on OpenAI Chat API-Compatible Server
- Profile GPT-2 running on OpenAI Completions API-Compatible Server

Profile GPT-2 running on Triton + TensorRT-LLM

You can follow the quickstart guide in the Triton CLI GitHub repository to serve GPT-2 on Triton Inference Server with the TensorRT-LLM backend.
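
The exact container image, CLI version, and flags are pinned in that quickstart; the overall flow looks roughly like the sketch below, where the <yy.mm> tag is a placeholder for the release you use.

# Start a Triton container with TensorRT-LLM support (placeholder tag; see the quickstart for the tested one)
docker run -ti --gpus all --network=host nvcr.io/nvidia/tritonserver:<yy.mm>-trtllm-python-py3

# Inside the container: install the Triton CLI, then import and serve GPT-2
pip install git+https://github.com/triton-inference-server/triton_cli.git
triton import -m gpt2 --backend tensorrtllm
triton start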

Run GenAI-Perf
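
GenAI-Perf is included in the Triton Inference Server SDK container, so no separate installation is needed. If the container is not already running, a command along these lines starts it (the <yy.mm> tag is a placeholder; pick a release that matches your Triton server):

# Launch the SDK container, which ships with GenAI-Perf preinstalled
docker run -it --net=host nvcr.io/nvidia/tritonserver:<yy.mm>-py3-sdk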

Run GenAI-Perf inside the Triton Inference Server SDK container:

genai-perf profile \
  -m gpt2 \
  --tokenizer gpt2 \
  --service-kind triton \
  --backend tensorrtllm \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --output-tokens-mean-deterministic \
  --streaming \
  --request-count 50 \
  --warmup-request-count 10

Example output:

                              NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                         Statistic ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│          Time to first token (ms) │  13.68 │  11.07 │  21.50 │  18.81 │  14.29 │  13.97 │
│          Inter token latency (ms) │   1.86 │   1.28 │   2.11 │   2.11 │   2.01 │   1.95 │
│              Request latency (ms) │ 203.70 │ 180.33 │ 228.30 │ 225.45 │ 216.48 │ 211.72 │
│            Output sequence length │ 103.46 │  95.00 │ 134.00 │ 122.96 │ 108.00 │ 104.75 │
│             Input sequence length │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │
│ Output token throughput (per sec) │ 504.02 │    N/A │    N/A │    N/A │    N/A │    N/A │
│      Request throughput (per sec) │   4.87 │    N/A │    N/A │    N/A │    N/A │    N/A │
└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘

Profile GPT-2 running on Triton + vLLM

You can follow the quickstart guide in the Triton CLI GitHub repository to serve GPT-2 on Triton Inference Server with the vLLM backend.
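
The flow mirrors the TensorRT-LLM setup above, swapping in a vLLM-enabled Triton image and the vllm backend (tags and versions are placeholders; the quickstart pins the tested ones):

# Start a Triton container with vLLM support (placeholder tag; see the quickstart)
docker run -ti --gpus all --network=host nvcr.io/nvidia/tritonserver:<yy.mm>-vllm-python-py3

# Inside the container: install the Triton CLI, then import and serve GPT-2
pip install git+https://github.com/triton-inference-server/triton_cli.git
triton import -m gpt2 --backend vllm
triton start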

Run GenAI-Perf

Run GenAI-Perf inside the Triton Inference Server SDK container:

genai-perf profile \
  -m gpt2 \
  --tokenizer gpt2 \
  --service-kind triton \
  --backend vllm \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --output-tokens-mean-deterministic \
  --streaming \
  --request-count 50 \
  --warmup-request-count 10

Example output:

                              NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                         Statistic ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│          Time to first token (ms) │  22.04 │  14.00 │  26.02 │  25.73 │  24.41 │  24.06 │
│          Inter token latency (ms) │   4.58 │   3.45 │   5.34 │   5.33 │   5.11 │   4.86 │
│              Request latency (ms) │ 542.48 │ 468.10 │ 622.39 │ 615.67 │ 584.73 │ 555.90 │
│            Output sequence length │ 115.15 │ 103.00 │ 143.00 │ 138.00 │ 120.00 │ 118.50 │
│             Input sequence length │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │
│ Output token throughput (per sec) │ 212.04 │    N/A │    N/A │    N/A │    N/A │    N/A │
│      Request throughput (per sec) │   1.84 │    N/A │    N/A │    N/A │    N/A │    N/A │
└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘

Profile Zephyr-7B-Beta running on OpenAI Chat API-Compatible Server

Serve the model on the vLLM server with the OpenAI Chat Completions API endpoint:

docker run -it --net=host --gpus=all vllm/vllm-openai:latest --model HuggingFaceH4/zephyr-7b-beta --dtype float16
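
Before profiling, it is worth confirming that the endpoint responds. vLLM's OpenAI-compatible server listens on port 8000 by default, so a quick smoke test looks something like this:

# Send a single chat request to verify the server is up (default port 8000)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "HuggingFaceH4/zephyr-7b-beta", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 16}'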

Run GenAI-Perf

Run GenAI-Perf inside the Triton Inference Server SDK container:

genai-perf profile \
  -m HuggingFaceH4/zephyr-7b-beta \
  --tokenizer HuggingFaceH4/zephyr-7b-beta \
  --service-kind openai \
  --endpoint-type chat \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --streaming \
  --request-count 50 \
  --warmup-request-count 10

Example output:

                                    NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃                         Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│          Time to first token (ms) │    37.99 │    32.65 │    45.89 │    45.85 │    44.69 │    37.49 │
│          Inter token latency (ms) │    19.19 │    18.78 │    20.11 │    20.00 │    19.39 │    19.23 │
│              Request latency (ms) │ 1,915.41 │ 1,574.73 │ 2,027.20 │ 2,016.50 │ 1,961.22 │ 1,931.45 │
│            Output sequence length │    98.83 │    81.00 │   101.00 │   100.83 │   100.00 │   100.00 │
│             Input sequence length │   200.00 │   200.00 │   200.00 │   200.00 │   200.00 │   200.00 │
│ Output token throughput (per sec) │    51.55 │      N/A │      N/A │      N/A │      N/A │      N/A │
│      Request throughput (per sec) │     0.52 │      N/A │      N/A │      N/A │      N/A │      N/A │
└───────────────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘

Profile GPT-2 running on OpenAI Completions API-Compatible Server

Serve the model on the vLLM server with the OpenAI Completions API endpoint:

docker run -it --net=host --gpus=all vllm/vllm-openai:latest --model gpt2 --dtype float16 --max-model-len 1024
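
As before, a quick check that the server is up and the model is registered can save a failed profiling run; the default port is 8000:

# List the served models, then send a single completion request
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt2", "prompt": "Hello, my name is", "max_tokens": 16}'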

Run GenAI-Perf

Run GenAI-Perf inside the Triton Inference Server SDK container:

genai-perf profile \
  -m gpt2 \
  --tokenizer gpt2 \
  --service-kind openai \
  --endpoint-type completions \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --request-count 50 \
  --warmup-request-count 10

Example output:

                             NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                         Statistic ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│              Request latency (ms) │ 437.85 │ 328.10 │ 497.05 │ 495.28 │ 485.68 │ 460.91 │
│            Output sequence length │ 112.66 │  83.00 │ 123.00 │ 122.69 │ 119.90 │ 116.25 │
│             Input sequence length │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │
│ Output token throughput (per sec) │ 257.21 │    N/A │    N/A │    N/A │    N/A │    N/A │
│      Request throughput (per sec) │   2.28 │    N/A │    N/A │    N/A │    N/A │    N/A │
└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘