GenAI-Perf#

A tool to facilitate benchmarking generative AI models, built on NVIDIA’s Performance Analyzer.

GenAI-Perf builds on Performance Analyzer’s high-performance stimulus generation to make benchmarking LLMs straightforward. Multiple endpoints are currently supported.

The GenAI-Perf workflow enables a user to

  • Generate prompts using either

    • synthetically generated data

    • the OpenOrca or CNN_DailyMail datasets

  • Transform the prompts to a format understood by the chosen endpoint

    • Triton Inference Server

    • OpenAI

  • Use Performance Analyzer to drive stimulus

  • Gather LLM relevant metrics

  • Generate reports

all from the command line.
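For example, a single command can generate synthetic prompts, drive load against a Triton-hosted model, and report LLM metrics. The model name and option values below are illustrative placeholders; adjust them for your deployment.

genai-perf -m llama-2-7b --service-kind triton --backend vllm --prompt-source synthetic --num-prompts 100 --concurrency 2 --streaming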

[!Note] GenAI-Perf is currently in early release and under rapid development. While we will try to remain consistent, command line options are subject to change until the tool reaches version 1.0. Known issues will also be documented as the tool matures.

Installation#

Triton SDK Container#

Available starting with the 24.03 release of the Triton Server SDK container.

RELEASE="24.03"

docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

genai-perf --help

From Source#

This method requires that Perf Analyzer is installed in your development environment.
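If you are unsure whether Perf Analyzer is available, a quick sanity check (assuming the binary is on your PATH) is:

perf_analyzer --help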

RELEASE="24.03"

pip install "git+https://github.com/triton-inference-server/client.git@r${RELEASE}#egg=genai-perf&subdirectory=src/c++/perf_analyzer/genai-perf"

genai-perf --help

Basic Usage#

Triton with TRT-LLM#

genai-perf -m llama-2-7b --concurrency 1 --service-kind triton --backend trtllm

Triton with vLLM#

genai-perf -m llama-2-7b --concurrency 1 --service-kind triton --backend vllm

OpenAI Chat Completions Compatible APIs#

https://platform.openai.com/docs/api-reference/chat

genai-perf -m llama-2-7b --concurrency 1 --service-kind openai --endpoint v1/chat/completions

OpenAI Completions Compatible APIs#

https://platform.openai.com/docs/api-reference/completions

genai-perf -m llama-2-7b --concurrency 1 --service-kind openai --endpoint v1/completions

Model Inputs#

GenAI-Perf supports model input prompts that are either synthetically generated or drawn from the HuggingFace OpenOrca or CNN_DailyMail datasets. The input source is selected with the --prompt-source CLI option.

When the prompt source is synthetic, you can specify the following options (see the example command after the list):

  • --num-prompts: The number of unique prompts to generate.

  • --synthetic-tokens-mean: The mean number of tokens in the synthetic input data.

  • --synthetic-tokens-stddev: The standard deviation of the number of tokens in the synthetic input data.

  • --synthetic-requested-output-tokens: The number of output tokens to ask the model to return in the response.

  • --random-seed: The seed used to generate random values.
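Putting these options together, a synthetic-input benchmark might look like the following. The model name, token counts, and seed are illustrative; choose values that match your workload.

genai-perf -m llama-2-7b --service-kind triton --backend vllm --concurrency 1 \
    --prompt-source synthetic --num-prompts 100 \
    --synthetic-tokens-mean 200 --synthetic-tokens-stddev 20 \
    --synthetic-requested-output-tokens 100 --random-seed 0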

When the prompts come from a HuggingFace dataset, you can specify the following options (see the example command after the list):

  • --num-prompts: The number of unique prompts to generate.

  • --input-dataset: The HuggingFace dataset to use for benchmarking.
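For example, to benchmark with prompts drawn from OpenOrca (the model name and concurrency are placeholders):

genai-perf -m llama-2-7b --service-kind triton --backend vllm --concurrency 1 --prompt-source dataset --input-dataset openorca --num-prompts 50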

Metrics#

GenAI-Perf collects a diverse set of metrics that capture the performance of the inference server.

| Metric | Description | Aggregations |
| --- | --- | --- |
| Time to First Token | Time between when a request is sent and when its first response is received; one value per request in the benchmark | Avg, min, max, p99, p90, p75 |
| Inter Token Latency | Time between intermediate responses for a single request, divided by the number of generated tokens of the latter response; one value per response per request in the benchmark | Avg, min, max, p99, p90, p75 |
| Request Latency | Time between when a request is sent and when its final response is received; one value per request in the benchmark | Avg, min, max, p99, p90, p75 |
| Number of Output Tokens | Total number of output tokens of a request; one value per request in the benchmark | Avg, min, max, p99, p90, p75 |
| Output Token Throughput | Total number of output tokens from the benchmark divided by the benchmark duration | None (one value per benchmark) |
| Request Throughput | Number of final responses from the benchmark divided by the benchmark duration | None (one value per benchmark) |
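The per-token metrics above (Time to First Token, Inter Token Latency) are computed from intermediate responses, so they are most meaningful when the streaming API is enabled. A minimal sketch of such a run, assuming a Triton deployment with the vLLM backend, is:

genai-perf -m llama-2-7b --service-kind triton --backend vllm --concurrency 1 --streaming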

CLI#

-h#

--help#

-v#

--verbose#

Enables verbose mode.

--version#

Prints the version and exits.

--prompt-source {dataset,synthetic}#

The source of the input prompts.

--input-dataset {openorca,cnn_dailymail}#

The HuggingFace dataset to use for prompts when prompt-source is dataset.

--synthetic-requested-output-tokens <int>#

The number of tokens to request in the output. This is used when prompt-source is synthetic to tell the LLM how many output tokens to generate in each response.

--synthetic-tokens-mean <int>#

The mean number of tokens in the synthetic input data.

--synthetic-tokens-stddev <int>#

The standard deviation of the number of tokens in the synthetic input data.

-m <str>#

--model <str>#

The name of the model to benchmark.

--num-prompts <int>#

The number of unique prompts to generate as stimulus.

--backend {trtllm,vllm}#

When using the “triton” service-kind, this is the backend of the model.

--random-seed <int>#

Seed used to generate random values.

--concurrency <int>#

Sets the concurrency value to benchmark.

-p <int>#

--measurement-interval <int>#

Indicates the time interval used for each measurement, in milliseconds. Perf Analyzer samples a time interval specified by -p and takes measurements over the requests completed within that interval.

The default value is 10000.
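For example, to take each measurement over a 5 second window instead of the 10 second default (the other options are placeholders):

genai-perf -m llama-2-7b --service-kind triton --backend trtllm --concurrency 1 --measurement-interval 5000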

--profile-export-file <file>#

Specifies the path where the perf_analyzer profile export will be generated. By default, the profile export is written to profile_export.json. The genai-perf file will be exported to <profile_export_file>_genai_perf.csv. For example, if the profile export file is profile_export.json, the genai-perf file will be exported to profile_export_genai_perf.csv.
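For instance, a run such as the following (paths and other options are illustrative) would write the Perf Analyzer export to my_profile.json and the GenAI-Perf results to my_profile_genai_perf.csv:

genai-perf -m llama-2-7b --service-kind triton --backend vllm --concurrency 1 --profile-export-file my_profile.json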

--request-rate <float>#

Sets the request rate for the load generated by Perf Analyzer.
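For example, to drive load at a fixed rate of 16 requests per second (the rate and model name are illustrative):

genai-perf -m llama-2-7b --service-kind triton --backend vllm --request-rate 16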

--service-kind {triton,openai}#

Describes the kind of service perf_analyzer will generate load for. The options are triton and openai. Note that in order to use openai, you must specify an endpoint via --endpoint.

The default value is triton.

-s <float>#

--stability-percentage <float>#

Indicates the allowed variation in latency measurements when determining whether a result is stable. A measurement is considered stable if the max/min ratio across the three most recent measurements is within the stability percentage for both inferences per second and latency.

--streaming#

Enables the use of the streaming API.

--endpoint {v1/completions,v1/chat/completions}#

Describes what endpoint to send requests to on the server. This is required when using the openai service-kind and is ignored otherwise.

-u <url>#

--url <url>#

URL of the endpoint to target for benchmarking.
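For example, to target an OpenAI-compatible server listening at a non-default address (the host and port are placeholders):

genai-perf -m llama-2-7b --concurrency 1 --service-kind openai --endpoint v1/chat/completions -u http://localhost:9000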

Known Issues#

  • GenAI-Perf can be slow to finish if a high request-rate is provided

  • Token counts may not be exact

  • Output token counts are currently much higher than the true values when running against Triton, because the input prompt is reflected back into the output