Using AIPerf to Benchmark#
NVIDIA AIPerf is a client-side generative AI benchmarking tool that provides key metrics such as time to first token (TTFT), inter-token latency (ITL), tokens per second (TPS), requests per second (RPS), and more. It supports any LLM inference service that conforms to the OpenAI API specification, a widely accepted de facto industry standard. This section provides a step-by-step walkthrough of using AIPerf to benchmark a Llama-3 inference engine powered by NVIDIA NIM.
Step 1. Setting Up an OpenAI-Compatible Llama-3 Inference Service with NVIDIA NIM#
NVIDIA NIM provides the easiest and quickest way to put an LLM into production. See the NIM LLM documentation to get started, beginning with the hardware requirements and setting up your NVIDIA NGC API key.
For convenience, the following commands from the Getting Started Guide deploy NIM and run inference:
# Set environment variables
export NGC_API_KEY=<value>
# Choose a container name for bookkeeping
export CONTAINER_NAME=llama3-8b-instruct
# Set the repository from the registry output (for example, nim/meta/llama-3.1-8b-instruct)
export REPOSITORY=nim/meta/llama-3.1-8b-instruct
# Set the tag to latest or a specific version (for example, 1.13.1)
export TAG=latest
# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/${REPOSITORY}:${TAG}"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
This example serves the Meta Llama 3.1 8B Instruct model and uses llama3-8b-instruct as the container name. It also mounts a local directory as a cache for the downloaded model files. During startup, the NIM container downloads the required resources and begins serving the model behind an API endpoint. The following message indicates a successful startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Once up and running, NIM provides an OpenAI-compatible API that you can query, as shown in the following example.
# Run `pip install openai` before running this Python code
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
prompt = "Once upon a time"
response = client.completions.create(
    model="meta-llama/llama-3.1-8b-instruct",
    prompt=prompt,
    max_tokens=16,
    stream=False
)
completion = response.choices[0].text
print(completion)
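Because the benchmarks below exercise the chat endpoint with streaming enabled, you may also want to sanity-check that path directly. The following is a minimal sketch that reuses the same client and model name as above and simply prints tokens as they arrive; the prompt text is only an example.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
# Request a streaming chat completion and print tokens as they arrive
stream = client.chat.completions.create(
    model="meta-llama/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=32,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()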
Note
In our extensive benchmarking tests, we have observed that specifying additional Docker flags can further improve inference performance: either --security-opt seccomp=unconfined (which disables the Seccomp security profile) or --privileged (which grants the container almost all the capabilities of the host machine, including direct access to hardware, device files, and certain kernel functionality). The improvement is up to 5% with the NIM TensorRT-LLM v0.10.0 backend, and up to 20% with OSS vLLM (tested on v0.4.3) or the NIM vLLM backend (tested on NIM 1.0.0). This has been verified on DGX A100 and H100 systems, but is potentially applicable to other GPU systems as well. Disabling Seccomp or using privileged mode eliminates some of the overhead associated with containerization security measures, allowing NIM and vLLM to use resources more efficiently. However, while there are performance benefits, these flags should be used with utmost diligence because of the elevated security risk. See the Docker documentation for further details.
Step 2. Setting Up AIPerf and Warming Up: Benchmarking a Single Use Case#
Once the NIM Llama-3 inference service is running, you can set up a benchmarking tool. The easiest way to do this is with the pre-built Docker container. We recommend starting an AIPerf container on the same server as NIM to avoid network latency, unless you specifically want to factor network latency into the measurement.
Note
Consult the AIPerf documentation for a comprehensive getting-started guide.
Run the following commands to use the pre-built container.
export RELEASE="25.05" # recommend using latest releases in yy.mm format
export WORKDIR=<YOUR_AIPERF_WORKING_DIRECTORY>
docker run -it --net=host --gpus=all -v $WORKDIR:/workdir nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
Once inside the container, install AIPerf with:
pip install aiperf
Note
This test uses the Llama-3 tokenizer from Hugging Face, which is a gated repository. You will need to apply for access, then log in with your Hugging Face credentials.
pip install huggingface_hub
huggingface-cli login
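If you prefer to authenticate from Python rather than the CLI, the huggingface_hub package exposes an equivalent login call. This is a small alternative sketch that prompts for the same access token.
from huggingface_hub import login

# Prompts for a Hugging Face access token for gated-repository access
login()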
After that, you can start the AIPerf evaluation harness as follows, which runs a warm-up load test on the NIM backend.
export INPUT_SEQUENCE_LENGTH=200
export INPUT_SEQUENCE_STD=10
export OUTPUT_SEQUENCE_LENGTH=200
export CONCURRENCY=10
export REQUEST_COUNT=$(($CONCURRENCY * 3))
export MODEL=meta-llama/llama-3.1-8b-instruct
cd /workdir
aiperf profile \
-m $MODEL \
--endpoint-type chat \
--streaming \
-u localhost:8000 \
--synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
--synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
--concurrency $CONCURRENCY \
--request-count $REQUEST_COUNT \
--output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
--extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \
--extra-inputs ignore_eos:true \
--tokenizer meta-llama/llama-3.1-8b-instruct \
--profile-export-file ${INPUT_SEQUENCE_LENGTH}_${OUTPUT_SEQUENCE_LENGTH}.json
This example specifies the input and output sequence lengths and a concurrency level to test. It also tells the backend to ignore the special end-of-sequence (EOS) token, so that the output reaches the intended length.
Note
See AIPerf documentation for the full set of options and parameters.
Upon successful execution, you should see results similar to the following in the terminal:
Figure 6. Sample output by AIPerf.#
Step 3. Sweeping through a Number of Use Cases#
Typically, a benchmark is set up to sweep over a number of use cases, such as input/output length combinations, and load scenarios, such as different concurrency values. Use the following bash script to define these parameters so that AIPerf runs through all of the combinations.
Note
Before doing a benchmarking sweep, it is recommended to run a warm-up test. In our case, this was performed in Step 2.
declare -A useCases
# Populate the array with use case descriptions and their specified input/output lengths
useCases["Translation"]="200/200"
useCases["Text classification"]="200/5"
useCases["Text summary"]="1000/200"
# Function to execute AIPerf with the input/output lengths as arguments
runBenchmark() {
    local description="$1"
    local lengths="${useCases[$description]}"
    IFS='/' read -r inputLength outputLength <<< "$lengths"
    echo "Running AIPerf for $description with input length $inputLength and output length $outputLength"
    # Sweep over concurrency levels
    for concurrency in 1 2 5 10 50 100 250; do
        local INPUT_SEQUENCE_LENGTH=$inputLength
        local INPUT_SEQUENCE_STD=0
        local OUTPUT_SEQUENCE_LENGTH=$outputLength
        local CONCURRENCY=$concurrency
        local REQUEST_COUNT=$(($CONCURRENCY * 3))
        local MODEL=meta-llama/llama-3.1-8b-instruct
        aiperf profile \
            -m $MODEL \
            --endpoint-type chat \
            --streaming \
            -u localhost:8000 \
            --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
            --synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
            --concurrency $CONCURRENCY \
            --request-count $REQUEST_COUNT \
            --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
            --extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \
            --extra-inputs ignore_eos:true \
            --tokenizer meta-llama/llama-3.1-8b-instruct \
            --artifact-dir artifact/ISL${INPUT_SEQUENCE_LENGTH}_OSL${OUTPUT_SEQUENCE_LENGTH}/CON${CONCURRENCY}
    done
}
# Iterate over all defined use cases and run the benchmark for each
for description in "${!useCases[@]}"; do
    runBenchmark "$description"
done
Save this script in your working directory, for example as /workdir/benchmark.sh. You can then execute it with the following commands.
cd /workdir
bash benchmark.sh
Note
The --request-count parameter specifies the number of requests for the measurement, which we set to 3x the concurrency level to obtain a stable measurement. However, high concurrency on a large model with a large ISL/OSL scenario can result in a significantly longer benchmarking time.
Step 4. Analyzing the Output#
When the tests are complete, AIPerf generates structured outputs in a directory named artifact under your mounted working directory (/workdir in these examples), organized by input/output length and concurrency. Your results should resemble the following.
/workdir/artifact
└── ISL200_OSL200
├── CON1
│ ├── logs
│ │ └── aiperf.log
│ ├── profile_export_aiperf.csv
│ └── profile_export_aiperf.json
├── CON10
│ ├── logs
│ │ └── aiperf.log
│ ├── profile_export_aiperf.csv
│ └── profile_export_aiperf.json
├── CON100
│ ├── logs
│ │ └── aiperf.log
│ ├── profile_export_aiperf.csv
│ └── profile_export_aiperf.json
├── CON2
│ ├── logs
│ │ └── aiperf.log
│ ├── profile_export_aiperf.csv
│ └── profile_export_aiperf.json
…
The profile_export_aiperf.json files contain the main benchmarking results. Use the following Python code snippet to parse a file for a given use case into a JSON object.
import json
with open('artifact/ISL200_OSL5/CON1/profile_export_aiperf.json', 'r') as f:
    data = json.load(f)
You can also read all the tokens-per-second and TTFT metrics across all concurrencies for a given use case using the following Python script.
import json

ISL = 200
OSL = 5
concurrencies = [1, 2, 5, 10, 50, 100, 250]
TPS = []
TTFT = []
for con in concurrencies:
    with open(f'artifact/ISL{ISL}_OSL{OSL}/CON{con}/profile_export_aiperf.json', 'r') as f:
        data = json.load(f)
    TPS.append(data['records']['output_token_throughput']['avg'])
    TTFT.append(data['records']['ttft']['avg'])
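Before plotting, it can help to print the collected values as a quick summary table. This is a simple sketch that only uses the concurrencies, TTFT, and TPS lists built above; the values are reported in whatever units AIPerf writes to the JSON.
# Print one row per concurrency level
print(f"{'concurrency':>12} {'avg TTFT':>12} {'avg tokens/s':>14}")
for con, ttft, tps in zip(concurrencies, TTFT, TPS):
    print(f"{con:>12} {ttft:>12.2f} {tps:>14.2f}")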
Finally, you can plot and analyze the latency-throughput curve from the collected data with the code below. Here, each data point corresponds to a concurrency value.
import matplotlib.pyplot as plt

# Assumes the TTFT, TPS, and concurrencies lists from the previous step
plt.figure()

# Plot the latency-throughput curve, one point per concurrency level
plt.plot(TTFT, TPS, marker='o')

# Annotate each point with its concurrency value
for i in range(len(TTFT)):
    plt.text(TTFT[i], TPS[i], str(concurrencies[i]),
             ha='center', va='bottom', fontsize=9)

plt.xlabel("Single user: time to first token (s)")
plt.ylabel("Total system: tokens/s")
plt.grid(True)
plt.show()
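If you are running this inside the container or on a headless server, plt.show() may not display a window; saving the figure to a file is a simple alternative.
# Save the figure to a file instead of (or in addition to) displaying it
plt.savefig("latency_throughput.png", dpi=150, bbox_inches="tight")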
The resulting plot using AIPerf measurement data looks like the following.
Figure 7: Latency-throughput curve plot using data generated by AIPerf.#
Step 5. Interpreting the Results#
The previous plot shows TTFT on the x-axis, total system throughput on the y-axis, and the concurrency value annotated on each point. There are two ways to use the plot:
An LLM application owner with a latency budget, that is, a maximum acceptable TTFT, can take that value on the x-axis and read off the matching y value and concurrency. This gives the highest throughput achievable within that latency limit and the concurrency value at which it is reached (this lookup can also be scripted, as in the sketch below).
An LLM application owner can instead start from a concurrency value and locate the corresponding point on the graph. Its x and y values show the latency and throughput at that concurrency level.
The plot also shows the concurrencies where latency grows quickly with little or no throughput gain. For example, in the plot above, concurrency=100 is one such value.
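The first reading of the plot can also be automated. The following is a minimal sketch that picks the highest-throughput concurrency whose average TTFT stays within a latency budget; the TTFT_BUDGET value (and the unit it is expressed in) is a placeholder you should adjust to match your data.
# Hypothetical latency budget, in the same unit as the TTFT values collected earlier
TTFT_BUDGET = 2.0

# Keep only the operating points that satisfy the budget
within_budget = [
    (tps, con, ttft)
    for ttft, tps, con in zip(TTFT, TPS, concurrencies)
    if ttft <= TTFT_BUDGET
]

if within_budget:
    best_tps, best_con, best_ttft = max(within_budget)
    print(f"Best within budget: concurrency={best_con}, "
          f"throughput={best_tps:.1f} tokens/s, TTFT={best_ttft:.2f}")
else:
    print("No concurrency level meets the TTFT budget")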
Similar plots can use ITL, e2e_latency, or TPS_per_user as the x-axis, showing the trade-off between total system throughput and individual user latency.
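To see which other per-request metrics your AIPerf version exports for such plots, you can inspect the record keys of any result file. This is a quick sketch; the exact metric names may vary by version.
import json

# List the metric names available in one result file
with open('artifact/ISL200_OSL200/CON1/profile_export_aiperf.json', 'r') as f:
    data = json.load(f)
print(sorted(data['records'].keys()))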