Using GenAI-Perf to Benchmark


NVIDIA GenAI-Perf is a client-side, LLM-focused benchmarking tool that reports key metrics such as time to first token (TTFT), inter-token latency (ITL), tokens per second (TPS), and requests per second (RPS). It supports any LLM inference service that conforms to the OpenAI API specification, a widely adopted de facto standard in the industry. This section provides a step-by-step walkthrough of using GenAI-Perf to benchmark a Llama-3 model inference engine powered by NVIDIA NIM.

NVIDIA NIM provides the easiest and quickest way to put an LLM into production. See the NIM LLM documentation to get started, beginning with the hardware requirements and setting up your NVIDIA NGC API key. For convenience, the following commands from the Getting Started Guide are provided for deploying NIM and executing inference:


# Choose a container name for bookkeeping
export CONTAINER_NAME=llama3-8b-instruct

# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/meta/${CONTAINER_NAME}:1.0.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

These examples use the Meta llama3-8b-instruct model and use that name as the name of the container. They also mount a local directory as a model cache directory. During startup, the NIM container downloads the required resources and begins serving the model behind an API endpoint. The following message indicates a successful startup.


INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
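Optionally, before sending traffic, you can confirm programmatically that the service is ready. The following is a minimal sketch that polls the NIM readiness endpoint; the /v1/health/ready route and port 8000 are assumptions based on the default deployment above, so adjust the URL if your setup differs.

import time
import requests

# Assumed readiness route for the default NIM deployment shown above.
READY_URL = "http://0.0.0.0:8000/v1/health/ready"

for attempt in range(60):
    try:
        if requests.get(READY_URL, timeout=5).status_code == 200:
            print("NIM is ready to serve requests.")
            break
    except requests.RequestException:
        pass  # the container may still be starting up
    time.sleep(10)
else:
    print("NIM did not become ready in time; check the container logs.")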

Once up and running, NIM provides an OpenAI-compatible API that you can query, as shown in the following example.


from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

prompt = "Once upon a time"

response = client.completions.create(
    model="meta/llama3-8b-instruct",
    prompt=prompt,
    max_tokens=16,
    stream=False
)

completion = response.choices[0].text
print(completion)
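Because the benchmark later in this walkthrough drives the chat endpoint with streaming enabled, you may also want to sanity-check that path first. The following is a minimal sketch using the same OpenAI client; the model name and port match the deployment above, and the prompt is only illustrative.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Stream a chat completion, printing tokens as they arrive.
stream = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "Write a limerick about GPUs."}],
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()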

Once the NIM Llama-3 inference service is running, you can set up the benchmarking tool. The easiest way to do this is with the pre-built docker container. We recommend starting the GenAI-Perf container on the same server as NIM to avoid network latency, unless you specifically want to include network latency as part of the measurement.

Note

Consult the GenAI-Perf documentation for a comprehensive getting-started guide.

Run the following commands to use the pre-built container.


export RELEASE="24.06" # recommend using latest releases in yy.mm format

docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

Once inside the container, you can start the GenAI-Perf evaluation harness as follows, which runs a warm-up load test against the NIM backend.


export INPUT_SEQUENCE_LENGTH=200
export INPUT_SEQUENCE_STD=10
export OUTPUT_SEQUENCE_LENGTH=200
export CONCURRENCY=10
export MODEL=meta/llama3-8b-instruct

genai-perf \
  -m $MODEL \
  --endpoint-type chat \
  --service-kind openai \
  --streaming \
  -u localhost:8000 \
  --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
  --synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
  --concurrency $CONCURRENCY \
  --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
  --extra-inputs max_tokens:$OUTPUT_SEQUENCE_LENGTH \
  --extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \
  --extra-inputs ignore_eos:true \
  --tokenizer meta-llama/Meta-Llama-3-8B-Instruct \
  -- \
  -v \
  --max-threads=256

This example specifies the input and output sequence lengths and a concurrency level to test. It also tells the backend to ignore the special end-of-sequence (EOS) tokens, so that the output reaches the intended length.

Note

This test uses the Llama-3 tokenizer from Hugging Face, which is a gated repository. You will need to apply for access, then log in with your Hugging Face credentials:


pip install huggingface_hub
huggingface-cli login
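Alternatively, if you prefer to authenticate from Python rather than the CLI, huggingface_hub provides a login() helper; the token shown here is a placeholder.

# Programmatic alternative to `huggingface-cli login`; the token is a placeholder.
from huggingface_hub import login

login(token="hf_your_access_token_here")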

Note

See the GenAI-Perf documentation for the full set of options and parameters.

Upon successful execution, you should see results similar to the following in the terminal:

Figure 6. Sample output from GenAI-Perf.

Typically, a benchmark is set up to sweep over a number of use cases, such as input/output length combinations, and load scenarios, such as different concurrency values. Use the following bash script to define those parameters so that GenAI-Perf runs through all the combinations.

Note

Before running a benchmarking sweep, it is recommended to run a warm-up test. In this walkthrough, that was done by the single GenAI-Perf run in the previous step.


declare -A useCases

# Populate the array with use case descriptions and their specified input/output lengths
useCases["Translation"]="200/200"
useCases["Text classification"]="200/5"
useCases["Text summary"]="1000/200"

# Function to execute genai-perf with the input/output lengths as arguments
runBenchmark() {
    local description="$1"
    local lengths="${useCases[$description]}"
    IFS='/' read -r inputLength outputLength <<< "$lengths"

    echo "Running genai-perf for $description with input length $inputLength and output length $outputLength"

    # Runs
    for concurrency in 1 2 5 10 50 100 250; do
        local INPUT_SEQUENCE_LENGTH=$inputLength
        local INPUT_SEQUENCE_STD=0
        local OUTPUT_SEQUENCE_LENGTH=$outputLength
        local CONCURRENCY=$concurrency
        local MODEL=meta/llama3-8b-instruct

        genai-perf \
            -m $MODEL \
            --endpoint-type chat \
            --service-kind openai \
            --streaming \
            -u localhost:8000 \
            --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
            --synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
            --concurrency $CONCURRENCY \
            --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
            --extra-inputs max_tokens:$OUTPUT_SEQUENCE_LENGTH \
            --extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \
            --extra-inputs ignore_eos:true \
            --tokenizer meta-llama/Meta-Llama-3-8B-Instruct \
            --measurement-interval 10000 \
            --profile-export-file ${INPUT_SEQUENCE_LENGTH}_${OUTPUT_SEQUENCE_LENGTH}.json \
            -- \
            -v \
            --max-threads=256
    done
}

# Iterate over all defined use cases and run the benchmark script for each
for description in "${!useCases[@]}"; do
    runBenchmark "$description"
done

Note

The “--measurement-interval 10000” argument sets the time interval used for each measurement, in milliseconds. GenAI-Perf measures the requests that finish within the specified interval, so choose a value large enough for several requests to finish. For larger models (for example, Llama-3 70B) and higher concurrency (for example, 250), choose a larger value (for example, 100000, which is 100 seconds).

When the tests complete, GenAI-Perf generates structured outputs in a default directory named “artifacts”, organized by model name, concurrency, and input/output length. Your results should look similar to the following.


artifacts
├── meta_llama3-8b-instruct-openai-chat-concurrency1
│   ├── 200_200.csv
│   ├── 200_200_genai_perf.csv
│   ├── 200_5.csv
│   ├── 200_5_genai_perf.csv
│   ├── all_data.gzip
│   ├── llm_inputs.json
│   ├── plots
│   └── profile_export_genai_perf.json
├── meta_llama3-8b-instruct-openai-chat-concurrency10
│   ├── 200_200.csv
│   ├── 200_200_genai_perf.csv
│   ├── 200_5.csv
│   ├── 200_5_genai_perf.csv
│   ├── all_data.gzip
│   ├── llm_inputs.json
│   ├── plots
│   └── profile_export_genai_perf.json
├── meta_llama3-8b-instruct-openai-chat-concurrency100
│   ├── 200_200.csv
…

The “*genai_perf.csv” files contain the main benchmarking results. Use the following Python code snippet to parse a file for a given use case into a pandas DataFrame.


import pandas as pd
import io

def parse_data(file_path):
    # Create a StringIO buffer
    buffer = io.StringIO()
    with open(file_path, 'rt') as file:
        for i, line in enumerate(file):
            if i not in [6, 7]:
                buffer.write(line)

    # Make sure to reset the buffer's position to the beginning before reading
    buffer.seek(0)

    # Read the buffer into a pandas DataFrame
    df = pd.read_csv(buffer)
    return df
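As a quick sanity check, you can parse one of the generated files and inspect its rows and columns; the path below is illustrative and should be adjusted to match one of your artifact directories.

# Illustrative usage of parse_data(); adjust the path to match your artifacts.
df = parse_data(
    "artifacts/meta_llama3-8b-instruct-openai-chat-concurrency10/200_200_genai_perf.csv"
)
print(df.head())     # metric rows, e.g. time to first token, inter-token latency
print(df.columns)    # statistic columns, e.g. avg and the percentile columns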

You can also read the tokens-per-second and TTFT metrics across all concurrencies for a given use case using the following Python snippet.


import os

root_dir = "./artifacts"
directory_prefix = "meta_llama3-8b-instruct-openai-chat-concurrency"

concurrencies = [1, 2, 5, 10, 50, 100, 250]

TPS = []
TTFT = []
for con in concurrencies:
    df = parse_data(os.path.join(root_dir, directory_prefix + str(con), "200_200_genai_perf.csv"))
    TPS.append(df.iloc[5]['avg'])
    TTFT.append(df.iloc[0]['avg'] / 1e9)
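The positional lookups above (iloc[0] for TTFT and iloc[5] for output token throughput) depend on the row order of the genai_perf CSV, which may change between GenAI-Perf releases. A more defensive sketch selects rows by metric name instead; the label strings below are assumptions and should be verified against the first column of your CSV.

# Look up metric rows by name rather than position. The label strings are
# assumptions -- verify them against the first column of your *_genai_perf.csv.
def metric_avg(df, name_substring):
    first_col = df.columns[0]
    row = df[df[first_col].str.contains(name_substring, case=False, na=False)]
    return row.iloc[0]["avg"]

df = parse_data(os.path.join(root_dir, directory_prefix + "10", "200_200_genai_perf.csv"))
ttft_s = metric_avg(df, "Time to First Token") / 1e9   # this CSV reports nanoseconds
tps = metric_avg(df, "Output Token Throughput")        # lands in "avg" because parse_data merges the CSV sections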

Finally, you can plot and analyze the latency-throughput curve from the collected data with the code below. Each data point corresponds to a concurrency value.


import plotly.express as px

fig = px.line(x=TTFT, y=TPS, text=concurrencies)
fig.update_layout(xaxis_title="Single User: time to first token (s)", yaxis_title="Total System: tokens/s")
fig.show()

The resulting plot from the GenAI-Perf measurement data looks like the following.

Figure 7. Latency-throughput curve plot using data generated by GenAI-Perf.

The previous plot shows TTFT on the x-axis, total system throughput on the y-axis, and the concurrency value annotated on each dot. There are two ways to use the plot:

  1. An LLM application owner with a latency budget (the maximum acceptable TTFT) uses that value for x and reads off the matching y value and concurrency. This shows the highest throughput that can be achieved within that latency limit, and the concurrency value needed to reach it (see the sketch after this list).

  2. An LLM application owner can use a concurrency value to locate the corresponding dot on the graph. Its x and y values show the latency and throughput at that concurrency level.
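As a concrete illustration of the first reading, the following sketch picks, from the TPS and TTFT lists collected earlier, the concurrency with the highest throughput whose average TTFT stays within a hypothetical 0.5-second budget.

# Illustrative only: find the highest-throughput concurrency whose average TTFT
# stays within a hypothetical 0.5 s latency budget, using the lists built above.
TTFT_BUDGET_S = 0.5

best = max(
    (
        (tps, ttft, con)
        for tps, ttft, con in zip(TPS, TTFT, concurrencies)
        if ttft <= TTFT_BUDGET_S
    ),
    default=None,
)
if best:
    print(f"Concurrency {best[2]}: {best[0]:.1f} tokens/s at TTFT {best[1]:.3f} s")
else:
    print("No measured concurrency meets the latency budget.")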

The plot also shows the concurrency values at which latency grows quickly with little or no throughput gain. For example, in the plot above, concurrency=100 is one such value.

Similar plots can use ITL, e2e_latency, or TPS_per_user as the x-axis, showing the trade-off between total system throughput and individual user latency.
