NIM for LLM Benchmarking Guide

Using GenAI-Perf to Benchmark

NVIDIA GenAI-Perf is a client-side LLM-focused benchmarking tool, providing key metrics such as TTFT, ITL, TPS, RPS and more. It supports any LLM inference service conforming to the OpenAI API specification, a widely accepted de facto standard in the industry. This section includes a step-by-step walkthrough, using GenAI-Perf to benchmark a Llama-3 model inference engine, powered by NVIDIA NIM.

Use the following command to list the available NIMs, in CSV format.


ngc registry image list --format_type csv nvcr.io/nim/*

This command should produce output in the following format:


Name,Repository,Latest Tag,Image Size,Updated Date,Permission,Signed Tag?,Access Type,Associated Products
<name1>,<repository1>,<latest tag1>,<image size1>,<updated date1>,<permission1>,<signed tag?1>,<access type1>,<associated products1>
...
<nameN>,<repositoryN>,<latest tagN>,<image sizeN>,<updated dateN>,<permissionN>,<signed tag?N>,<access typeN>,<associated productsN>

Use the Repository field when you call the docker run command.
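
If you save that listing to a file, the following minimal sketch shows one way to pull out the Repository field with Python's csv module; the file name nim_images.csv is a hypothetical placeholder for wherever you saved the command output.

import csv

# Hypothetical file containing the CSV output of the ngc registry command above
with open("nim_images.csv", newline="") as f:
    for row in csv.DictReader(f):
        # The Repository value is what you pass to docker run
        print(row["Name"], row["Repository"], row["Latest Tag"])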

NVIDIA NIM provides the easiest and quickest way to put an LLM into production. See the NIM LLM documentation to get started, beginning with hardware requirements and setting your NVIDIA NGC API key. For convenience, the following commands from the Getting Started Guide are provided for deploying NIM and executing inference:


## Set Environment Variables
export NGC_API_KEY=<value>

# Choose a container name for bookkeeping
export CONTAINER_NAME=llama3-8b-instruct

# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/${REPOSITORY}:latest"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

These examples use the Meta llama3-8b-instruct model, and use that name as the name of the container. The examples refer to and mount a local directory as a cache directory. During startup, the NIM container downloads the required resources and begins serving the model behind an API endpoint. The following message indicates a successful startup.


INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
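
Before sending inference requests, you can optionally confirm the endpoint is reachable. The following is a minimal sketch that assumes the OpenAI-compatible model listing exposed at /v1/models:

from openai import OpenAI

# Point the OpenAI client at the local NIM endpoint; the API key is not checked
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Listing the available models is a lightweight way to confirm the service is up
for model in client.models.list():
    print(model.id)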

Once up and running, NIM provides an OpenAI-compatible API that you can query, as shown in the following example.


from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

prompt = "Once upon a time"

response = client.completions.create(
    model="meta/llama3-8b-instruct",
    prompt=prompt,
    max_tokens=16,
    stream=False
)

completion = response.choices[0].text
print(completion)
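
Because the GenAI-Perf runs later in this guide target the chat endpoint with streaming enabled, it can also be useful to exercise that path manually. The following is a minimal sketch assuming the same endpoint and model name:

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Stream a chat completion, the same endpoint type GenAI-Perf uses below
stream = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "Once upon a time"}],
    max_tokens=16,
    stream=True,
)

# Print tokens as they arrive
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()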

Note

In our extensive benchmarking tests, we have observed that specifying additional Docker flags, either --security-opt seccomp=unconfined (which disables the Seccomp security profile) or --privileged (which grants the container almost all the capabilities of the host machine, including direct access to hardware, device files, and certain kernel functionalities), can further improve inference performance: up to 5% with the NIM TensorRT-LLM v0.10.0 backend, and up to 20% with the OSS vLLM (tested on v0.4.3) or the NIM vLLM backend (tested on NIM 1.0.0). This has been verified on DGX A100 and H100 systems, but is potentially applicable to other GPU systems as well. Disabling Seccomp or using privileged mode can eliminate some of the overhead associated with containerization security measures, allowing NIM and vLLM to utilize resources more efficiently. However, while there are performance benefits, these flags should be used with utmost diligence due to the elevated security risk. See the Docker documentation for further details.

Once the NIM Llama-3 inference service is running, you can set up a benchmarking tool. The easiest way to do this is with a pre-built Docker container. We recommend starting a GenAI-Perf container on the same server as NIM to avoid network latency, unless you specifically want to factor network latency into the measurement.

Note

Consult GenAI-Perf documentation for a comprehensive guide for getting started.

Run the following commands to use the pre-built container.


export RELEASE="24.06" # recommend using latest releases in yy.mm format
export WORKDIR=<YOUR_GENAI_PERF_WORKING_DIRECTORY>

docker run -it --net=host --gpus=all -v $WORKDIR:/workdir nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

Once inside the container, you can start the GenAI-Perf evaluation harness as follows, which runs a warm-up load test on the NIM backend.


export INPUT_SEQUENCE_LENGTH=200
export INPUT_SEQUENCE_STD=10
export OUTPUT_SEQUENCE_LENGTH=200
export CONCURRENCY=10
export MODEL=meta/llama3-8b-instruct

cd /workdir

genai-perf \
  -m $MODEL \
  --endpoint-type chat \
  --service-kind openai \
  --streaming \
  -u localhost:8000 \
  --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
  --synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
  --concurrency $CONCURRENCY \
  --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
  --extra-inputs max_tokens:$OUTPUT_SEQUENCE_LENGTH \
  --extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \
  --extra-inputs ignore_eos:true \
  --tokenizer meta-llama/Meta-Llama-3-8B-Instruct \
  -- \
  -v \
  --max-threads=256

This example specifies the input and output sequence lengths and a concurrency level to test. It also tells the backend to ignore the special EOS tokens, so that the output reaches the intended length.
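
If you want to sanity-check that responses reach the requested length, one option is to count tokens with the same tokenizer that GenAI-Perf is pointed at. The following is a small sketch, assuming access to the gated Hugging Face repository described in the note below and an example completion string:

from transformers import AutoTokenizer

# Same tokenizer that GenAI-Perf uses via --tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# 'completion' stands in for a response returned by the inference service
completion = "Once upon a time there was a kingdom by the sea"
print(len(tokenizer.encode(completion, add_special_tokens=False)), "tokens")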

Note

This test will use the Llama-3 tokenizer from Hugging Face, which is a gated repository. You will need to apply for access, then log in with your Hugging Face credentials.


pip install huggingface_hub
huggingface-cli login

Note

See GenAI-perf documentation for the full set of options and parameters.

Upon successful execution, you should see results similar to the following in the terminal:

Figure 6. Sample output by GenAI-Perf.

Typically with benchmarking, a test would be set up to sweep over a number of use cases, such as input/output length combinations, and load scenarios, such as different concurrency values. Use the following bash script to define the parameters so that GenAI-Perf runs through all the combinations.

Note

Before doing a benchmarking sweep, it is recommended to run a warm-up test. In this case, that was done by the warm-up run earlier in this section.


declare -A useCases

# Populate the array with use case descriptions and their specified input/output lengths
useCases["Translation"]="200/200"
useCases["Text classification"]="200/5"
useCases["Text summary"]="1000/200"

# Function to execute genai-perf with the input/output lengths as arguments
runBenchmark() {
    local description="$1"
    local lengths="${useCases[$description]}"
    IFS='/' read -r inputLength outputLength <<< "$lengths"

    echo "Running genai-perf for $description with input length $inputLength and output length $outputLength"

    # Run the benchmark across concurrency levels
    for concurrency in 1 2 5 10 50 100 250; do
        local INPUT_SEQUENCE_LENGTH=$inputLength
        local INPUT_SEQUENCE_STD=0
        local OUTPUT_SEQUENCE_LENGTH=$outputLength
        local CONCURRENCY=$concurrency
        local MODEL=meta/llama3-8b-instruct

        genai-perf \
            -m $MODEL \
            --endpoint-type chat \
            --service-kind openai \
            --streaming \
            -u localhost:8000 \
            --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
            --synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
            --concurrency $CONCURRENCY \
            --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
            --extra-inputs max_tokens:$OUTPUT_SEQUENCE_LENGTH \
            --extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \
            --extra-inputs ignore_eos:true \
            --tokenizer meta-llama/Meta-Llama-3-8B-Instruct \
            --measurement-interval 10000 \
            --profile-export-file ${INPUT_SEQUENCE_LENGTH}_${OUTPUT_SEQUENCE_LENGTH}.json \
            -- \
            -v \
            --max-threads=256
    done
}

# Iterate over all defined use cases and run the benchmark for each
for description in "${!useCases[@]}"; do
    runBenchmark "$description"
done

Save this script in a working directory, such as under /workdir/benchmark.sh. You can then execute it with the following command.


cd /workdir
bash benchmark.sh

Note

The --measurement-interval 10000 argument sets the time interval used for each measurement, in milliseconds. GenAI-Perf measures the requests that finish within the specified time interval, so choose a value large enough for several requests to finish. For larger models (for example, Llama-3 70B) and higher concurrencies (for example, 250), choose a larger value (for example, 100000, which is 100 seconds).
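
As a rough sizing check, you can estimate how many requests finish within a given interval from the concurrency and an expected per-request latency; the latency value below is an assumption you would replace with your own measurement.

# Rough sizing check for --measurement-interval (all values are assumptions)
expected_request_latency_s = 15.0    # approximate end-to-end latency of one request at this load
concurrency = 250
measurement_interval_ms = 100000

# Approximate number of requests that can complete within the interval
requests_finished = concurrency * (measurement_interval_ms / 1000) / expected_request_latency_s
print(f"~{requests_finished:.0f} requests expected to complete in the interval")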

When the tests complete, GenAI-Perf generates structured outputs in a default directory named artifacts under your mounted working directory (/workdir in these examples), organized by model name, concurrency, and input/output length. Your results should look similar to the following.


/workdir/artifacts
├── meta_llama3-8b-instruct-openai-chat-concurrency1
│   ├── 200_200.csv
│   ├── 200_200_genai_perf.csv
│   ├── 200_5.csv
│   ├── 200_5_genai_perf.csv
│   ├── all_data.gzip
│   ├── llm_inputs.json
│   ├── plots
│   └── profile_export_genai_perf.json
├── meta_llama3-8b-instruct-openai-chat-concurrency10
│   ├── 200_200.csv
│   ├── 200_200_genai_perf.csv
│   ├── 200_5.csv
│   ├── 200_5_genai_perf.csv
│   ├── all_data.gzip
│   ├── llm_inputs.json
│   ├── plots
│   └── profile_export_genai_perf.json
├── meta_llama3-8b-instruct-openai-chat-concurrency100
│   ├── 200_200.csv
…

The *genai_perf.csv files contain the main benchmarking results. Use the following Python code snippet to parse a file for a given use case into a pandas DataFrame.


import pandas as pd
import io

def parse_data(file_path):
    # Create a StringIO buffer
    buffer = io.StringIO()
    with open(file_path, 'rt') as file:
        for i, line in enumerate(file):
            if i not in [6, 7]:
                buffer.write(line)

    # Make sure to reset the buffer's position to the beginning before reading
    buffer.seek(0)

    # Read the buffer into a pandas DataFrame
    df = pd.read_csv(buffer)
    return df
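
For example, to load one of the generated result files (the path below assumes the artifact layout shown earlier):

# Load the 200/200 use case results for concurrency 1
df = parse_data("./artifacts/meta_llama3-8b-instruct-openai-chat-concurrency1/200_200_genai_perf.csv")
print(df.head())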

You can also read the tokens-per-second and TTFT metrics across all concurrencies for a given use case using the following Python snippet.


import os

root_dir = "./artifacts"
directory_prefix = "meta_llama3-8b-instruct-openai-chat-concurrency"

concurrencies = [1, 2, 5, 10, 50, 100, 250]

TPS = []
TTFT = []
for con in concurrencies:
    df = parse_data(os.path.join(root_dir, directory_prefix + str(con), "200_200_genai_perf.csv"))
    TPS.append(df.iloc[5]['avg'])
    TTFT.append(df.iloc[0]['avg'] / 1e9)
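
If you prefer to keep the sweep results in a single table, the following small sketch collects the same two metrics into a DataFrame indexed by concurrency:

import pandas as pd

# Summarize the metrics gathered above in one table
summary = pd.DataFrame({"concurrency": concurrencies, "TTFT_s": TTFT, "TPS": TPS}).set_index("concurrency")
print(summary)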

Finally, we can plot and analyze the latency-throughput curve using the collected data with the code below. Here, each data point corresponds to a concurrency value.


import plotly.express as px

fig = px.line(x=TTFT, y=TPS, text=concurrencies)
fig.update_layout(xaxis_title="Single User: time to first token(s)", yaxis_title="Total System: tokens/s")
fig.show()

The resulting plot using GenAI-Perf measurement data looks like the following.

Figure 7. Latency-throughput curve plot using data generated by GenAI-Perf.

The previous plot shows TTFT on the x-axis, total system throughput on the y-axis, and the concurrency value on each dot. There are two ways to use the plot:

  1. An LLM application owner with a latency budget (the maximum acceptable TTFT) uses that value for x and reads off the matching y value and concurrency. This shows the highest throughput achievable within that latency limit, and the concurrency value that delivers it.

  2. An LLM application owner who plans to run at a given concurrency can use that value to locate the dot on the graph. The matching x and y values show the expected latency and throughput at that concurrency level.

The plot also shows the concurrencies where latency grows quickly with little or no throughput gain. For example, in the plot above, concurrency=100 is one such value.
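
For the first usage, a small sketch that picks the highest-throughput concurrency under a TTFT budget follows; the 0.5 s budget is an arbitrary example value.

# Pick the best operating point under a latency budget (the budget value is an example)
ttft_budget_s = 0.5

candidates = [(c, tps) for c, ttft, tps in zip(concurrencies, TTFT, TPS) if ttft <= ttft_budget_s]
if candidates:
    best_concurrency, best_tps = max(candidates, key=lambda x: x[1])
    print(f"Best concurrency within budget: {best_concurrency} ({best_tps:.1f} tokens/s)")
else:
    print("No concurrency value meets the TTFT budget")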

Similar plots can use ITL, e2e_latency, or TPS_per_user as the x-axis, showing the trade-off between total system throughput and individual user latency.
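
For example, per-user throughput can be approximated by dividing total system throughput by concurrency (a simplification that ignores per-request variation), and plotted with the same data:

import plotly.express as px

# Approximate per-user throughput as system throughput divided by concurrency
TPS_per_user = [tps / c for tps, c in zip(TPS, concurrencies)]

fig = px.line(x=TPS_per_user, y=TPS, text=concurrencies)
fig.update_layout(xaxis_title="Single User: tokens/s", yaxis_title="Total System: tokens/s")
fig.show()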
