Performance#
You can use the perf_analyzer
tool to benchmark the performance of the NVIDIA NIM for Visual Generative AI. perf_analyzer
is pre-installed in the NVIDIA Triton Inference Server SDK container.
Procedure#
Create the directory input_dir and add a JSON file with an example payload:
mkdir input_dir
echo '{
"data": [
{
"payload": [
{
"prompt": "A simple coffee shop interior",
"mode": "base",
"seed": 0,
"steps": 50
}
]
}
]
}' > input_dir/input.json
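As a quick sanity check before benchmarking, you can confirm the payload parses as valid JSON. The sketch below rewrites a compact copy of the same payload and validates it (assumes python3 is available on the host; the file path matches the step above):

```shell
mkdir -p input_dir
# Write a compact copy of the example payload, then confirm it parses as
# valid JSON before handing it to perf_analyzer (sketch; assumes python3).
cat > input_dir/input.json <<'EOF'
{"data": [{"payload": [{"prompt": "A simple coffee shop interior", "mode": "base", "seed": 0, "steps": 50}]}]}
EOF
python3 -m json.tool input_dir/input.json > /dev/null && echo "payload OK"
```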
mkdir input_dir
input_image_path="input.jpg"
# download an example image
curl https://assets.ngc.nvidia.com/products/api-catalog/flux/input/1.jpg > $input_image_path
image_b64=$(base64 -w 0 $input_image_path)
echo '{
"data": [
{
"payload": [
{
"prompt": "A simple coffee shop interior",
"mode: "canny",
"image": "data:image/png;base64,'${image_b64}'",
"preprocess_image": true,
"seed": 0,
"steps": 50
}
]
}
]
}' > input_dir/input.json
mkdir input_dir
input_image_path="input.jpg"
# download an example image
curl https://assets.ngc.nvidia.com/products/api-catalog/flux/input/1.jpg > $input_image_path
image_b64=$(base64 -w 0 $input_image_path)
echo '{
"data": [
{
"payload": [
{
"prompt": "A simple coffee shop interior",
"mode: "depth",
"image": "data:image/png;base64,'${image_b64}'",
"preprocess_image": true,
"seed": 0,
"steps": 50
}
]
}
]
}' > input_dir/input.json
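The image must reach the service as a single-line base64 string; `-w 0` disables line wrapping in GNU base64 (BSD/macOS base64 does not accept `-w`, so this step may need adjusting there). A minimal round-trip sketch of that encoding step:

```shell
# Sketch: confirm that base64 -w 0 emits one unwrapped line that decodes
# back to the original bytes (the same encoding step used for the payload).
printf 'example-image-bytes' > /tmp/b64_check.bin
b64=$(base64 -w 0 /tmp/b64_check.bin)
[ "$(echo "$b64" | wc -l)" -eq 1 ] && echo "single line"
echo "$b64" | base64 -d | cmp -s - /tmp/b64_check.bin && echo "round-trip OK"
```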
The preceding payloads are used for all inference calls. The parameters that influence performance are steps, which sets the number of diffusion steps to run (all variants), and preprocess_image, which indicates whether to convert an input image to canny edges or a depth map according to the mode.
The description of all API parameters can be found in the API Reference.
mkdir input_dir
echo '{
"data": [
{
"payload": [
{
"prompt": "A simple coffee shop interior",
"seed": 0,
"steps": 4
}
]
}
]
}' > input_dir/input.json
Use the following example to run the Triton Inference Server SDK docker container, mounting the directories input_dir and output_dir.
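Creating output_dir on the host before launching the container avoids a common pitfall (a sketch of this setup step; if the directory does not exist, docker typically creates the mount point owned by root):

```shell
# Create the results directory up front so the mounted /output_dir maps to
# a user-owned host directory rather than one created by the docker daemon.
mkdir -p output_dir
```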
export RELEASE="24.09"

docker run -it --rm --name=performance_benchmark \
  --runtime=nvidia \
  --network="host" \
  -v $(pwd)/input_dir:/input_dir \
  -v $(pwd)/output_dir:/output_dir \
  --entrypoint perf_analyzer \
  nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk \
  -m flux.1-dev \
  -u http://localhost:8000 --endpoint v1/infer \
  --async --service-kind openai -i http \
  --input-data /input_dir/input.json \
  --profile-export-file /output_dir/profile_export_flux.1-dev.json \
  -f /output_dir/latency_report.csv \
  --verbose \
  --verbose-csv \
  --warmup-request-count 3 \
  --request-count 10 \
  --concurrency-range 1

export RELEASE="24.09"

docker run -it --rm --name=performance_benchmark \
  --runtime=nvidia \
  --network="host" \
  -v $(pwd)/input_dir:/input_dir \
  -v $(pwd)/output_dir:/output_dir \
  --entrypoint perf_analyzer \
  nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk \
  -m flux.1-schnell \
  -u http://localhost:8000 --endpoint v1/infer \
  --async --service-kind openai -i http \
  --input-data /input_dir/input.json \
  --profile-export-file /output_dir/profile_export_flux.1-schnell.json \
  -f /output_dir/latency_report.csv \
  --verbose \
  --verbose-csv \
  --warmup-request-count 3 \
  --request-count 10 \
  --concurrency-range 1
The perf_analyzer tool creates two files in $(pwd)/output_dir. The profile_export_{model-name}.json file includes detailed results for each request. The latency_report.csv file includes the average and percentile latency numbers in microseconds. Divide 1,000,000 by the average latency value to get images per second (at a concurrency of 1).
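That arithmetic can be sketched with awk (the 2,000,000 µs value below is hypothetical; substitute the average latency reported in latency_report.csv):

```shell
# Throughput from average latency: images/second = 1,000,000 / latency_us.
# A hypothetical 2,000,000 us (2 s) average latency gives 0.50 images/second.
avg_latency_us=2000000
awk -v us="$avg_latency_us" 'BEGIN { printf "%.2f\n", 1000000 / us }'
```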
Perf Analyzer Measurement Parameters#
Parameter | Description
---|---
--request-count | The total number of requests to use for measurement.
--warmup-request-count | The number of warmup requests to send before benchmarking.
--concurrency-range <start:end:step> | The range of concurrency levels covered by Perf Analyzer. Perf Analyzer starts from the concurrency level of 'start' and goes until 'end' with a stride of 'step'.
You can see the full set of command-line options for perf_analyzer
in the Command Line Options section of the documentation.