Performance#

You can use the perf_analyzer tool to benchmark the performance of the NVIDIA NIM for Visual Generative AI. perf_analyzer is pre-installed in the NVIDIA Triton Inference Server SDK container.

Procedure#

Create the directory input_dir and add a JSON file with an example payload. For the base mode:

mkdir input_dir

echo '{
    "data": [
        {
            "payload": [
                {
                    "prompt": "A simple coffee shop interior",
                    "mode": "base",
                    "seed": 0,
                    "steps": 50
                }
            ]
        }
    ]
}' > input_dir/input.json

For the canny mode:

mkdir input_dir

input_image_path="input.jpg"
# download an example image
curl https://assets.ngc.nvidia.com/products/api-catalog/flux/input/1.jpg > $input_image_path
image_b64=$(base64 -w 0 $input_image_path)
echo '{
    "data": [
        {
            "payload": [
                {
                    "prompt": "A simple coffee shop interior",
                    "mode: "canny",
                    "image": "data:image/png;base64,'${image_b64}'",
                    "preprocess_image": true,
                    "seed": 0,
                    "steps": 50
                }
            ]
        }
    ]
}' > input_dir/input.json

For the depth mode:

mkdir input_dir

input_image_path="input.jpg"
# download an example image
curl https://assets.ngc.nvidia.com/products/api-catalog/flux/input/1.jpg > $input_image_path
image_b64=$(base64 -w 0 $input_image_path)
echo '{
    "data": [
        {
            "payload": [
                {
                    "prompt": "A simple coffee shop interior",
                    "mode: "depth",
                    "image": "data:image/png;base64,'${image_b64}'",
                    "preprocess_image": true,
                    "seed": 0,
                    "steps": 50
                }
            ]
        }
    ]
}' > input_dir/input.json

The preceding payloads are used for all inference calls. The parameters that most influence performance are steps, the number of diffusion steps to run (applies to all variants), and preprocess_image, which indicates whether the input image should be converted to canny edges or a depth map according to the mode.

The description of all API parameters can be found in the API Reference.
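Because the payloads above are assembled with shell string interpolation, a quoting mistake is easy to introduce. An optional well-formedness check before benchmarking can catch such errors early; this is a suggestion rather than part of the NIM procedure, and it assumes python3 is available on the host:

```shell
# Write a minimal base-mode payload and confirm that it parses as valid JSON
# before handing it to perf_analyzer.
mkdir -p input_dir
echo '{
    "data": [
        {
            "payload": [
                {
                    "prompt": "A simple coffee shop interior",
                    "mode": "base",
                    "seed": 0,
                    "steps": 50
                }
            ]
        }
    ]
}' > input_dir/input.json

# json.tool exits non-zero on malformed JSON, so a quoting error fails loudly.
python3 -m json.tool input_dir/input.json > /dev/null && echo "payload OK"
```

The same check works unchanged for the canny, depth, and schnell payloads, since it only validates JSON syntax.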

For flux.1-schnell, which uses a payload with fewer diffusion steps:

mkdir input_dir

echo '{
    "data": [
        {
            "payload": [
                {
                    "prompt": "A simple coffee shop interior",
                    "seed": 0,
                    "steps": 4
                }
            ]
        }
    ]
}' > input_dir/input.json


Use the following example to run the Triton Inference Server SDK Docker container, mounting the directories input_dir and output_dir. For flux.1-dev:

export RELEASE="24.09"

docker run -it --rm --name=performance_benchmark \
    --runtime=nvidia \
    --network="host" \
    -v $(pwd)/input_dir:/input_dir \
    -v $(pwd)/output_dir:/output_dir \
    --entrypoint perf_analyzer \
    nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk \
    -m flux.1-dev \
    -u http://localhost:8000 --endpoint v1/infer \
    --async --service-kind openai -i http \
    --input-data /input_dir/input.json \
    --profile-export-file /output_dir/profile_export_flux.1-dev.json \
    -f /output_dir/latency_report.csv \
    --verbose \
    --verbose-csv \
    --warmup-request-count 3 \
    --request-count 10 \
    --concurrency-range 1

For flux.1-schnell:

export RELEASE="24.09"

docker run -it --rm --name=performance_benchmark \
    --runtime=nvidia \
    --network="host" \
    -v $(pwd)/input_dir:/input_dir \
    -v $(pwd)/output_dir:/output_dir \
    --entrypoint perf_analyzer \
    nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk \
    -m flux.1-schnell \
    -u http://localhost:8000 --endpoint v1/infer \
    --async --service-kind openai -i http \
    --input-data /input_dir/input.json \
    --profile-export-file /output_dir/profile_export_flux.1-schnell.json \
    -f /output_dir/latency_report.csv \
    --verbose \
    --verbose-csv \
    --warmup-request-count 3 \
    --request-count 10 \
    --concurrency-range 1

The perf_analyzer tool creates two files in $(pwd)/output_dir. The profile_export_{model-name}.json file includes detailed results for each request. The latency_report.csv file includes the average and percentile latency numbers in microseconds. At a concurrency of 1, divide 1,000,000 by the average latency value to get images per second.
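As a sketch of turning the report into a throughput number, the awk one-liner below applies the latency-to-throughput conversion. The two-line CSV here is a stand-in with an assumed column layout (concurrency, then average latency in microseconds); check the header of your actual output_dir/latency_report.csv and adjust the column index accordingly:

```shell
# Stand-in CSV: 2,000,000 us average latency at concurrency 1 (assumed layout).
cat > latency_report.csv <<'EOF'
Concurrency,Avg latency
1,2000000
EOF

# images/sec = 1,000,000 / average latency in microseconds
awk -F',' 'NR == 2 { printf "%.2f images/sec\n", 1000000 / $2 }' latency_report.csv
# -> 0.50 images/sec
```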

Perf Analyzer Measurement Parameters#

| Parameter | Description |
|-----------|-------------|
| `--request-count N` | The total number of requests to use for measurement. |
| `--warmup-request-count N` | The number of warmup requests to send before benchmarking. |
| `--concurrency-range <start:end:step>` | The range of concurrency levels covered by Perf Analyzer: it starts at the concurrency level 'start' and goes until 'end' with a stride of 'step'. |
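For illustration, the levels visited by a concurrency range follow a simple arithmetic sequence; the snippet below merely expands the sequence for a hypothetical `--concurrency-range 1:8:2` sweep (the flag semantics are as described above, the seq expansion is just the illustration):

```shell
# Levels that --concurrency-range 1:8:2 would cover:
# start at 1, stride by 2, stop before exceeding 8.
start=1; end=8; step=2
seq "$start" "$step" "$end"
# -> 1 3 5 7 (one level per line)
```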

You can see the full set of command-line options for perf_analyzer in the Command Line Options section of the documentation.