Performance#
Important
A comprehensive guide to NIM VLM performance benchmarking is under development. In the meantime, consult the NIM for LLM Benchmarking Guide.
Performance Profiling#
You can use the genai-perf tool from perf_analyzer to benchmark the performance of VLMs under simulated production load. A quick start guide for genai-perf can be found here.
First, prepare a directory for the model’s tokenizer. genai-perf uses it to calculate statistics related to output sequence length.
Important
We recommend setting TOKENIZER_PATH to the model’s checkpoint directory pulled from Hugging Face (e.g., this directory for meta/llama-3.2-11b-vision-instruct).
For models not available on Hugging Face, use the Create Model Store utility. An example for nemoretriever-parse is given below.
The directory must include the model’s config.json file and all files relevant to the tokenizer (i.e., tokenizer.json, tokenizer_config.json, and special_tokens_map.json).
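Before pointing genai-perf at the directory, it can help to confirm that the files listed above are actually present. The following is a minimal sketch; `check_tokenizer_dir` is a hypothetical helper, not part of NIM or genai-perf, and it only checks that the four files exist, not that their contents are valid.

```shell
# Hypothetical helper (not part of NIM or genai-perf): report which of the
# files required above are missing from a tokenizer directory.
check_tokenizer_dir() {
  local dir="$1" missing=0 f
  for f in config.json tokenizer.json tokenizer_config.json special_tokens_map.json; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  # Returns 0 when all files are present, 1 otherwise.
  return "$missing"
}
```

For example, `check_tokenizer_dir "$TOKENIZER_PATH" || echo "tokenizer directory is incomplete"` prints the missing files to stderr before the warning.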
The following example shows how to use the Create Model Store utility to set up a nemoretriever-parse tokenizer for genai-perf:
export IMG_NAME=nvcr.io/nim/nvidia/nemoretriever-parse:latest # image of the container under test
export TOKENIZER_PATH=~/.cache/nim/tokenizer # the place where the tokenizer files will be located
mkdir -p $TOKENIZER_PATH
# get one of the available profiles
PROFILE=$(docker run -it --rm \
    -e NGC_API_KEY \
    $IMG_NAME \
    list-model-profiles | grep -oE '[0-9a-f]{64}' | tail -1)
# create the model store
docker run -it --rm \
    -e NGC_API_KEY \
    -v "$TOKENIZER_PATH:/opt/nim/.cache" \
    -u $(id -u) \
    $IMG_NAME \
    create-model-store -m /opt/nim/.cache -p $PROFILE
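The layout that create-model-store produces under the mounted directory is not described here, so the tokenizer files may not sit at the top level of `$TOKENIZER_PATH`. As a sketch under that assumption, the hypothetical helper below searches a directory tree for `tokenizer.json` and prints the directory that contains the first match.

```shell
# Hypothetical helper: locate the directory that holds tokenizer.json.
# Assumes the model store may nest the tokenizer files in a subdirectory;
# -L follows symlinks in case the store links to cached files.
find_tokenizer_dir() {
  local hit
  hit=$(find -L "$1" -type f -name tokenizer.json 2>/dev/null | head -n 1)
  if [ -z "$hit" ]; then
    echo "no tokenizer.json found under $1" >&2
    return 1
  fi
  dirname "$hit"
}
```

A possible use is `export TOKENIZER_PATH=$(find_tokenizer_dir ~/.cache/nim/tokenizer)` before starting the next container.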
Now, run the tritonserver container:
export TOKENIZER_PATH=... # this is where tokenizer.json for the model is located
docker run -it --net=host -v ${TOKENIZER_PATH}:/workspace/tokenizer nvcr.io/nvidia/tritonserver:24.10-py3-sdk
In the container, install the version of genai-perf that supports all the containers we offer, and launch the benchmark:
Important
The following command has been tested using commit 6f7c328c27dafbde62207456fb7f8366e422ee76 of genai-perf.
Important
For the nemoretriever-parse model, OUTPUT_SEQ_LEN cannot be greater than 3579.
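One way to avoid launching a run that violates this limit is to clamp the value once the model name and output length are known. The sketch below assumes the model name `nvidia/nemoretriever-parse` used in the benchmark script; `clamp_output_seq_len` is a hypothetical helper, and 3579 is the limit stated above.

```shell
# Hypothetical guard: cap the output sequence length for nemoretriever-parse.
# Other models are passed through unchanged.
clamp_output_seq_len() {
  local model="$1" osl="$2" limit=3579
  if [ "$model" = "nvidia/nemoretriever-parse" ] && [ "$osl" -gt "$limit" ]; then
    echo "OUTPUT_SEQ_LEN=$osl exceeds $limit for $model; using $limit" >&2
    osl="$limit"
  fi
  echo "$osl"
}
```

It could be applied as `export OUTPUT_SEQ_LEN=$(clamp_output_seq_len "$MODEL" "$OUTPUT_SEQ_LEN")` after the variables are exported and before genai-perf is started.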
git clone -b main --single-branch https://github.com/triton-inference-server/perf_analyzer.git /perf_analyzer
cd /perf_analyzer
git checkout 6f7c328c27dafbde62207456fb7f8366e422ee76
pip uninstall -y genai-perf
pip install /perf_analyzer/genai-perf
export MODEL=meta/llama-3.2-11b-vision-instruct # the name of the model under test
export CONCURRENCY=8 # number of requests sent concurrently
export MEASUREMENT_INTERVAL=300000 # max time window (in milliseconds) for genai-perf to wait for a response
export INPUT_SEQ_LEN=2000 # input sequence length (in the number of tokens) - ignored for nemoretriever-parse
export OUTPUT_SEQ_LEN=2000 # output sequence length (in the number of tokens)
export IMAGE_WIDTH=2048 # width of images used in profiling
export IMAGE_HEIGHT=1648 # height of images used in profiling
export STREAMING=--streaming # stream responses - disabled for nemoretriever-parse
if [[ "$MODEL" == nvidia/nemoretriever-parse ]]; then
    export INPUT_SEQ_LEN=0
    export STREAMING=''
    export NEMORETRIEVER_PARSE_EXTRA=( "--extra-inputs" '{"tools": [{"type": "function", "function": {"name": "evaluation_markdown_bbox"}}]}' )
fi
genai-perf profile \
    -m ${MODEL} \
    --concurrency ${CONCURRENCY} \
    --tokenizer /workspace/tokenizer \
    --endpoint v1/chat/completions \
    --endpoint-type vision \
    --service-kind openai \
    ${STREAMING} -u http://127.0.0.1:8000 \
    --num-prompts 100 \
    --measurement-interval ${MEASUREMENT_INTERVAL} \
    --synthetic-input-tokens-mean ${INPUT_SEQ_LEN} \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean ${OUTPUT_SEQ_LEN} \
    --extra-inputs max_tokens:${OUTPUT_SEQ_LEN} \
    --extra-inputs min_tokens:${OUTPUT_SEQ_LEN} \
    --extra-inputs ignore_eos:true \
    "${NEMORETRIEVER_PARSE_EXTRA[@]}" \
    --artifact-dir tmp/ \
    -v \
    --image-width-mean ${IMAGE_WIDTH} --image-width-stddev 0 \
    --image-height-mean ${IMAGE_HEIGHT} --image-height-stddev 0 \
    --image-format png \
    -- --max-threads ${CONCURRENCY}
The full documentation for genai-perf can be found here.
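To see how latency and throughput change with load, the same benchmark can be repeated over several concurrency values. The sketch below reuses only the flags from the command above; `sweep_concurrency` and the `RUN` dry-run variable are hypothetical conveniences, the per-concurrency artifact directory name is an assumption, and the nemoretriever-parse extra inputs are left out for brevity.

```shell
# Hypothetical wrapper (not part of genai-perf): run the benchmark once per
# concurrency value given as an argument. Set RUN=echo for a dry run that
# only prints the commands. Assumes MODEL, STREAMING, MEASUREMENT_INTERVAL,
# INPUT_SEQ_LEN, OUTPUT_SEQ_LEN, IMAGE_WIDTH and IMAGE_HEIGHT are exported
# as in the benchmark script above.
sweep_concurrency() {
  local c
  for c in "$@"; do
    ${RUN:-} genai-perf profile \
      -m "${MODEL}" \
      --concurrency "${c}" \
      --tokenizer /workspace/tokenizer \
      --endpoint v1/chat/completions \
      --endpoint-type vision \
      --service-kind openai \
      ${STREAMING} -u http://127.0.0.1:8000 \
      --num-prompts 100 \
      --measurement-interval "${MEASUREMENT_INTERVAL}" \
      --synthetic-input-tokens-mean "${INPUT_SEQ_LEN}" \
      --synthetic-input-tokens-stddev 0 \
      --output-tokens-mean "${OUTPUT_SEQ_LEN}" \
      --extra-inputs "max_tokens:${OUTPUT_SEQ_LEN}" \
      --extra-inputs "min_tokens:${OUTPUT_SEQ_LEN}" \
      --extra-inputs ignore_eos:true \
      --artifact-dir "tmp/concurrency_${c}" \
      --image-width-mean "${IMAGE_WIDTH}" --image-width-stddev 0 \
      --image-height-mean "${IMAGE_HEIGHT}" --image-height-stddev 0 \
      --image-format png \
      -- --max-threads "${c}"
  done
}
```

Calling `RUN=echo sweep_concurrency 1 8 32` prints the three commands without running them; dropping `RUN=echo` runs them in sequence, each writing to its own artifact directory.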
Benchmarking Results#
ISL: Text input sequence length; OSL: Text output sequence length; TTFT: Time to first token; ITL: Inter-token latency
Min Latency numbers are captured with concurrency 1 (single stream). Max Throughput numbers are measured with the maximum concurrency saturating the throughput.
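As a reading aid, the single-stream figures reported on this page are consistent with the approximation request latency ≈ TTFT + OSL × ITL, and request throughput can be converted to output tokens per second by multiplying by OSL. The sketch below shows both calculations with awk; the function names are hypothetical, and the formulas are approximations for interpreting the numbers, not a description of how genai-perf computes its metrics.

```shell
# Hypothetical helpers for sanity-checking reported numbers.

# Approximate request latency in msec. Args: TTFT_msec ITL_msec OSL
estimate_request_latency_ms() {
  awk -v ttft="$1" -v itl="$2" -v osl="$3" \
    'BEGIN { printf "%.0f\n", ttft + itl * osl }'
}

# Approximate output tokens per second. Args: requests_per_sec OSL
output_token_throughput() {
  awk -v rps="$1" -v osl="$2" 'BEGIN { printf "%.0f\n", rps * osl }'
}

estimate_request_latency_ms 198 7.4 1000   # prints 7598
output_token_throughput 2.72 1000          # prints 2720
```

The first call reproduces the 7598 msec average request latency of the first row reported below (TTFT 198 msec, ITL 7.4 msec, OSL 1000).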
| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 198 | 7.4 | 7598 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 29.3 | 2.72 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 202 | 8.4 | 8602 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 28.1 | 2.04 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 367 | 13.7 | 14067 |
| 20000 | 2000 | 1120x1120 | 2690 | 15.5 | 33690 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 55.1 | 1.04 |
| 20000 | 2000 | 1120x1120 | 41.0 | 0.11 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 531 | 9.7 | 10231 |
| 20000 | 2000 | 1120x1120 | 1839 | 10.7 | 23239 |
| 60000 | 2000 | 1120x1120 | 6830 | 12.4 | 31630 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 53.3 | 1.59 |
| 20000 | 2000 | 1120x1120 | 55.1 | 0.24 |
| 60000 | 2000 | 1120x1120 | 63.3 | 0.07 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 60000 | 2000 | 1120x1120 | 7421 | 12.6 | 32621 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 60000 | 2000 | 1120x1120 | 82.7 | 0.06 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 1142 | 14.9 | 16042 |
| 20000 | 2000 | 1120x1120 | 4626 | 16.5 | 37626 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 24.3 | 0.32 |
| 20000 | 2000 | 1120x1120 | 44.7 | 0.08 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 531 | 17.4 | 17931 |
| 20000 | 2000 | 1120x1120 | 2119 | 18.2 | 38519 |
| 60000 | 2000 | 1120x1120 | 6347 | 19.0 | 44347 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 48.3 | 1.17 |
| 60000 | 2000 | 1120x1120 | 48.5 | 0.08 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 60000 | 2000 | 1120x1120 | 8404 | 25.2 | 58804 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 60000 | 2000 | 1120x1120 | 32.0 | 0.03 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 1060 | 36.6 | 37660 |
| 20000 | 2000 | 1120x1120 | 7120 | 38.0 | 83120 |
| 60000 | 2000 | 1120x1120 | 26692 | 41.1 | 108892 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 77.5 | 0.39 |
| 20000 | 2000 | 1120x1120 | 76.4 | 0.05 |
| 60000 | 2000 | 1120x1120 | 75.5 | 0.02 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 1121 | 26.9 | 28021 |
| 20000 | 2000 | 1120x1120 | 5081 | 27.5 | 60081 |
| 60000 | 2000 | 1120x1120 | 17640 | 29.4 | 76440 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 94.6 | 0.59 |
| 20000 | 2000 | 1120x1120 | 102.6 | 0.11 |
| 60000 | 2000 | 1120x1120 | 114.8 | 0.04 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 1102 | 40.4 | 41502 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 66.7 | 0.12 |

| ISL | OSL | Image Size (width x height in px) | Avg Request Latency (msec) |
|---|---|---|---|
| 0 | 200 | 1648x2048 | 494 |
| 0 | 1000 | 1648x2048 | 1320 |
| 0 | 2000 | 1648x2048 | 2388 |
| 0 | 3579 | 1648x2048 | 4104 |

| ISL | OSL | Image Size (width x height in px) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 0 | 200 | 1648x2048 | 5.07 |
| 0 | 1000 | 1648x2048 | 3.75 |
| 0 | 2000 | 1648x2048 | 2.75 |
| 0 | 3579 | 1648x2048 | 1.80 |

| ISL | OSL | Image Size (width x height in px) | Avg Request Latency (msec) |
|---|---|---|---|
| 0 | 200 | 1648x2048 | 828 |
| 0 | 1000 | 1648x2048 | 1872 |
| 0 | 2000 | 1648x2048 | 3403 |
| 0 | 3579 | 1648x2048 | 5852 |

| ISL | OSL | Image Size (width x height in px) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 0 | 200 | 1648x2048 | 1.54 |
| 0 | 1000 | 1648x2048 | 2.1 |
| 0 | 2000 | 1648x2048 | 1.6 |
| 0 | 3579 | 1648x2048 | 1.1 |

| ISL | OSL | Image Size (width x height in px) | Avg Request Latency (msec) |
|---|---|---|---|
| 0 | 200 | 1648x2048 | 710 |
| 0 | 1000 | 1648x2048 | 1822 |
| 0 | 2000 | 1648x2048 | 3244 |
| 0 | 3579 | 1648x2048 | 5617 |

| ISL | OSL | Image Size (width x height in px) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 0 | 200 | 1648x2048 | 2.6 |
| 0 | 1000 | 1648x2048 | 1.7 |
| 0 | 2000 | 1648x2048 | 1.1 |
| 0 | 3579 | 1648x2048 | 0.7 |

Not all performance numbers are available above. Additional data will be added to this page over time.