# Performance
> **Important:** A comprehensive guide to NIM VLM performance benchmarking is under development. In the meantime, developers are encouraged to consult the NIM for LLMs Benchmarking Guide.
## Performance Profiling
You can use the `genai-perf` tool from `perf_analyzer` to benchmark the performance of VLMs under simulated production load.
> **Important:** Update the model name according to your requirements. For example, for the meta/llama-3.2-11b-vision-instruct model, you can use the commands below.

Execute the following commands to run a performance benchmark with the `genai-perf` command-line tool. A quick start guide for `genai-perf` can be found here.

First, run the container with `genai-perf` installed:
```shell
export TOKENIZER_PATH=...  # directory containing tokenizer.json for the model
docker run -it --net=host --gpus=all -v ${TOKENIZER_PATH}:/workspace/tokenizer nvcr.io/nvidia/tritonserver:24.10-py3-sdk
```
> **Important:** We recommend setting `TOKENIZER_PATH` to the model's checkpoint directory pulled from HuggingFace (e.g., this for meta/llama-3.2-11b-vision-instruct). The directory must include the model's `config.json` and all files relevant to the tokenizer (i.e., `tokenizer.json`, `tokenizer_config.json`, and `special_tokens_map.json`).
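Before launching the container, it can save a failed run to verify that the tokenizer directory actually contains the files listed above. The following is a small sketch; the `check_tokenizer_dir` helper is not part of NIM or `genai-perf`, just an illustration:

```python
import os

# Files the note above says must be present in TOKENIZER_PATH.
REQUIRED_FILES = (
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
)

def check_tokenizer_dir(path):
    """Return the list of required tokenizer files missing from `path`."""
    return [f for f in REQUIRED_FILES if not os.path.isfile(os.path.join(path, f))]

if __name__ == "__main__":
    missing = check_tokenizer_dir(os.environ.get("TOKENIZER_PATH", "."))
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Tokenizer directory looks complete.")
```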
> **Important:** The following command has been tested with the `genai-perf` 24.10 release.
```shell
export CONCURRENCY=8                # number of requests sent concurrently
export MEASUREMENT_INTERVAL=300000  # max time window (in milliseconds) for genai-perf to wait for a response
export INPUT_SEQ_LEN=5000           # input sequence length (in tokens)
export OUTPUT_SEQ_LEN=5000          # output sequence length (in tokens)
export IMAGE_WIDTH=1120             # width of images used in profiling
export IMAGE_HEIGHT=1120            # height of images used in profiling
```
```shell
genai-perf profile \
  -m meta/llama-3.2-11b-vision-instruct \
  --concurrency ${CONCURRENCY} \
  --tokenizer /workspace/tokenizer \
  --endpoint v1/chat/completions \
  --endpoint-type vision \
  --service-kind openai \
  --streaming -u http://127.0.0.1:8000 \
  --num-prompts 100 \
  --measurement-interval ${MEASUREMENT_INTERVAL} \
  --synthetic-input-tokens-mean ${INPUT_SEQ_LEN} \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean ${OUTPUT_SEQ_LEN} \
  --extra-inputs max_tokens:${OUTPUT_SEQ_LEN} \
  --extra-inputs min_tokens:${OUTPUT_SEQ_LEN} \
  --extra-inputs ignore_eos:true \
  --artifact-dir tmp/ \
  -v \
  --image-width-mean ${IMAGE_WIDTH} \
  --image-width-stddev 0 \
  --image-height-mean ${IMAGE_HEIGHT} \
  --image-height-stddev 0 \
  --image-format png \
  -- --max-threads ${CONCURRENCY}
```
The full documentation for `genai-perf` can be found here.
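`genai-perf` writes its measurements into the directory given by `--artifact-dir` (`tmp/` in the command above). The exact file layout and JSON schema vary between releases, so treat the following as an illustrative sketch only: it walks the artifact directory, loads any JSON files it finds, and prints metric entries that contain an `avg` field (the `{"metric": {"avg": ...}}` shape is an assumption about the export format, not a documented contract):

```python
import json
import os

def summarize_artifacts(artifact_dir):
    """Collect {metric_name: avg_value} from every JSON file under artifact_dir.

    Assumes metrics are stored as {"metric": {"avg": ...}} objects at the top
    level of each file -- adjust for the schema of your genai-perf release.
    """
    summary = {}
    for root, _dirs, files in os.walk(artifact_dir):
        for name in files:
            if not name.endswith(".json"):
                continue
            with open(os.path.join(root, name)) as fh:
                try:
                    data = json.load(fh)
                except ValueError:
                    continue  # skip files that are not valid JSON
            if isinstance(data, dict):
                for metric, stats in data.items():
                    if isinstance(stats, dict) and "avg" in stats:
                        summary[metric] = stats["avg"]
    return summary

if __name__ == "__main__":
    for metric, avg in summarize_artifacts("tmp/").items():
        print(f"{metric}: {avg}")
```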
## Benchmarking Results
ISL: input sequence length; OSL: output sequence length; TTFT: time to first token; ITL: inter-token latency.

Min Latency numbers are captured with concurrency 1 (a single stream). Max Throughput numbers are measured at the maximum concurrency that saturates throughput.
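In the Min Latency tables below, average request latency is close to TTFT + ITL × OSL: the request streams its first token after TTFT, then emits the remaining output at one inter-token interval per token. This is a back-of-the-envelope consistency check, not an official `genai-perf` formula:

```python
def approx_request_latency(ttft_ms, itl_ms, osl_tokens):
    """Rule of thumb: request latency ~= TTFT + ITL * number of output tokens."""
    return ttft_ms + itl_ms * osl_tokens

# Checking against a reported row (OSL 1000, TTFT 198 ms, ITL 7.4 ms):
print(round(approx_request_latency(198, 7.4, 1000)))  # → 7598
```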
**Min Latency**

| ISL | OSL | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|
| 1000 | 1000 | 198 | 7.4 | 7598 |

**Max Throughput**

| ISL | OSL | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 1000 | 1000 | 29.3 | 2.72 |
**Min Latency**

| ISL | OSL | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|
| 1000 | 1000 | 202 | 8.4 | 8602 |

**Max Throughput**

| ISL | OSL | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 1000 | 1000 | 28.1 | 2.04 |
**Min Latency**

| ISL | OSL | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|
| 1000 | 1000 | 367 | 13.7 | 14067 |
| 20000 | 2000 | 2690 | 15.5 | 33690 |

**Max Throughput**

| ISL | OSL | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 1000 | 1000 | 55.1 | 1.04 |
| 20000 | 2000 | 41.0 | 0.11 |
**Min Latency**

| ISL | OSL | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|
| 1000 | 1000 | 531 | 9.7 | 10231 |
| 20000 | 2000 | 1839 | 10.7 | 23239 |
| 60000 | 2000 | 6830 | 12.4 | 31630 |

**Max Throughput**

| ISL | OSL | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 1000 | 1000 | 53.3 | 1.59 |
| 20000 | 2000 | 55.1 | 0.24 |
| 60000 | 2000 | 63.3 | 0.07 |
**Min Latency**

| ISL | OSL | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|
| 60000 | 2000 | 7421 | 12.6 | 32621 |

**Max Throughput**

| ISL | OSL | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 60000 | 2000 | 82.7 | 0.06 |
**Min Latency**

| ISL | OSL | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|
| 1000 | 1000 | 1142 | 14.9 | 16042 |
| 20000 | 2000 | 4626 | 16.5 | 37626 |

**Max Throughput**

| ISL | OSL | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 1000 | 1000 | 24.3 | 0.32 |
| 20000 | 2000 | 44.7 | 0.08 |
**Min Latency**

| ISL | OSL | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|
| 1000 | 1000 | 531 | 17.4 | 17931 |
| 20000 | 2000 | 2119 | 18.2 | 38519 |
| 60000 | 2000 | 6347 | 19.0 | 44347 |

**Max Throughput**

| ISL | OSL | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 1000 | 1000 | 48.3 | 1.17 |
| 60000 | 2000 | 48.5 | 0.08 |
**Min Latency**

| ISL | OSL | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|
| 60000 | 2000 | 8404 | 25.2 | 58804 |

**Max Throughput**

| ISL | OSL | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 60000 | 2000 | 32.0 | 0.03 |
**Min Latency**

| ISL | OSL | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|
| 1000 | 1000 | 1060 | 36.6 | 37660 |
| 20000 | 2000 | 7120 | 38.0 | 83120 |
| 60000 | 2000 | 26692 | 41.1 | 108892 |

**Max Throughput**

| ISL | OSL | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 1000 | 1000 | 77.5 | 0.39 |
| 20000 | 2000 | 76.4 | 0.05 |
| 60000 | 2000 | 75.5 | 0.02 |
**Min Latency**

| ISL | OSL | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|
| 1000 | 1000 | 1121 | 26.9 | 28021 |
| 20000 | 2000 | 5081 | 27.5 | 60081 |
| 60000 | 2000 | 17640 | 29.4 | 76440 |

**Max Throughput**

| ISL | OSL | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 1000 | 1000 | 94.6 | 0.59 |
| 20000 | 2000 | 102.6 | 0.11 |
| 60000 | 2000 | 114.8 | 0.04 |
**Min Latency**

| ISL | OSL | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|
| 1000 | 1000 | 1102 | 40.4 | 41502 |

**Max Throughput**

| ISL | OSL | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 1000 | 1000 | 66.7 | 0.12 |
Not all performance numbers are available above. Additional data will be added to this page over time.