Performance

Important

A comprehensive guide to NIM VLM performance benchmarking is under development. In the meantime, developers are encouraged to consult the NIM for LLM Benchmarking Guide.

Performance Profiling

You can use the genai-perf tool from perf_analyzer to benchmark the performance of VLMs under simulated production load.

Important

Update the model name to match your deployment. The following examples use the meta/llama-3.2-11b-vision-instruct model.

Run the commands below to benchmark performance with the genai-perf command-line tool. Refer to the genai-perf quick start guide for an introduction to the tool.

First, run the container with genai-perf installed:

export TOKENIZER_PATH=...    # this is where tokenizer.json for the model is located

docker run -it --net=host --gpus=all -v ${TOKENIZER_PATH}:/workspace/tokenizer nvcr.io/nvidia/tritonserver:24.10-py3-sdk

Important

We recommend setting TOKENIZER_PATH to the model’s checkpoint directory pulled from HuggingFace (for meta/llama-3.2-11b-vision-instruct, the meta-llama/Llama-3.2-11B-Vision-Instruct repository). The directory must include the model’s config.json and all files relevant to the tokenizer (i.e., tokenizer.json, tokenizer_config.json, and special_tokens_map.json).
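
One way to populate TOKENIZER_PATH is with the HuggingFace CLI. The following is a minimal sketch, assuming you have been granted access to the gated meta-llama/Llama-3.2-11B-Vision-Instruct repository and have a valid access token; the file list mirrors the recommendation above:

pip install -U "huggingface_hub[cli]"   # provides the huggingface-cli entry point
huggingface-cli login                   # or export HF_TOKEN=<your token>
huggingface-cli download meta-llama/Llama-3.2-11B-Vision-Instruct \
    config.json tokenizer.json tokenizer_config.json special_tokens_map.json \
    --local-dir ${TOKENIZER_PATH}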

Important

The following command has been tested using the genai-perf 24.10 release.

export CONCURRENCY=8                 # number of requests sent concurrently
export MEASUREMENT_INTERVAL=300000   # time window (in milliseconds) used for each genai-perf measurement
export INPUT_SEQ_LEN=5000            # input sequence length (in tokens)
export OUTPUT_SEQ_LEN=5000           # output sequence length (in tokens)
export IMAGE_WIDTH=1120              # width of images used in profiling
export IMAGE_HEIGHT=1120             # height of images used in profiling

genai-perf profile \
    -m meta/llama-3.2-11b-vision-instruct \
    --concurrency ${CONCURRENCY} \
    --tokenizer /workspace/tokenizer \
    --endpoint v1/chat/completions \
    --endpoint-type vision \
    --service-kind openai \
    --streaming -u http://127.0.0.1:8000 \
    --num-prompts 100 \
    --measurement-interval ${MEASUREMENT_INTERVAL} \
    --synthetic-input-tokens-mean ${INPUT_SEQ_LEN} \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean ${OUTPUT_SEQ_LEN} \
    --extra-inputs max_tokens:${OUTPUT_SEQ_LEN} \
    --extra-inputs min_tokens:${OUTPUT_SEQ_LEN} \
    --extra-inputs ignore_eos:true \
    --artifact-dir tmp/ \
    -v \
    --image-width-mean ${IMAGE_WIDTH} \
    --image-width-stddev 0 \
    --image-height-mean ${IMAGE_HEIGHT} \
    --image-height-stddev 0 \
    --image-format png \
    -- --max-threads ${CONCURRENCY}
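
If the run fails immediately, first verify that the NIM endpoint is reachable. A minimal check, assuming the NIM container is running locally on port 8000 and exposes the standard NIM health endpoint:

curl -s http://127.0.0.1:8000/v1/health/ready   # reports ready once the model is loaded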

Refer to the genai-perf documentation for the full list of options.
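
In addition to the summary printed to the console, genai-perf writes per-run artifacts, including CSV and JSON metric exports, under the directory passed to --artifact-dir. A quick way to inspect them (the exact file names and subdirectory layout are assumptions that can vary between releases):

ls -R tmp/                                # list everything the run produced
cat tmp/*/profile_export_genai_perf.csv   # assumed export name; adjust to what ls shows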

Benchmarking Results

ISL: Text input sequence length; OSL: Text output sequence length; TTFT: Time to first token; ITL: Inter-token latency
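
For streaming runs these metrics are related approximately by Avg Request Latency ≈ Avg TTFT + (OSL − 1) × Avg ITL, which is a useful sanity check when reading the tables. For example, for the first row below, 198 + 999 × 7.4 ≈ 7591 msec, close to the reported 7598 msec average request latency.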

Min Latency numbers are captured with concurrency 1 (single stream). Max Throughput numbers are measured at the maximum concurrency that saturates throughput.
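
To find that saturation point for your own deployment, one approach is to sweep the concurrency and stop increasing it once request throughput plateaus. A minimal sketch reusing the invocation above (flag values are assumptions carried over from the earlier example):

for CONCURRENCY in 1 2 4 8 16 32; do
    genai-perf profile \
        -m meta/llama-3.2-11b-vision-instruct \
        --concurrency ${CONCURRENCY} \
        --tokenizer /workspace/tokenizer \
        --endpoint v1/chat/completions \
        --endpoint-type vision \
        --service-kind openai \
        --streaming -u http://127.0.0.1:8000 \
        --num-prompts 100 \
        --measurement-interval ${MEASUREMENT_INTERVAL} \
        --synthetic-input-tokens-mean ${INPUT_SEQ_LEN} \
        --synthetic-input-tokens-stddev 0 \
        --output-tokens-mean ${OUTPUT_SEQ_LEN} \
        --extra-inputs max_tokens:${OUTPUT_SEQ_LEN} \
        --extra-inputs min_tokens:${OUTPUT_SEQ_LEN} \
        --extra-inputs ignore_eos:true \
        --artifact-dir tmp/concurrency_${CONCURRENCY}/ \
        -v \
        --image-width-mean ${IMAGE_WIDTH} \
        --image-width-stddev 0 \
        --image-height-mean ${IMAGE_HEIGHT} \
        --image-height-stddev 0 \
        --image-format png \
        -- --max-threads ${CONCURRENCY}
done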

Image size: 1120x1120px

Min Latency:

| ISL  | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|------|------|-----------------|----------------|----------------------------|
| 1000 | 1000 | 198             | 7.4            | 7598                       |

Max Throughput:

| ISL  | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|------|------|----------------|-------------------------------|
| 1000 | 1000 | 29.3           | 2.72                          |

Min Latency:

| ISL  | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|------|------|-----------------|----------------|----------------------------|
| 1000 | 1000 | 202             | 8.4            | 8602                       |

Max Throughput:

| ISL  | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|------|------|----------------|-------------------------------|
| 1000 | 1000 | 28.1           | 2.04                          |

Min Latency:

| ISL   | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|-------|------|-----------------|----------------|----------------------------|
| 1000  | 1000 | 367             | 13.7           | 14067                      |
| 20000 | 2000 | 2690            | 15.5           | 33690                      |

Max Throughput:

| ISL   | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|-------|------|----------------|-------------------------------|
| 1000  | 1000 | 55.1           | 1.04                          |
| 20000 | 2000 | 41.0           | 0.11                          |

Min Latency:

| ISL   | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|-------|------|-----------------|----------------|----------------------------|
| 1000  | 1000 | 531             | 9.7            | 10231                      |
| 20000 | 2000 | 1839            | 10.7           | 23239                      |
| 60000 | 2000 | 6830            | 12.4           | 31630                      |

Max Throughput:

| ISL   | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|-------|------|----------------|-------------------------------|
| 1000  | 1000 | 53.3           | 1.59                          |
| 20000 | 2000 | 55.1           | 0.24                          |
| 60000 | 2000 | 63.3           | 0.07                          |

Min Latency:

| ISL   | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|-------|------|-----------------|----------------|----------------------------|
| 60000 | 2000 | 7421            | 12.6           | 32621                      |

Max Throughput:

| ISL   | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|-------|------|----------------|-------------------------------|
| 60000 | 2000 | 82.7           | 0.06                          |

Min Latency:

| ISL   | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|-------|------|-----------------|----------------|----------------------------|
| 1000  | 1000 | 1142            | 14.9           | 16042                      |
| 20000 | 2000 | 4626            | 16.5           | 37626                      |

Max Throughput:

| ISL   | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|-------|------|----------------|-------------------------------|
| 1000  | 1000 | 24.3           | 0.32                          |
| 20000 | 2000 | 44.7           | 0.08                          |

Min Latency:

| ISL   | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|-------|------|-----------------|----------------|----------------------------|
| 1000  | 1000 | 531             | 17.4           | 17931                      |
| 20000 | 2000 | 2119            | 18.2           | 38519                      |
| 60000 | 2000 | 6347            | 19.0           | 44347                      |

Max Throughput:

| ISL   | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|-------|------|----------------|-------------------------------|
| 1000  | 1000 | 48.3           | 1.17                          |
| 60000 | 2000 | 48.5           | 0.08                          |

Min Latency:

| ISL   | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|-------|------|-----------------|----------------|----------------------------|
| 60000 | 2000 | 8404            | 25.2           | 58804                      |

Max Throughput:

| ISL   | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|-------|------|----------------|-------------------------------|
| 60000 | 2000 | 32.0           | 0.03                          |

Min Latency:

| ISL   | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|-------|------|-----------------|----------------|----------------------------|
| 1000  | 1000 | 1060            | 36.6           | 37660                      |
| 20000 | 2000 | 7120            | 38.0           | 83120                      |
| 60000 | 2000 | 26692           | 41.1           | 108892                     |

Max Throughput:

| ISL   | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|-------|------|----------------|-------------------------------|
| 1000  | 1000 | 77.5           | 0.39                          |
| 20000 | 2000 | 76.4           | 0.05                          |
| 60000 | 2000 | 75.5           | 0.02                          |

Min Latency:

| ISL   | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|-------|------|-----------------|----------------|----------------------------|
| 1000  | 1000 | 1121            | 26.9           | 28021                      |
| 20000 | 2000 | 5081            | 27.5           | 60081                      |
| 60000 | 2000 | 17640           | 29.4           | 76440                      |

Max Throughput:

| ISL   | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|-------|------|----------------|-------------------------------|
| 1000  | 1000 | 94.6           | 0.59                          |
| 20000 | 2000 | 102.6          | 0.11                          |
| 60000 | 2000 | 114.8          | 0.04                          |

Min Latency:

| ISL  | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|------|------|-----------------|----------------|----------------------------|
| 1000 | 1000 | 1102            | 40.4           | 41502                      |

Max Throughput:

| ISL  | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|------|------|----------------|-------------------------------|
| 1000 | 1000 | 66.7           | 0.12                          |

Not all performance numbers are available above. Additional data will be added to this page over time.