Performance

Important

A comprehensive guide to NIM VLM performance benchmarking is under development. In the meantime, developers are encouraged to consult the NIM for LLM Benchmarking Guide.

Performance Profiling

You can use the genai-perf tool from perf_analyzer to benchmark the performance of VLMs under simulated production load. A quick start guide for genai-perf can be found here.

First, prepare a directory for the model’s tokenizer. genai-perf uses it to calculate statistics related to output sequence length.

Important

We recommend setting TOKENIZER_PATH to the model’s checkpoint directory pulled from Hugging Face (e.g., this directory for meta/llama-3.2-11b-vision-instruct). For models that are not available on Hugging Face, use the Create Model Store utility; an example for nemoretriever-parse is given below. The directory must include the model’s config.json file and all files relevant to the tokenizer (i.e., tokenizer.json, tokenizer_config.json, and special_tokens_map.json).
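Before starting a benchmark, it can help to confirm that the tokenizer directory is complete. The following is a minimal sketch, not part of the NIM or genai-perf tooling: check_tokenizer_dir is a hypothetical helper name, and the file list is taken from the note above.

```shell
# Hypothetical helper: report which of the required files are missing
# from a tokenizer directory. Returns 0 only if all files are present.
check_tokenizer_dir() {
    local dir="$1" missing=0 f
    for f in config.json tokenizer.json tokenizer_config.json special_tokens_map.json; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, running check_tokenizer_dir "$TOKENIZER_PATH" prints one line per missing file and returns a non-zero status if any of them is absent.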

See the following example on how to use the Create Model Store utility to set up a nemoretriever-parse tokenizer for genai-perf:

export IMG_NAME=nvcr.io/nim/nvidia/nemoretriever-parse:latest  # image of the container under test
export TOKENIZER_PATH=~/.cache/nim/tokenizer                   # directory where the tokenizer files will be stored
mkdir -p "$TOKENIZER_PATH"
# pick one of the available profiles
PROFILE=$(docker run -it --rm \
    -e NGC_API_KEY \
    $IMG_NAME \
    list-model-profiles | grep -oE '[0-9a-f]{64}' | tail -1)
# create model store
docker run -it --rm \
    -e NGC_API_KEY \
    -v "$TOKENIZER_PATH:/opt/nim/.cache" \
    -u $(id -u) \
    $IMG_NAME \
    create-model-store -m /opt/nim/.cache -p $PROFILE

Now, run the tritonserver container:

export TOKENIZER_PATH=...    # this is where tokenizer.json for the model is located
docker run -it --net=host -v ${TOKENIZER_PATH}:/workspace/tokenizer  nvcr.io/nvidia/tritonserver:24.10-py3-sdk

In the container, install the version of genai-perf supporting all the containers we offer, and launch the benchmark:

Important

The following command has been tested using the 6f7c328c27dafbde62207456fb7f8366e422ee76 version of genai-perf.

Important

For the nemoretriever-parse model, OUTPUT_SEQ_LEN cannot be greater than 3579.
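A guard along the following lines could be placed ahead of the profile command to catch an out-of-range OUTPUT_SEQ_LEN before any requests are sent. This is only a sketch: check_output_seq_len and NEMORETRIEVER_PARSE_MAX_OSL are hypothetical names, and the limit is the one stated in the note above.

```shell
# Limit stated above for nemoretriever-parse (in tokens).
NEMORETRIEVER_PARSE_MAX_OSL=3579

# Hypothetical helper: fail if the requested output length exceeds the
# limit for nemoretriever-parse; other models are not checked.
check_output_seq_len() {
    local model="$1" osl="$2"
    if [ "$model" = "nvidia/nemoretriever-parse" ] && [ "$osl" -gt "$NEMORETRIEVER_PARSE_MAX_OSL" ]; then
        echo "OUTPUT_SEQ_LEN=$osl exceeds $NEMORETRIEVER_PARSE_MAX_OSL for $model" >&2
        return 1
    fi
    return 0
}
```

It would be called as check_output_seq_len "$MODEL" "$OUTPUT_SEQ_LEN" once both variables are set.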

git clone -b main --single-branch https://github.com/triton-inference-server/perf_analyzer.git /perf_analyzer
cd /perf_analyzer
git checkout 6f7c328c27dafbde62207456fb7f8366e422ee76
pip uninstall -y genai-perf
pip install /perf_analyzer/genai-perf

export MODEL=meta/llama-3.2-11b-vision-instruct # the name of the model under test
export CONCURRENCY=8                            # number of requests sent concurrently
export MEASUREMENT_INTERVAL=300000              # max time window (in milliseconds) for genai-perf to wait for a response
export INPUT_SEQ_LEN=2000                       # input sequence length (in tokens) - ignored for nemoretriever-parse
export OUTPUT_SEQ_LEN=2000                      # output sequence length (in tokens)
export IMAGE_WIDTH=2048                         # width of images used in profiling
export IMAGE_HEIGHT=1648                        # height of images used in profiling
export STREAMING=--streaming                    # stream responses; cleared below for nemoretriever-parse

if [[ "$MODEL" == nvidia/nemoretriever-parse ]];
then
    export INPUT_SEQ_LEN=0
    export STREAMING=''
    export NEMORETRIEVER_PARSE_EXTRA=( "--extra-inputs" '{"tools": [{"type": "function", "function": {"name": "evaluation_markdown_bbox"}}]}' )
fi

genai-perf profile \
    -m ${MODEL} \
    --concurrency ${CONCURRENCY} \
    --tokenizer /workspace/tokenizer \
    --endpoint v1/chat/completions \
    --endpoint-type vision \
    --service-kind openai \
    ${STREAMING} -u http://127.0.0.1:8000 \
    --num-prompts 100 \
    --measurement-interval ${MEASUREMENT_INTERVAL} \
    --synthetic-input-tokens-mean ${INPUT_SEQ_LEN} \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean ${OUTPUT_SEQ_LEN} \
    --extra-inputs max_tokens:${OUTPUT_SEQ_LEN} \
    --extra-inputs min_tokens:${OUTPUT_SEQ_LEN} \
    --extra-inputs ignore_eos:true \
    "${NEMORETRIEVER_PARSE_EXTRA[@]}" \
    --artifact-dir tmp/ \
    -v \
    --image-width-mean ${IMAGE_WIDTH} \
    --image-width-stddev 0 \
    --image-height-mean ${IMAGE_HEIGHT} \
    --image-height-stddev 0 \
    --image-format png \
    -- --max-threads ${CONCURRENCY}

The full documentation for genai-perf can be found here.
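To see how latency and throughput change with load, the profile command above can be repeated at several concurrency levels. The sketch below shows one way to drive such a sweep. Both function names are hypothetical, and the benchmark command passed to sweep_concurrency is assumed to be a wrapper (for example, a script containing the genai-perf profile command above) that reads CONCURRENCY from its environment.

```shell
# Hypothetical helper: print concurrency levels 1, 2, 4, ... up to a cap.
concurrency_levels() {
    local max="$1" c=1
    while [ "$c" -le "$max" ]; do
        echo "$c"
        c=$((c * 2))
    done
}

# Hypothetical helper: run the given command once per concurrency level,
# with CONCURRENCY set in the command's environment. Stops at the first
# failing run.
sweep_concurrency() {
    local max="$1" c
    shift
    for c in $(concurrency_levels "$max"); do
        CONCURRENCY="$c" "$@" || return 1
    done
}
```

For example, sweep_concurrency 64 ./run_genai_perf.sh would run the assumed wrapper script at concurrency 1, 2, 4, and so on up to 64. It may be worth giving each run its own artifact directory so that results from one level are not overwritten by the next.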

Benchmarking Results

ISL: Text input sequence length; OSL: Text output sequence length; TTFT: Time to first token; ITL: Inter-token latency

Min Latency numbers are captured with concurrency 1 (single stream). Max Throughput numbers are measured with the maximum concurrency saturating the throughput.
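When reading the tables, two relations are useful. The Min Latency rows in this section are consistent with Avg Request Latency = Avg TTFT + Avg ITL x OSL; for example, 198 + 7.4 x 1000 = 7598 msec. For the Max Throughput rows, multiplying the request throughput by OSL gives an estimate of output tokens per second, assuming every request generates exactly OSL tokens. The helpers below are a small sketch of both calculations; the function names are hypothetical, and awk is used because shell arithmetic is integer-only.

```shell
# Estimated request latency in msec: TTFT + ITL * OSL.
estimate_request_latency_ms() {
    awk -v ttft="$1" -v itl="$2" -v osl="$3" \
        'BEGIN { printf "%.0f\n", ttft + itl * osl }'
}

# Estimated output token throughput in tokens/sec: request throughput * OSL.
output_tokens_per_sec() {
    awk -v rps="$1" -v osl="$2" \
        'BEGIN { printf "%.0f\n", rps * osl }'
}

estimate_request_latency_ms 198 7.4 1000   # prints 7598
output_tokens_per_sec 2.72 1000            # prints 2720
```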

Llama-3.2-11B-Vision-Instruct

H100-HBM3-80GB

FP8

TP1

Min Latency

ISL	OSL	Image Size (width x height in px)	Avg TTFT (msec)	Avg ITL (msec)	Avg Request Latency (msec)
1000	1000	1120x1120	198	7.4	7598

Max Throughput

ISL	OSL	Image Size (width x height in px)	Avg ITL (msec)	Request Throughput (reqs/sec)
1000	1000	1120x1120	29.3	2.72

BF16

TP1

Min Latency

ISL	OSL	Image Size (width x height in px)	Avg TTFT (msec)	Avg ITL (msec)	Avg Request Latency (msec)
1000	1000	1120x1120	202	8.4	8602

Max Throughput

ISL	OSL	Image Size (width x height in px)	Avg ITL (msec)	Request Throughput (reqs/sec)
1000	1000	1120x1120	28.1	2.04

A100-SXM4-80GB

BF16

TP1

Min Latency

ISL	OSL	Image Size (width x height in px)	Avg TTFT (msec)	Avg ITL (msec)	Avg Request Latency (msec)
1000	1000	1120x1120	367	13.7	14067
20000	2000	1120x1120	2690	15.5	33690

Max Throughput

ISL	OSL	Image Size (width x height in px)	Avg ITL (msec)	Request Throughput (reqs/sec)
1000	1000	1120x1120	55.1	1.04
20000	2000	1120x1120	41.0	0.11

TP2

Min Latency

ISL	OSL	Image Size (width x height in px)	Avg TTFT (msec)	Avg ITL (msec)	Avg Request Latency (msec)
1000	1000	1120x1120	531	9.7	10231
20000	2000	1120x1120	1839	10.7	23239
60000	2000	1120x1120	6830	12.4	31630

Max Throughput

ISL	OSL	Image Size (width x height in px)	Avg ITL (msec)	Request Throughput (reqs/sec)
1000	1000	1120x1120	53.3	1.59
20000	2000	1120x1120	55.1	0.24
60000	2000	1120x1120	63.3	0.07

L40S

BF16

TP4

Min Latency

ISL	OSL	Image Size (width x height in px)	Avg TTFT (msec)	Avg ITL (msec)	Avg Request Latency (msec)
60000	2000	1120x1120	7421	12.6	32621

Max Throughput

ISL	OSL	Image Size (width x height in px)	Avg ITL (msec)	Request Throughput (reqs/sec)
60000	2000	1120x1120	82.7	0.06

A10G

BF16

TP4

Min Latency

ISL	OSL	Image Size (width x height in px)	Avg TTFT (msec)	Avg ITL (msec)	Avg Request Latency (msec)
1000	1000	1120x1120	1142	14.9	16042
20000	2000	1120x1120	4626	16.5	37626

Max Throughput

ISL	OSL	Image Size (width x height in px)	Avg ITL (msec)	Request Throughput (reqs/sec)
1000	1000	1120x1120	24.3	0.32
20000	2000	1120x1120	44.7	0.08

Llama-3.2-90B-Vision-Instruct

H100-HBM3-80GB

FP8

TP4

Min Latency

ISL	OSL	Image Size (width x height in px)	Avg TTFT (msec)	Avg ITL (msec)	Avg Request Latency (msec)
1000	1000	1120x1120	531	17.4	17931
20000	2000	1120x1120	2119	18.2	38519
60000	2000	1120x1120	6347	19.0	44347

Max Throughput

ISL	OSL	Image Size (width x height in px)	Avg ITL (msec)	Request Throughput (reqs/sec)
1000	1000	1120x1120	48.3	1.17
60000	2000	1120x1120	48.5	0.08

BF16

TP4

Min Latency

ISL	OSL	Image Size (width x height in px)	Avg TTFT (msec)	Avg ITL (msec)	Avg Request Latency (msec)
60000	2000	1120x1120	8404	25.2	58804

Max Throughput

ISL	OSL	Image Size (width x height in px)	Avg ITL (msec)	Request Throughput (reqs/sec)
60000	2000	1120x1120	32.0	0.03

A100-SXM4-80GB

BF16

TP4

Min Latency

ISL	OSL	Image Size (width x height in px)	Avg TTFT (msec)	Avg ITL (msec)	Avg Request Latency (msec)
1000	1000	1120x1120	1060	36.6	37660
20000	2000	1120x1120	7120	38.0	83120
60000	2000	1120x1120	26692	41.1	108892

Max Throughput

ISL	OSL	Image Size (width x height in px)	Avg ITL (msec)	Request Throughput (reqs/sec)
1000	1000	1120x1120	77.5	0.39
20000	2000	1120x1120	76.4	0.05
60000	2000	1120x1120	75.5	0.02

TP8

Min Latency

ISL	OSL	Image Size (width x height in px)	Avg TTFT (msec)	Avg ITL (msec)	Avg Request Latency (msec)
1000	1000	1120x1120	1121	26.9	28021
20000	2000	1120x1120	5081	27.5	60081
60000	2000	1120x1120	17640	29.4	76440

Max Throughput

ISL	OSL	Image Size (width x height in px)	Avg ITL (msec)	Request Throughput (reqs/sec)
1000	1000	1120x1120	94.6	0.59
20000	2000	1120x1120	102.6	0.11
60000	2000	1120x1120	114.8	0.04

L40S

BF16

TP8

Min Latency

ISL	OSL	Image Size (width x height in px)	Avg TTFT (msec)	Avg ITL (msec)	Avg Request Latency (msec)
1000	1000	1120x1120	1102	40.4	41502

Max Throughput

ISL	OSL	Image Size (width x height in px)	Avg ITL (msec)	Request Throughput (reqs/sec)
1000	1000	1120x1120	66.7	0.12

nemoretriever-parse

H100-SXM-80GB

BF16

TP1

Min Latency

OSL	Image Size (width x height in px)	Avg Request Latency (msec)
200	1648x2048	494
1000	1648x2048	1320
2000	1648x2048	2388
3579	1648x2048	4104

Max Throughput

OSL	Image Size (width x height in px)	Request Throughput (reqs/sec)
200	1648x2048	5.07
1000	1648x2048	3.75
2000	1648x2048	2.75
3579	1648x2048	1.80

A100-SXM-80GB

BF16

TP1

Min Latency

OSL	Image Size (width x height in px)	Avg Request Latency (msec)
200	1648x2048	828
1000	1648x2048	1872
2000	1648x2048	3403
3579	1648x2048	5852

Max Throughput

OSL	Image Size (width x height in px)	Request Throughput (reqs/sec)
200	1648x2048	1.54
1000	1648x2048	2.1
2000	1648x2048	1.6
3579	1648x2048	1.1

L40S

BF16

TP1

Min Latency

OSL	Image Size (width x height in px)	Avg Request Latency (msec)
200	1648x2048	710
1000	1648x2048	1822
2000	1648x2048	3244
3579	1648x2048	5617

Max Throughput

OSL	Image Size (width x height in px)	Request Throughput (reqs/sec)
200	1648x2048	2.6
1000	1648x2048	1.7
2000	1648x2048	1.1
3579	1648x2048	0.7

Not all performance numbers are available above. Additional data will be added to this page over time.