Performance

Important

A comprehensive guide to NIM VLM performance benchmarking is under development. In the meantime, developers are encouraged to consult the NIM for LLM Benchmarking Guide.

Performance Profiling

You can use the genai-perf tool from perf_analyzer to benchmark the performance of VLMs under simulated production load.

Important

Update the model name to match your deployment. The following examples use the meta/llama-3.2-11b-vision-instruct model.

Run the commands below to benchmark performance with the genai-perf command-line tool. Refer to the genai-perf quick start guide for an introduction to the tool.

First, run the container with genai-perf installed:

export TOKENIZER_PATH=...    # this is where tokenizer.json for the model is located

docker run -it --net=host --gpus=all -v ${TOKENIZER_PATH}:/workspace/tokenizer nvcr.io/nvidia/tritonserver:24.10-py3-sdk

Important

We recommend setting TOKENIZER_PATH to the model’s checkpoint directory pulled from HuggingFace (for meta/llama-3.2-11b-vision-instruct, the meta-llama/Llama-3.2-11B-Vision-Instruct repository). The directory must include the model’s config.json and all files relevant to the tokenizer (i.e., tokenizer.json, tokenizer_config.json, and special_tokens_map.json).
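
One way to populate TOKENIZER_PATH is with the HuggingFace CLI. The following is a minimal sketch, assuming you have been granted access to the gated meta-llama/Llama-3.2-11B-Vision-Instruct repository and have a valid access token; the file list mirrors the recommendation above:

pip install -U "huggingface_hub[cli]"   # provides the huggingface-cli entry point
huggingface-cli login                   # or export HF_TOKEN=<your token>
huggingface-cli download meta-llama/Llama-3.2-11B-Vision-Instruct \
    config.json tokenizer.json tokenizer_config.json special_tokens_map.json \
    --local-dir ${TOKENIZER_PATH}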

Important

The following command has been tested using the genai-perf 24.10 release.

export CONCURRENCY=8                 # number of requests sent concurrently
export MEASUREMENT_INTERVAL=300000   # time window (in milliseconds) used for each genai-perf measurement
export INPUT_SEQ_LEN=5000            # input sequence length (in tokens)
export OUTPUT_SEQ_LEN=5000           # output sequence length (in tokens)
export IMAGE_WIDTH=1120              # width of images used in profiling
export IMAGE_HEIGHT=1120             # height of images used in profiling

genai-perf profile \
    -m meta/llama-3.2-11b-vision-instruct \
    --concurrency ${CONCURRENCY} \
    --tokenizer /workspace/tokenizer \
    --endpoint v1/chat/completions \
    --endpoint-type vision \
    --service-kind openai \
    --streaming -u http://127.0.0.1:8000 \
    --num-prompts 100 \
    --measurement-interval ${MEASUREMENT_INTERVAL} \
    --synthetic-input-tokens-mean ${INPUT_SEQ_LEN} \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean ${OUTPUT_SEQ_LEN} \
    --extra-inputs max_tokens:${OUTPUT_SEQ_LEN} \
    --extra-inputs min_tokens:${OUTPUT_SEQ_LEN} \
    --extra-inputs ignore_eos:true \
    --artifact-dir tmp/ \
    -v \
    --image-width-mean ${IMAGE_WIDTH} \
    --image-width-stddev 0 \
    --image-height-mean ${IMAGE_HEIGHT} \
    --image-height-stddev 0 \
    --image-format png \
    -- --max-threads ${CONCURRENCY}
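
If the run fails immediately, first verify that the NIM endpoint is reachable. A minimal check, assuming the NIM container is running locally on port 8000 and exposes the standard NIM health endpoint:

curl -s http://127.0.0.1:8000/v1/health/ready   # reports ready once the model is loaded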

Refer to the genai-perf documentation for the full list of options.
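
In addition to the summary printed to the console, genai-perf writes per-run artifacts, including CSV and JSON metric exports, under the directory passed to --artifact-dir. A quick way to inspect them (the exact file names and subdirectory layout are assumptions that can vary between releases):

ls -R tmp/                                # list everything the run produced
cat tmp/*/profile_export_genai_perf.csv   # assumed export name; adjust to what ls shows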

Benchmarking Results

ISL: Text input sequence length; OSL: Text output sequence length; TTFT: Time to first token; ITL: Inter-token latency
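
For streaming runs these metrics are related approximately by Avg Request Latency ≈ Avg TTFT + (OSL − 1) × Avg ITL, which is a useful sanity check when reading the tables. For example, for the first row below, 198 + 999 × 7.4 ≈ 7591 msec, close to the reported 7598 msec average request latency.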

Min Latency numbers are captured with concurrency 1 (single stream). Max Throughput numbers are measured at the maximum concurrency that saturates throughput.
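
To find that saturation point for your own deployment, one approach is to sweep the concurrency and stop increasing it once request throughput plateaus. A minimal sketch reusing the invocation above (flag values are assumptions carried over from the earlier example):

for CONCURRENCY in 1 2 4 8 16 32; do
    genai-perf profile \
        -m meta/llama-3.2-11b-vision-instruct \
        --concurrency ${CONCURRENCY} \
        --tokenizer /workspace/tokenizer \
        --endpoint v1/chat/completions \
        --endpoint-type vision \
        --service-kind openai \
        --streaming -u http://127.0.0.1:8000 \
        --num-prompts 100 \
        --measurement-interval ${MEASUREMENT_INTERVAL} \
        --synthetic-input-tokens-mean ${INPUT_SEQ_LEN} \
        --synthetic-input-tokens-stddev 0 \
        --output-tokens-mean ${OUTPUT_SEQ_LEN} \
        --extra-inputs max_tokens:${OUTPUT_SEQ_LEN} \
        --extra-inputs min_tokens:${OUTPUT_SEQ_LEN} \
        --extra-inputs ignore_eos:true \
        --artifact-dir tmp/concurrency_${CONCURRENCY}/ \
        -v \
        --image-width-mean ${IMAGE_WIDTH} \
        --image-width-stddev 0 \
        --image-height-mean ${IMAGE_HEIGHT} \
        --image-height-stddev 0 \
        --image-format png \
        -- --max-threads ${CONCURRENCY}
done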

Image size: 1120x1120px

Min Latency:

| ISL  | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|------|------|-----------------|----------------|----------------------------|
| 1000 | 1000 | 198             | 7.4            | 7598                       |

Max Throughput:

| ISL  | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|------|------|----------------|-------------------------------|
| 1000 | 1000 | 29.3           | 2.72                          |

Min Latency:

| ISL  | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|------|------|-----------------|----------------|----------------------------|
| 1000 | 1000 | 202             | 8.4            | 8602                       |

Max Throughput:

| ISL  | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|------|------|----------------|-------------------------------|
| 1000 | 1000 | 28.1           | 2.04                          |

Min Latency:

| ISL   | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|-------|------|-----------------|----------------|----------------------------|
| 1000  | 1000 | 367             | 13.7           | 14067                      |
| 20000 | 2000 | 2690            | 15.5           | 33690                      |

Max Throughput:

| ISL   | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|-------|------|----------------|-------------------------------|
| 1000  | 1000 | 55.1           | 1.04                          |
| 20000 | 2000 | 41.0           | 0.11                          |

Min Latency:

| ISL   | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|-------|------|-----------------|----------------|----------------------------|
| 1000  | 1000 | 531             | 9.7            | 10231                      |
| 20000 | 2000 | 1839            | 10.7           | 23239                      |
| 60000 | 2000 | 6830            | 12.4           | 31630                      |

Max Throughput:

| ISL   | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|-------|------|----------------|-------------------------------|
| 1000  | 1000 | 53.3           | 1.59                          |
| 20000 | 2000 | 55.1           | 0.24                          |
| 60000 | 2000 | 63.3           | 0.07                          |

Min Latency:

| ISL   | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|-------|------|-----------------|----------------|----------------------------|
| 60000 | 2000 | 7421            | 12.6           | 32621                      |

Max Throughput:

| ISL   | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|-------|------|----------------|-------------------------------|
| 60000 | 2000 | 82.7           | 0.06                          |

Min Latency:

| ISL   | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|-------|------|-----------------|----------------|----------------------------|
| 1000  | 1000 | 1142            | 14.9           | 16042                      |
| 20000 | 2000 | 4626            | 16.5           | 37626                      |

Max Throughput:

| ISL   | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|-------|------|----------------|-------------------------------|
| 1000  | 1000 | 24.3           | 0.32                          |
| 20000 | 2000 | 44.7           | 0.08                          |

Min Latency:

| ISL   | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|-------|------|-----------------|----------------|----------------------------|
| 1000  | 1000 | 531             | 17.4           | 17931                      |
| 20000 | 2000 | 2119            | 18.2           | 38519                      |
| 60000 | 2000 | 6347            | 19.0           | 44347                      |

Max Throughput:

| ISL   | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|-------|------|----------------|-------------------------------|
| 1000  | 1000 | 48.3           | 1.17                          |
| 60000 | 2000 | 48.5           | 0.08                          |

Min Latency:

| ISL   | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|-------|------|-----------------|----------------|----------------------------|
| 60000 | 2000 | 8404            | 25.2           | 58804                      |

Max Throughput:

| ISL   | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|-------|------|----------------|-------------------------------|
| 60000 | 2000 | 32.0           | 0.03                          |

Min Latency:

| ISL   | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|-------|------|-----------------|----------------|----------------------------|
| 1000  | 1000 | 1060            | 36.6           | 37660                      |
| 20000 | 2000 | 7120            | 38.0           | 83120                      |
| 60000 | 2000 | 26692           | 41.1           | 108892                     |

Max Throughput:

| ISL   | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|-------|------|----------------|-------------------------------|
| 1000  | 1000 | 77.5           | 0.39                          |
| 20000 | 2000 | 76.4           | 0.05                          |
| 60000 | 2000 | 75.5           | 0.02                          |

Min Latency:

| ISL   | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|-------|------|-----------------|----------------|----------------------------|
| 1000  | 1000 | 1121            | 26.9           | 28021                      |
| 20000 | 2000 | 5081            | 27.5           | 60081                      |
| 60000 | 2000 | 17640           | 29.4           | 76440                      |

Max Throughput:

| ISL   | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|-------|------|----------------|-------------------------------|
| 1000  | 1000 | 94.6           | 0.59                          |
| 20000 | 2000 | 102.6          | 0.11                          |
| 60000 | 2000 | 114.8          | 0.04                          |

Min Latency:

| ISL  | OSL  | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|------|------|-----------------|----------------|----------------------------|
| 1000 | 1000 | 1102            | 40.4           | 41502                      |

Max Throughput:

| ISL  | OSL  | Avg ITL (msec) | Request Throughput (reqs/sec) |
|------|------|----------------|-------------------------------|
| 1000 | 1000 | 66.7           | 0.12                          |

Not all performance numbers are available above. Additional data will be added to this page over time.