Performance

Important

A comprehensive guide to NIM VLM performance benchmarking is under development. In the meantime, developers are encouraged to consult the NIM for LLM Benchmarking Guide.

Performance Profiling

You can use the genai-perf tool from perf_analyzer to benchmark the performance of VLMs under simulated production load. A quick start guide for genai-perf can be found here.

First, prepare a directory for the model’s tokenizer. genai-perf uses it to calculate statistics related to output sequence length.

Important

We recommend setting TOKENIZER_PATH to the model’s checkpoint directory pulled from Hugging Face (e.g., this directory for meta/llama-3.2-11b-vision-instruct). For models not available on Hugging Face, use the Create Model Store utility; an example for nemoretriever-parse is given below. The directory must include the model’s config.json file and all files relevant to the tokenizer (i.e., tokenizer.json, tokenizer_config.json, and special_tokens_map.json).
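Before starting a benchmark, it can help to confirm that the tokenizer directory actually contains the files listed above. The following is a minimal sketch; `check_tokenizer_dir` is a hypothetical helper name, not part of NIM or genai-perf, and it assumes the four files sit directly in the directory passed to it.

```shell
# Sketch: verify that a tokenizer directory contains the files genai-perf needs.
# check_tokenizer_dir is a hypothetical helper, not a NIM or genai-perf command.
check_tokenizer_dir() {
    local dir="$1" missing=0 f
    for f in config.json tokenizer.json tokenizer_config.json special_tokens_map.json; do
        if [ ! -f "$dir/$f" ]; then
            # Report every missing file instead of stopping at the first one
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_tokenizer_dir "$TOKENIZER_PATH" && echo "tokenizer directory looks complete"
```

The function returns a non-zero status if any file is absent, so it can gate the rest of a benchmarking script.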

The following example shows how to use the Create Model Store utility to set up a nemoretriever-parse tokenizer for genai-perf:

export IMG_NAME=nvcr.io/nim/nvidia/nemoretriever-parse:latest  # image of the container under test
export TOKENIZER_PATH=~/.cache/nim/tokenizer                   # directory where the tokenizer files will be stored
mkdir -p "$TOKENIZER_PATH"
# get one of the available profiles
PROFILE=$(docker run -it --rm \
    -e NGC_API_KEY \
    $IMG_NAME \
    list-model-profiles | grep -oE '[0-9a-f]{64}' | tail -1)
# create model store
docker run -it --rm \
    -e NGC_API_KEY \
    -v "$TOKENIZER_PATH:/opt/nim/.cache" \
    -u $(id -u) \
    $IMG_NAME \
    create-model-store -m /opt/nim/.cache -p $PROFILE

Now, run the tritonserver container:

export TOKENIZER_PATH=...    # this is where tokenizer.json for the model is located
docker run -it --net=host -v "${TOKENIZER_PATH}:/workspace/tokenizer" nvcr.io/nvidia/tritonserver:24.10-py3-sdk

In the container, install the version of genai-perf supporting all the containers we offer, and launch the benchmark:

Important

The following command has been tested using the 6f7c328c27dafbde62207456fb7f8366e422ee76 version of genai-perf.

Important

For the nemoretriever-parse model, OUTPUT_SEQ_LEN cannot be greater than 3579.
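The limit above can be checked before a long benchmark run is launched. The following is a minimal sketch; `check_output_seq_len` is a hypothetical helper name, and the only values taken from this page are the model name `nvidia/nemoretriever-parse` and the limit of 3579.

```shell
# Sketch: fail fast if OUTPUT_SEQ_LEN exceeds the nemoretriever-parse limit.
# check_output_seq_len is a hypothetical helper; 3579 is the limit stated above.
check_output_seq_len() {
    local model="$1" osl="$2" max_osl=3579
    if [ "$model" = "nvidia/nemoretriever-parse" ] && [ "$osl" -gt "$max_osl" ]; then
        echo "OUTPUT_SEQ_LEN=$osl exceeds the limit of $max_osl for $model" >&2
        return 1
    fi
    return 0
}

# Example, after MODEL and OUTPUT_SEQ_LEN are exported:
#   check_output_seq_len "$MODEL" "$OUTPUT_SEQ_LEN" || exit 1
```

Other models pass the check unchanged, because this page states the limit only for nemoretriever-parse.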

git clone -b main --single-branch https://github.com/triton-inference-server/perf_analyzer.git /perf_analyzer
cd /perf_analyzer
git checkout 6f7c328c27dafbde62207456fb7f8366e422ee76
pip uninstall -y genai-perf
pip install /perf_analyzer/genai-perf

export MODEL=meta/llama-3.2-11b-vision-instruct # the name of the model under test
export CONCURRENCY=8                            # number of requests sent concurrently
export MEASUREMENT_INTERVAL=300000              # max time window (in milliseconds) for genai-perf to wait for a response
export INPUT_SEQ_LEN=2000                       # input sequence length (in the number of tokens) - ignored for nemoretriever-parse
export OUTPUT_SEQ_LEN=2000                      # output sequence length (in the number of tokens)
export IMAGE_WIDTH=2048                         # width of images used in profiling
export IMAGE_HEIGHT=1648                        # height of images used in profiling
export STREAMING=--streaming                    # stream responses (cleared below for nemoretriever-parse)

if [[ "$MODEL" == nvidia/nemoretriever-parse ]];
then
    export INPUT_SEQ_LEN=0
    export STREAMING=''
    export NEMORETRIEVER_PARSE_EXTRA=( "--extra-inputs" '{"tools": [{"type": "function", "function": {"name": "evaluation_markdown_bbox"}}]}' )
fi

genai-perf profile \
    -m ${MODEL} \
    --concurrency ${CONCURRENCY} \
    --tokenizer /workspace/tokenizer \
    --endpoint v1/chat/completions \
    --endpoint-type vision \
    --service-kind openai \
    ${STREAMING} -u http://127.0.0.1:8000 \
    --num-prompts 100 \
    --measurement-interval ${MEASUREMENT_INTERVAL} \
    --synthetic-input-tokens-mean ${INPUT_SEQ_LEN} \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean ${OUTPUT_SEQ_LEN} \
    --extra-inputs max_tokens:${OUTPUT_SEQ_LEN} \
    --extra-inputs min_tokens:${OUTPUT_SEQ_LEN} \
    --extra-inputs ignore_eos:true \
    "${NEMORETRIEVER_PARSE_EXTRA[@]}" \
    --artifact-dir tmp/ \
    -v \
    --image-width-mean ${IMAGE_WIDTH} \
    --image-width-stddev 0 \
    --image-height-mean ${IMAGE_HEIGHT} \
    --image-height-stddev 0 \
    --image-format png \
    -- --max-threads ${CONCURRENCY}

The full documentation for genai-perf can be found here.

Benchmarking Results

ISL: Text input sequence length; OSL: Text output sequence length; TTFT: Time to first token; ITL: Inter-token latency
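The three latency metrics are related. In every table on this page that reports TTFT, ITL, and request latency together, the request latency equals TTFT + ITL × OSL. This relationship is inferred from the reported numbers rather than stated as a definition, so treat the sketch below as a sanity check; `request_latency_ms` is a hypothetical helper name.

```shell
# Sketch: estimate average request latency (msec) from TTFT, ITL, and OSL.
# request_latency_ms is a hypothetical helper; the formula TTFT + ITL * OSL
# is inferred from the tables on this page, not from a documented definition.
request_latency_ms() {
    awk -v ttft="$1" -v itl="$2" -v osl="$3" \
        'BEGIN { printf "%.0f\n", ttft + itl * osl }'
}

# Example using one row from the tables (TTFT 198 msec, ITL 7.4 msec, OSL 1000):
#   request_latency_ms 198 7.4 1000    # prints 7598
```

Because ITL is multiplied by the output length, decode time dominates for long outputs, while TTFT grows with ISL.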

Min Latency numbers are captured with concurrency 1 (single stream). Max Throughput numbers are measured at the concurrency level that saturates throughput.
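One way to locate the saturating concurrency is to repeat the benchmark at several concurrency levels and pick the level with the highest request throughput. The sketch below assumes the results have already been collected as "concurrency throughput" pairs, one per line; `best_concurrency` is a hypothetical helper name, and the input format is an assumption rather than a genai-perf output format.

```shell
# Sketch: pick the concurrency level with the highest request throughput.
# best_concurrency is a hypothetical helper. It reads "concurrency throughput"
# pairs from stdin; on ties it keeps the lowest concurrency seen first.
best_concurrency() {
    awk 'NF >= 2 && (!seen || $2 + 0 > best) { best = $2 + 0; conc = $1; seen = 1 }
         END { if (seen) print conc }'
}

# Example with made-up measurements (not real benchmark results):
#   printf '1 0.13\n8 0.90\n32 2.10\n64 2.10\n' | best_concurrency
```

Keeping the lowest concurrency on ties avoids reporting a level at which requests only queue longer without any gain in throughput.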

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 198 | 7.4 | 7598 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 29.3 | 2.72 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 202 | 8.4 | 8602 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 28.1 | 2.04 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 367 | 13.7 | 14067 |
| 20000 | 2000 | 1120x1120 | 2690 | 15.5 | 33690 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 55.1 | 1.04 |
| 20000 | 2000 | 1120x1120 | 41.0 | 0.11 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 531 | 9.7 | 10231 |
| 20000 | 2000 | 1120x1120 | 1839 | 10.7 | 23239 |
| 60000 | 2000 | 1120x1120 | 6830 | 12.4 | 31630 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 53.3 | 1.59 |
| 20000 | 2000 | 1120x1120 | 55.1 | 0.24 |
| 60000 | 2000 | 1120x1120 | 63.3 | 0.07 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 60000 | 2000 | 1120x1120 | 7421 | 12.6 | 32621 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 60000 | 2000 | 1120x1120 | 82.7 | 0.06 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 1142 | 14.9 | 16042 |
| 20000 | 2000 | 1120x1120 | 4626 | 16.5 | 37626 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 24.3 | 0.32 |
| 20000 | 2000 | 1120x1120 | 44.7 | 0.08 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 531 | 17.4 | 17931 |
| 20000 | 2000 | 1120x1120 | 2119 | 18.2 | 38519 |
| 60000 | 2000 | 1120x1120 | 6347 | 19.0 | 44347 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 48.3 | 1.17 |
| 60000 | 2000 | 1120x1120 | 48.5 | 0.08 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 60000 | 2000 | 1120x1120 | 8404 | 25.2 | 58804 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 60000 | 2000 | 1120x1120 | 32.0 | 0.03 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 1060 | 36.6 | 37660 |
| 20000 | 2000 | 1120x1120 | 7120 | 38.0 | 83120 |
| 60000 | 2000 | 1120x1120 | 26692 | 41.1 | 108892 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 77.5 | 0.39 |
| 20000 | 2000 | 1120x1120 | 76.4 | 0.05 |
| 60000 | 2000 | 1120x1120 | 75.5 | 0.02 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 1121 | 26.9 | 28021 |
| 20000 | 2000 | 1120x1120 | 5081 | 27.5 | 60081 |
| 60000 | 2000 | 1120x1120 | 17640 | 29.4 | 76440 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 94.6 | 0.59 |
| 20000 | 2000 | 1120x1120 | 102.6 | 0.11 |
| 60000 | 2000 | 1120x1120 | 114.8 | 0.04 |

| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) |
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 1102 | 40.4 | 41502 |

| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) |
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 66.7 | 0.12 |

| ISL | OSL | Image Size (width x height in px) | Avg Request Latency (msec) |
|---|---|---|---|
| 0 | 200 | 1648x2048 | 494 |
| 0 | 1000 | 1648x2048 | 1320 |
| 0 | 2000 | 1648x2048 | 2388 |
| 0 | 3579 | 1648x2048 | 4104 |

| ISL | OSL | Image Size (width x height in px) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 0 | 200 | 1648x2048 | 5.07 |
| 0 | 1000 | 1648x2048 | 3.75 |
| 0 | 2000 | 1648x2048 | 2.75 |
| 0 | 3579 | 1648x2048 | 1.80 |

| ISL | OSL | Image Size (width x height in px) | Avg Request Latency (msec) |
|---|---|---|---|
| 0 | 200 | 1648x2048 | 828 |
| 0 | 1000 | 1648x2048 | 1872 |
| 0 | 2000 | 1648x2048 | 3403 |
| 0 | 3579 | 1648x2048 | 5852 |

| ISL | OSL | Image Size (width x height in px) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 0 | 200 | 1648x2048 | 1.54 |
| 0 | 1000 | 1648x2048 | 2.1 |
| 0 | 2000 | 1648x2048 | 1.6 |
| 0 | 3579 | 1648x2048 | 1.1 |

| ISL | OSL | Image Size (width x height in px) | Avg Request Latency (msec) |
|---|---|---|---|
| 0 | 200 | 1648x2048 | 710 |
| 0 | 1000 | 1648x2048 | 1822 |
| 0 | 2000 | 1648x2048 | 3244 |
| 0 | 3579 | 1648x2048 | 5617 |

| ISL | OSL | Image Size (width x height in px) | Request Throughput (reqs/sec) |
|---|---|---|---|
| 0 | 200 | 1648x2048 | 2.6 |
| 0 | 1000 | 1648x2048 | 1.7 |
| 0 | 2000 | 1648x2048 | 1.1 |
| 0 | 3579 | 1648x2048 | 0.7 |

Not all performance numbers are available above. Additional data will be added to this page over time.