Benchmarking
Important
A comprehensive guide to NIM VLM performance benchmarking is under development. In the meantime, developers are encouraged to consult the NIM for LLM Benchmarking Guide.
Performance Profiling
You can use the genai-perf tool from perf_analyzer to benchmark the performance of VLMs under simulated production load. A quick start guide for genai-perf can be found here.
First, prepare a directory for the model’s tokenizer. genai-perf uses it to calculate statistics related to output sequence length.
Important
We recommend setting TOKENIZER_PATH to the model’s checkpoint directory pulled from HuggingFace (e.g., this directory for meta/llama-3.2-11b-vision-instruct).
For models not available in HuggingFace, please use the Create Model Store utility. You can find an example for nemoretriever-parse below.
The directory must include the model’s config.json file and all files relevant to the tokenizer (i.e., tokenizer.json, tokenizer_config.json, and special_tokens_map.json).
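Before mounting the directory into the benchmarking container, it can help to confirm that the files listed above are actually present. The following is a minimal sketch; `check_tokenizer_dir` is a hypothetical helper written for this page, not part of NIM or genai-perf, and it assumes the files sit directly in the top level of the directory.

```shell
# Hypothetical helper: report which of the required files are missing from
# a tokenizer directory. Returns 0 if all four files exist, 1 otherwise.
check_tokenizer_dir() {
    tok_dir="$1"
    tok_missing=0
    for tok_file in config.json tokenizer.json tokenizer_config.json special_tokens_map.json; do
        if [ ! -f "$tok_dir/$tok_file" ]; then
            # Print each missing file to stderr so the caller can see what to fix.
            echo "missing: $tok_dir/$tok_file" >&2
            tok_missing=1
        fi
    done
    return "$tok_missing"
}
```

For example, `check_tokenizer_dir "$TOKENIZER_PATH" && echo "tokenizer directory looks complete"` prints the confirmation only when all four files are found.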
See the following example on how to use the Create Model Store utility to set up a nemoretriever-parse tokenizer for genai-perf:
export IMG_NAME=nvcr.io/nim/nvidia/nemoretriever-parse:latest  # image of the container under test
export TOKENIZER_PATH=~/.cache/nim/tokenizer                                  # the place where the tokenizer files will be located
mkdir -p $TOKENIZER_PATH
# get one of the available profiles
PROFILE=$(docker run -it --rm \
    -e NGC_API_KEY \
    $IMG_NAME \
    list-model-profiles | grep -oE '[0-9a-f]{64}' | tail -1)
# create model store
docker run -it --rm \
    -e NGC_API_KEY \
    -v "$TOKENIZER_PATH:/opt/nim/.cache" \
    -u $(id -u) \
    $IMG_NAME \
    create-model-store -m /opt/nim/.cache -p $PROFILE
Now, run the tritonserver container:
export TOKENIZER_PATH=...    # this is where tokenizer.json for the model is located
docker run -it --net=host -v ${TOKENIZER_PATH}:/workspace/tokenizer  nvcr.io/nvidia/tritonserver:24.10-py3-sdk
In the container, install the version of genai-perf supporting all the containers we offer, and launch the benchmark:
Important
The following command has been tested using the 6f7c328c27dafbde62207456fb7f8366e422ee76 version of genai-perf.
Important
For the nemoretriever-parse model, OUTPUT_SEQ_LEN cannot be greater than 3579.
git clone -b main --single-branch https://github.com/triton-inference-server/perf_analyzer.git /perf_analyzer
cd /perf_analyzer
git checkout 6f7c328c27dafbde62207456fb7f8366e422ee76
pip uninstall -y genai-perf
pip install /perf_analyzer/genai-perf
export MODEL=meta/llama-3.2-11b-vision-instruct # the name of the model under test
export CONCURRENCY=8                            # number of requests sent concurrently
export MEASUREMENT_INTERVAL=300000              # max time window (in milliseconds) for genai-perf to wait for a response
export INPUT_SEQ_LEN=2000                       # input sequence length (in the number of tokens) - ignored for nemoretriever-parse
export OUTPUT_SEQ_LEN=2000                      # output sequence length (in the number of tokens)
export IMAGE_WIDTH=2048                         # width of images used in profiling
export IMAGE_HEIGHT=1648                        # height of images used in profiling
export STREAMING=--streaming                    # stream responses (cleared for nemoretriever-parse below)
if [[ "$MODEL" == nvidia/nemoretriever-parse ]];
then
    export INPUT_SEQ_LEN=0
    export STREAMING=''
    export NEMORETRIEVER_PARSE_EXTRA=( "--extra-inputs" '{"tools": [{"type": "function", "function": {"name": "evaluation_markdown_bbox"}}]}' )
fi
genai-perf profile \
    -m ${MODEL} \
    --concurrency ${CONCURRENCY} \
    --tokenizer /workspace/tokenizer \
    --endpoint v1/chat/completions \
    --endpoint-type vision \
    --service-kind openai \
    ${STREAMING} -u http://127.0.0.1:8000 \
    --num-prompts 100 \
    --measurement-interval ${MEASUREMENT_INTERVAL} \
    --synthetic-input-tokens-mean ${INPUT_SEQ_LEN} \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean ${OUTPUT_SEQ_LEN} \
    --extra-inputs max_tokens:${OUTPUT_SEQ_LEN} \
    --extra-inputs min_tokens:${OUTPUT_SEQ_LEN} \
    --extra-inputs ignore_eos:true \
    "${NEMORETRIEVER_PARSE_EXTRA[@]}" \
    --artifact-dir tmp/ \
    -v \
    --image-width-mean ${IMAGE_WIDTH} \
    --image-width-stddev 0 \
    --image-height-mean ${IMAGE_HEIGHT} \
    --image-height-stddev 0 \
    --image-format png \
    -- --max-threads ${CONCURRENCY}
The full documentation for genai-perf can be found here.
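To locate the concurrency at which throughput saturates, the profile command above can be repeated over several concurrency levels. The sketch below reuses only the flags and environment variables from that command; `sweep_concurrency`, the `GENAI_PERF` override, and the per-level artifact directory names are assumptions introduced here for illustration. The nemoretriever-parse extra inputs are omitted for brevity.

```shell
# Sketch: run the same genai-perf profile once per concurrency level.
# GENAI_PERF is a hypothetical override; set GENAI_PERF=echo to print the
# commands instead of running them (dry run).
sweep_concurrency() {
    for c in "$@"; do
        ${GENAI_PERF:-genai-perf} profile \
            -m "${MODEL}" \
            --concurrency "$c" \
            --tokenizer /workspace/tokenizer \
            --endpoint v1/chat/completions \
            --endpoint-type vision \
            --service-kind openai \
            ${STREAMING} -u http://127.0.0.1:8000 \
            --num-prompts 100 \
            --measurement-interval "${MEASUREMENT_INTERVAL}" \
            --synthetic-input-tokens-mean "${INPUT_SEQ_LEN}" \
            --synthetic-input-tokens-stddev 0 \
            --output-tokens-mean "${OUTPUT_SEQ_LEN}" \
            --extra-inputs "max_tokens:${OUTPUT_SEQ_LEN}" \
            --extra-inputs "min_tokens:${OUTPUT_SEQ_LEN}" \
            --extra-inputs ignore_eos:true \
            --artifact-dir "tmp/concurrency_${c}" \
            --image-width-mean "${IMAGE_WIDTH}" --image-width-stddev 0 \
            --image-height-mean "${IMAGE_HEIGHT}" --image-height-stddev 0 \
            --image-format png \
            -- --max-threads "$c"
    done
}
```

With the variables from the previous step exported, `sweep_concurrency 1 2 4 8 16` runs five profiles; comparing the request throughput reported for each level shows where adding concurrency stops helping.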
Benchmarking Results
ISL: Text input sequence length; OSL: Text output sequence length; TTFT: Time to first token; ITL: Inter-token latency
Min Latency numbers are captured with concurrency 1 (single stream). Max Throughput numbers are measured with the maximum concurrency saturating the throughput.
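As a reading aid, the single-stream rows reported on this page are consistent with the approximation request latency ≈ TTFT + ITL × OSL, and because the benchmark forces every response to exactly OSL tokens, output-token throughput can be estimated as request throughput × OSL. The sketch below works through both; the function names are hypothetical, and the relations are approximations inferred from the published numbers rather than definitions used by genai-perf.

```shell
# Approximate request latency (msec) from TTFT (msec), ITL (msec) and OSL (tokens).
estimate_latency_ms() {
    awk -v ttft="$1" -v itl="$2" -v osl="$3" \
        'BEGIN { printf "%.0f\n", ttft + itl * osl }'
}

# Approximate output-token throughput (tokens/sec) from request throughput
# (reqs/sec) and OSL (tokens), assuming every response has exactly OSL tokens.
output_tokens_per_sec() {
    awk -v rps="$1" -v osl="$2" \
        'BEGIN { printf "%.0f\n", rps * osl }'
}

estimate_latency_ms 198 7.4 1000      # prints 7598
estimate_latency_ms 2690 15.5 2000    # prints 33690
output_tokens_per_sec 2.72 1000       # prints 2720
```

The first two calls reproduce the Avg Request Latency values of the corresponding min-latency rows (7598 and 33690 msec); the third converts a max-throughput row of 2.72 reqs/sec at OSL 1000 into roughly 2720 output tokens per second.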
Note
Benchmarking results are not available for Llama 3.1 Nemotron Nano VL 8B v1.
| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) | 
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 198 | 7.4 | 7598 | 
| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) | 
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 29.3 | 2.72 | 
| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) | 
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 202 | 8.4 | 8602 | 
| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) | 
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 28.1 | 2.04 | 
| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) | 
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 367 | 13.7 | 14067 | 
| 20000 | 2000 | 1120x1120 | 2690 | 15.5 | 33690 | 
| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) | 
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 55.1 | 1.04 | 
| 20000 | 2000 | 1120x1120 | 41.0 | 0.11 | 
| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) | 
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 531 | 9.7 | 10231 | 
| 20000 | 2000 | 1120x1120 | 1839 | 10.7 | 23239 | 
| 60000 | 2000 | 1120x1120 | 6830 | 12.4 | 31630 | 
| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) | 
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 53.3 | 1.59 | 
| 20000 | 2000 | 1120x1120 | 55.1 | 0.24 | 
| 60000 | 2000 | 1120x1120 | 63.3 | 0.07 | 
| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) | 
|---|---|---|---|---|---|
| 60000 | 2000 | 1120x1120 | 7421 | 12.6 | 32621 | 
| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) | 
|---|---|---|---|---|
| 60000 | 2000 | 1120x1120 | 82.7 | 0.06 | 
| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) | 
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 1142 | 14.9 | 16042 | 
| 20000 | 2000 | 1120x1120 | 4626 | 16.5 | 37626 | 
| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) | 
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 24.3 | 0.32 | 
| 20000 | 2000 | 1120x1120 | 44.7 | 0.08 | 
| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) | 
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 531 | 17.4 | 17931 | 
| 20000 | 2000 | 1120x1120 | 2119 | 18.2 | 38519 | 
| 60000 | 2000 | 1120x1120 | 6347 | 19.0 | 44347 | 
| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) | 
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 48.3 | 1.17 | 
| 60000 | 2000 | 1120x1120 | 48.5 | 0.08 | 
| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) | 
|---|---|---|---|---|---|
| 60000 | 2000 | 1120x1120 | 8404 | 25.2 | 58804 | 
| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) | 
|---|---|---|---|---|
| 60000 | 2000 | 1120x1120 | 32.0 | 0.03 | 
| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) | 
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 1060 | 36.6 | 37660 | 
| 20000 | 2000 | 1120x1120 | 7120 | 38.0 | 83120 | 
| 60000 | 2000 | 1120x1120 | 26692 | 41.1 | 108892 | 
| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) | 
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 77.5 | 0.39 | 
| 20000 | 2000 | 1120x1120 | 76.4 | 0.05 | 
| 60000 | 2000 | 1120x1120 | 75.5 | 0.02 | 
| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) | 
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 1121 | 26.9 | 28021 | 
| 20000 | 2000 | 1120x1120 | 5081 | 27.5 | 60081 | 
| 60000 | 2000 | 1120x1120 | 17640 | 29.4 | 76440 | 
| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) | 
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 94.6 | 0.59 | 
| 20000 | 2000 | 1120x1120 | 102.6 | 0.11 | 
| 60000 | 2000 | 1120x1120 | 114.8 | 0.04 | 
| ISL | OSL | Image Size (width x height in px) | Avg TTFT (msec) | Avg ITL (msec) | Avg Request Latency (msec) | 
|---|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 1102 | 40.4 | 41502 | 
| ISL | OSL | Image Size (width x height in px) | Avg ITL (msec) | Request Throughput (reqs/sec) | 
|---|---|---|---|---|
| 1000 | 1000 | 1120x1120 | 66.7 | 0.12 | 
| ISL | OSL | Image Size (width x height in px) | Avg Request Latency (msec) | 
|---|---|---|---|
| 0 | 200 | 1648x2048 | 494 | 
| 0 | 1000 | 1648x2048 | 1320 | 
| 0 | 2000 | 1648x2048 | 2388 | 
| 0 | 3579 | 1648x2048 | 4104 | 
| ISL | OSL | Image Size (width x height in px) | Request Throughput (reqs/sec) | 
|---|---|---|---|
| 0 | 200 | 1648x2048 | 5.07 | 
| 0 | 1000 | 1648x2048 | 3.75 | 
| 0 | 2000 | 1648x2048 | 2.75 | 
| 0 | 3579 | 1648x2048 | 1.80 | 
| ISL | OSL | Image Size (width x height in px) | Avg Request Latency (msec) | 
|---|---|---|---|
| 0 | 200 | 1648x2048 | 828 | 
| 0 | 1000 | 1648x2048 | 1872 | 
| 0 | 2000 | 1648x2048 | 3403 | 
| 0 | 3579 | 1648x2048 | 5852 | 
| ISL | OSL | Image Size (width x height in px) | Request Throughput (reqs/sec) | 
|---|---|---|---|
| 0 | 200 | 1648x2048 | 1.54 | 
| 0 | 1000 | 1648x2048 | 2.1 | 
| 0 | 2000 | 1648x2048 | 1.6 | 
| 0 | 3579 | 1648x2048 | 1.1 | 
| ISL | OSL | Image Size (width x height in px) | Avg Request Latency (msec) | 
|---|---|---|---|
| 0 | 200 | 1648x2048 | 710 | 
| 0 | 1000 | 1648x2048 | 1822 | 
| 0 | 2000 | 1648x2048 | 3244 | 
| 0 | 3579 | 1648x2048 | 5617 | 
| ISL | OSL | Image Size (width x height in px) | Request Throughput (reqs/sec) | 
|---|---|---|---|
| 0 | 200 | 1648x2048 | 2.6 | 
| 0 | 1000 | 1648x2048 | 1.7 | 
| 0 | 2000 | 1648x2048 | 1.1 | 
| 0 | 3579 | 1648x2048 | 0.7 | 
Not all performance numbers are available above. Additional data will be added to this page over time.