# Performance
Important
A comprehensive guide to NIM VLM performance benchmarking is under development. In the meantime, developers are encouraged to consult the NIM for LLMs Benchmarking Guide.
## Performance Profiling
You can use the `genai-perf` tool from `perf_analyzer` to benchmark the performance of VLMs under simulated production load.
Important
Update the model name according to your requirements. For example, for a `meta/llama-3.2-11b-vision-instruct` model, you can use the following command:
Execute the following command to run a performance benchmark using the `genai-perf` command-line tool. A quick start guide for `genai-perf` can be found here.
First, run the container with `genai-perf` installed:
```bash
export TOKENIZER_PATH=... # this is where tokenizer.json for the model is located
docker run -it --net=host --gpus=all -v ${TOKENIZER_PATH}:/workspace/tokenizer nvcr.io/nvidia/tritonserver:24.10-py3-sdk
```
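Once inside the container, you can sanity-check the setup before profiling. A minimal check, assuming the mount path from the `docker run` command above (`genai-perf` ships preinstalled in the SDK image):

```bash
# Confirm the tokenizer files are visible inside the container
ls /workspace/tokenizer

# Confirm genai-perf is installed and on the PATH
genai-perf --help
```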
Important
We recommend setting `TOKENIZER_PATH` to the model’s checkpoint directory pulled from HuggingFace (e.g., the `meta/llama-3.2-11b-vision-instruct` repository). The directory must include the model’s `config.json` and all files relevant to the tokenizer (i.e., `tokenizer.json`, `tokenizer_config.json`, and `special_tokens_map.json`).
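If you do not already have these files locally, one way to fetch just the tokenizer assets is with `huggingface-cli`. A sketch, assuming you have access to the gated `meta-llama/Llama-3.2-11B-Vision-Instruct` repository (the repository name and local path here are illustrative):

```bash
# Hypothetical local path for the tokenizer files
export TOKENIZER_PATH=/tmp/llama-3.2-11b-vision-tokenizer

# Download only the config and tokenizer files, not the model weights
# (requires `huggingface-cli login` for gated repositories)
huggingface-cli download meta-llama/Llama-3.2-11B-Vision-Instruct \
  config.json tokenizer.json tokenizer_config.json special_tokens_map.json \
  --local-dir ${TOKENIZER_PATH}
```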
Important
The following command has been tested using the `genai-perf` 24.10 release.
```bash
export CONCURRENCY=8 # number of requests sent concurrently
export MEASUREMENT_INTERVAL=300000 # max time window (in milliseconds) for genai-perf to wait for a response
export INPUT_SEQ_LEN=5000 # input sequence length (in number of tokens)
export OUTPUT_SEQ_LEN=5000 # output sequence length (in number of tokens)
export IMAGE_WIDTH=1120 # width of images used in profiling
export IMAGE_HEIGHT=1120 # height of images used in profiling
```
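Before launching the benchmark, it can help to confirm that the NIM endpoint is up and serving. A minimal sketch, assuming the service listens on port 8000 as in the command below (the `/v1/health/ready` route is common for NIM deployments, but verify it for yours):

```bash
# Check that the server reports ready
curl -s http://127.0.0.1:8000/v1/health/ready

# List the models the endpoint exposes
curl -s http://127.0.0.1:8000/v1/models
```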
```bash
genai-perf profile \
  -m meta/llama-3.2-11b-vision-instruct \
  --concurrency ${CONCURRENCY} \
  --tokenizer /workspace/tokenizer \
  --endpoint v1/chat/completions \
  --endpoint-type vision \
  --service-kind openai \
  --streaming -u http://127.0.0.1:8000 \
  --num-prompts 100 \
  --measurement-interval ${MEASUREMENT_INTERVAL} \
  --synthetic-input-tokens-mean ${INPUT_SEQ_LEN} \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean ${OUTPUT_SEQ_LEN} \
  --extra-inputs max_tokens:${OUTPUT_SEQ_LEN} \
  --extra-inputs min_tokens:${OUTPUT_SEQ_LEN} \
  --extra-inputs ignore_eos:true \
  --artifact-dir tmp/ \
  -v \
  --image-width-mean ${IMAGE_WIDTH} \
  --image-width-stddev 0 \
  --image-height-mean ${IMAGE_HEIGHT} \
  --image-height-stddev 0 \
  --image-format png \
  -- --max-threads ${CONCURRENCY}
```
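When the run completes, `genai-perf` prints a summary of latency and throughput metrics to the console and writes its exports under the directory given by `--artifact-dir`. A quick way to locate the output files (the exact export filenames can vary by release, so treat these patterns as an assumption):

```bash
# List everything genai-perf wrote for this run
find tmp/ -type f

# The per-run metrics are typically exported as profile_export CSV/JSON files
find tmp/ -name "*profile_export*"
```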
The full documentation for `genai-perf` can be found here.
## Benchmarking Results
ISL: Input sequence length; OSL: Output sequence length; TTFT: Time to first token; ITL: Inter-token latency
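For a streamed response, these metrics are related: assuming tokens arrive at a steady rate, the end-to-end request latency is approximately `TTFT + (OSL − 1) × ITL`.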
Min Latency numbers are captured at concurrency 1 (single stream). Max Throughput numbers are measured at the concurrency that saturates throughput.
Not all performance numbers are available above. Additional data will be added to this page over time.
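To reproduce both regimes yourself, you can sweep the concurrency level and compare the reported metrics across runs. A minimal sketch reusing the command and environment variables above (the set of concurrency values is an arbitrary example):

```bash
# Sweep concurrency from single-stream to saturation, keeping one artifact dir per run
for CONCURRENCY in 1 2 4 8 16 32; do
  genai-perf profile \
    -m meta/llama-3.2-11b-vision-instruct \
    --concurrency ${CONCURRENCY} \
    --tokenizer /workspace/tokenizer \
    --endpoint v1/chat/completions \
    --endpoint-type vision \
    --service-kind openai \
    --streaming -u http://127.0.0.1:8000 \
    --num-prompts 100 \
    --measurement-interval ${MEASUREMENT_INTERVAL} \
    --synthetic-input-tokens-mean ${INPUT_SEQ_LEN} \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean ${OUTPUT_SEQ_LEN} \
    --extra-inputs max_tokens:${OUTPUT_SEQ_LEN} \
    --extra-inputs min_tokens:${OUTPUT_SEQ_LEN} \
    --extra-inputs ignore_eos:true \
    --artifact-dir tmp/concurrency_${CONCURRENCY}/ \
    --image-width-mean ${IMAGE_WIDTH} --image-width-stddev 0 \
    --image-height-mean ${IMAGE_HEIGHT} --image-height-stddev 0 \
    --image-format png \
    -- --max-threads ${CONCURRENCY}
done
```

The concurrency-1 run approximates the Min Latency setting described above; the highest-concurrency run that no longer improves tokens/second approximates Max Throughput.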