Profile Multi-Modal Language Models with GenAI-Perf#
GenAI-Perf allows you to profile Multi-Modal Language Models (MMLMs) running on an OpenAI Chat Completions API-compatible server by sending multi-modal content to the server (for example, the OpenAI vision and audio inputs).
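For context, a multi-modal request to an OpenAI Chat Completions-compatible endpoint carries text together with image (and, where supported, audio) parts inside the message content. The sketch below shows a minimal text-plus-image payload in the public OpenAI vision format; the exact payload GenAI-Perf constructs may differ in detail.
{
  "model": "<multimodal_model>",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
      ]
    }
  ]
}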
Quickstart 1. Run GenAI-Perf on a Vision Language Model (VLM)#
Start an OpenAI API-compatible server with a VLM using the following command:
docker run --runtime nvidia --gpus all \
-p 8000:8000 --ipc=host \
vllm/vllm-openai:latest \
--model llava-hf/llava-v1.6-mistral-7b-hf --dtype float16
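Once the container is running and the model has finished loading, you can optionally confirm that the server is reachable before profiling. Listing the served models over the standard OpenAI-compatible /v1/models route is a quick check (assuming the port mapping from the command above):
# List the models served by the OpenAI-compatible endpoint
curl http://localhost:8000/v1/models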
Use GenAI-Perf to generate and send text and image request data to the server:
genai-perf profile \
-m llava-hf/llava-v1.6-mistral-7b-hf \
--endpoint-type multimodal \
--image-width-mean 50 \
--image-height-mean 50 \
--synthetic-input-tokens-mean 10 \
--output-tokens-mean 10 \
--streaming
The console output will include a result table like the following:
NVIDIA GenAI-Perf | Multi-Modal Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Time To First Token (ms) │ 205.38 │ 172.31 │ 1,020.58 │ 246.02 │ 207.33 │ 204.96 │
│ Time To Second Token (ms) │ 18.58 │ 18.03 │ 19.22 │ 19.13 │ 18.72 │ 18.65 │
│ Request Latency (ms) │ 369.06 │ 336.56 │ 1,183.65 │ 408.84 │ 370.97 │ 368.50 │
│ Inter Token Latency (ms) │ 16.41 │ 16.27 │ 18.32 │ 18.16 │ 16.47 │ 16.41 │
│ Output Sequence Length (tokens) │ 10.98 │ 10.00 │ 11.00 │ 11.00 │ 11.00 │ 11.00 │
│ Input Sequence Length (tokens) │ 10.06 │ 10.00 │ 11.00 │ 11.00 │ 10.00 │ 10.00 │
│ Output Token Throughput (per sec) │ 29.74 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request Throughput (per sec) │ 2.71 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request Count (count) │ 97.00 │ N/A │ N/A │ N/A │ N/A │ N/A │
└───────────────────────────────────┴────────┴────────┴──────────┴────────┴────────┴────────┘
Quickstart 2. Run GenAI-Perf on a Multi-Modal Language Model (MMLM)#
In this example, we measure the performance of a recent Multi-Modal Language Model (MMLM), Phi-4-multimodal-instruct from Microsoft, hosted on the NVIDIA API, which is OpenAI API compatible.
First, visit https://build.nvidia.com/microsoft/phi-4-multimodal-instruct and create an API key.
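If you want to confirm the key works before profiling, you can send a minimal request with curl. This is only a sketch: it assumes the standard OpenAI Chat Completions path /v1/chat/completions on the NVIDIA endpoint and a plain text prompt.
export NVIDIA_API_KEY=your_api_key

# Minimal text-only request to verify the API key and endpoint
curl https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer ${NVIDIA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"model": "microsoft/phi-4-multimodal-instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'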
Run GenAI-Perf to generate and send all three modalities to the server:
export NVIDIA_API_KEY=your_api_key
genai-perf profile \
-m microsoft/phi-4-multimodal-instruct \
-u https://integrate.api.nvidia.com \
--endpoint-type multimodal \
--synthetic-input-tokens-mean 10 \
--output-tokens-mean 10 \
--image-width-mean 50 \
--image-height-mean 50 \
--audio-length-mean 3 \
--audio-depths 16 32 \
--audio-sample-rates 16 44.1 48 \
--audio-num-channels 2 \
--audio-format wav \
--streaming \
--header "Authorization: Bearer '${NVIDIA_API_KEY}'"
The console output will include a result table like the following:
NVIDIA GenAI-Perf | Multi-Modal Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Time To First Token (ms) │ 284.94 │ 228.38 │ 562.57 │ 371.15 │ 325.76 │ 295.26 │
│ Time To Second Token (ms) │ 10.02 │ 8.79 │ 10.94 │ 10.89 │ 10.59 │ 10.32 │
│ Request Latency (ms) │ 346.80 │ 251.96 │ 662.96 │ 463.03 │ 404.79 │ 382.18 │
│ Inter Token Latency (ms) │ 9.56 │ 0.00 │ 15.76 │ 14.22 │ 11.05 │ 10.81 │
│ Output Sequence Length (tokens) │ 7.43 │ 1.00 │ 14.00 │ 14.00 │ 12.00 │ 10.00 │
│ Input Sequence Length (tokens) │ 10.11 │ 10.00 │ 11.00 │ 11.00 │ 11.00 │ 10.00 │
│ Output Token Throughput (per sec) │ 20.96 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request Throughput (per sec) │ 2.82 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request Count (count) │ 101.00 │ N/A │ N/A │ N/A │ N/A │ N/A │
└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
Generating Multi-Modal Requests with GenAI-Perf#
Currently, you can send multi-modal content with GenAI-Perf using the following two approaches:
The synthetic data generation approach, where GenAI-Perf generates the multi-modal data for you.
The Bring Your Own Data (BYOD) approach, where you provide GenAI-Perf with the data to send.
Approach 1: Synthetic Multi-Modal Data Generation#
GenAI-Perf can generate synthetic data for three modalities (text, image, and audio) using modality-specific parameters provided through the CLI. Check out CLI Input Options for a complete list of parameters you can tweak for each modality. The example below sets the audio parameters, followed by the image and text parameters:
genai-perf profile \
-m <multimodal_model> \
--endpoint-type multimodal \
--audio-length-mean 10 \
--audio-length-stddev 2 \
--audio-depths 16 32 \
--audio-sample-rates 16 44.1 48 \
--audio-num-channels 1 \
--audio-format wav \
--image-width-mean 512 \
--image-width-stddev 30 \
--image-height-mean 512 \
--image-height-stddev 30 \
--image-format png \
--synthetic-input-tokens-mean 100 \
--synthetic-input-tokens-stddev 0 \
--streaming
[!Note] Under the hood, GenAI-Perf generates synthetic images from a small set of source images under the inputs/source_images directory. If you would like to add, remove, or edit the source images, edit the files in that directory directly; GenAI-Perf automatically picks up whatever images are present when generating synthetic images.
Approach 2: Bring Your Own Data (BYOD)#
[!Note] This approach only supports text and image inputs at the moment.
Instead of letting GenAI-Perf create synthetic data, you can provide GenAI-Perf with your own data using the --input-file CLI option. The input file must be in JSONL format, where each line can define both text and image data. The image field can be either a path to a local image file or a URL. GenAI-Perf converts local image file paths to base64-encoded strings and leaves URLs as-is, consistent with the OpenAI Vision Guide. A sample input file input.jsonl would look like this (the first two lines reference local image files and the third uses a URL):
{"text": "What is in this image?", "image": "path/to/image1.png"}
{"text": "What is the color of the dog?", "image": "path/to/image2.jpeg"}
{"text": "What is in this image?", "image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}
After you create the file, you can run GenAI-Perf using the following command:
genai-perf profile \
-m <multimodal_model> \
--endpoint-type multimodal \
--input-file input.jsonl \
--streaming