Profile Multi-Modal Language Models with GenAI-Perf#
GenAI-Perf allows you to profile Multi-Modal Language Models (MMLMs) running on an OpenAI Chat Completions API-compatible server by sending multi-modal content to the server (for example, see OpenAI vision and audio inputs).
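For reference, such a server accepts OpenAI-style chat requests whose message content mixes text with image (and, where supported, audio) parts. A minimal sketch of a text-plus-image request is shown below; the server address, base64 payload, and max_tokens value are placeholders, and GenAI-Perf constructs and sends requests of roughly this shape for you:
# Illustrative request shape only; the base64 image data is a placeholder.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "llava-hf/llava-v1.6-mistral-7b-hf",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What is in this image?"},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_IMAGE>"}}
    ]
  }],
  "max_tokens": 64
}'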
Quickstart 1. Run GenAI-Perf on Vision Language Model (VLM)#
Start an OpenAI API-compatible server with a VLM using the following command:
docker run --runtime nvidia --gpus all \
-p 8000:8000 --ipc=host \
vllm/vllm-openai:latest \
--model llava-hf/llava-v1.6-mistral-7b-hf --dtype float16
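Once the container is up, you can optionally confirm that the server is ready before benchmarking (assuming the default port mapping above):
# Lists the models served by the OpenAI-compatible endpoint.
curl http://localhost:8000/v1/models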
Use GenAI-Perf to generate and send text and image request data to the server:
genai-perf profile \
-m llava-hf/llava-v1.6-mistral-7b-hf \
--endpoint-type multimodal \
--image-width-mean 50 \
--image-height-mean 50 \
--synthetic-input-tokens-mean 10 \
--output-tokens-mean 10 \
--streaming
The console output will contain the following result table:
NVIDIA GenAI-Perf | Multi-Modal Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Time To First Token (ms) │ 205.38 │ 172.31 │ 1,020.58 │ 246.02 │ 207.33 │ 204.96 │
│ Time To Second Token (ms) │ 18.58 │ 18.03 │ 19.22 │ 19.13 │ 18.72 │ 18.65 │
│ Request Latency (ms) │ 369.06 │ 336.56 │ 1,183.65 │ 408.84 │ 370.97 │ 368.50 │
│ Inter Token Latency (ms) │ 16.41 │ 16.27 │ 18.32 │ 18.16 │ 16.47 │ 16.41 │
│ Output Sequence Length (tokens) │ 10.98 │ 10.00 │ 11.00 │ 11.00 │ 11.00 │ 11.00 │
│ Input Sequence Length (tokens) │ 10.06 │ 10.00 │ 11.00 │ 11.00 │ 10.00 │ 10.00 │
│ Output Token Throughput (per sec) │ 29.74 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request Throughput (per sec) │ 2.71 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request Count (count) │ 97.00 │ N/A │ N/A │ N/A │ N/A │ N/A │
└───────────────────────────────────┴────────┴────────┴──────────┴────────┴────────┴────────┘
Quickstart 2. Run GenAI-Perf on Multi-Modal Language Model (MMLM)#
In this example, we will measure the performance of the Multi-Modal Language Model (MMLM)
Phi-4-multimodal-instruct
from Microsoft, hosted on the NVIDIA API, which is OpenAI API-compatible.
First, visit https://build.nvidia.com/microsoft/phi-4-multimodal-instruct and create an API key.
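Optionally, you can sanity-check the key with a minimal text-only request. This is only a sketch: it assumes the key is exported as NVIDIA_API_KEY (as in the next step) and that the hosted endpoint follows the usual OpenAI-compatible /v1/chat/completions path:
# Minimal text-only request to verify the API key works against the hosted endpoint.
curl https://integrate.api.nvidia.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${NVIDIA_API_KEY}" \
-d '{
  "model": "microsoft/phi-4-multimodal-instruct",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 8
}'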
Run GenAI-Perf to generate and send all three modalities to the server:
export NVIDIA_API_KEY=your_api_key
genai-perf profile \
-m microsoft/phi-4-multimodal-instruct \
-u https://integrate.api.nvidia.com \
--endpoint-type multimodal \
--synthetic-input-tokens-mean 10 \
--output-tokens-mean 10 \
--image-width-mean 50 \
--image-height-mean 50 \
--audio-length-mean 3 \
--audio-depths 16 32 \
--audio-sample-rates 16 44.1 48 \
--audio-num-channels 2 \
--audio-format wav \
--streaming \
--header "Authorization: Bearer ${NVIDIA_API_KEY}"
The console output will contain the following result table:
NVIDIA GenAI-Perf | Multi-Modal Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Time To First Token (ms) │ 284.94 │ 228.38 │ 562.57 │ 371.15 │ 325.76 │ 295.26 │
│ Time To Second Token (ms) │ 10.02 │ 8.79 │ 10.94 │ 10.89 │ 10.59 │ 10.32 │
│ Request Latency (ms) │ 346.80 │ 251.96 │ 662.96 │ 463.03 │ 404.79 │ 382.18 │
│ Inter Token Latency (ms) │ 9.56 │ 0.00 │ 15.76 │ 14.22 │ 11.05 │ 10.81 │
│ Output Sequence Length (tokens) │ 7.43 │ 1.00 │ 14.00 │ 14.00 │ 12.00 │ 10.00 │
│ Input Sequence Length (tokens) │ 10.11 │ 10.00 │ 11.00 │ 11.00 │ 11.00 │ 10.00 │
│ Output Token Throughput (per sec) │ 20.96 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request Throughput (per sec) │ 2.82 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request Count (count) │ 101.00 │ N/A │ N/A │ N/A │ N/A │ N/A │
└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
Generating Multi-Modal Requests with GenAI-Perf#
Currently, you can send multi-modal content with GenAI-Perf using the following two approaches:
1. The synthetic data generation approach, where GenAI-Perf generates the multi-modal data for you.
2. The Bring Your Own Data (BYOD) approach, where you provide GenAI-Perf with the data to send.
Approach 1: Synthetic Multi-Modal Data Generation#
GenAI-Perf can generate synthetic data for three modalities (text, image, and audio) using the modality-specific parameters provided by the user through the CLI. Check out CLI Input Options for a complete list of parameters that you can tweak for each modality. The example below sets the audio, image, and text parameters in turn:
genai-perf profile \
-m <multimodal_model> \
--endpoint-type multimodal \
--audio-length-mean 10 \
--audio-length-stddev 2 \
--audio-depths 16 32 \
--audio-sample-rates 16 44.1 48 \
--audio-num-channels 1 \
--audio-format wav \
--image-width-mean 512 \
--image-width-stddev 30 \
--image-height-mean 512 \
--image-height-stddev 30 \
--image-format png \
--synthetic-input-tokens-mean 100 \
--synthetic-input-tokens-stddev 0 \
--streaming
[!Note] Under the hood, GenAI-Perf generates synthetic images from a few source images under the
inputs/source_images
directory. If you would like to add, remove, or edit the source images, you can do so directly in that directory. GenAI-Perf will automatically pick up the images in the directory when generating synthetic images.
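For example, you could add your own seed image as follows. This is a sketch: the exact location of inputs/source_images depends on whether you run GenAI-Perf from a source checkout or an installed package, so the path lookup below is an assumption:
# Resolve the installed genai_perf package directory (assumed layout), then add a custom seed image.
GENAI_PERF_DIR=$(python3 -c "import genai_perf, os; print(os.path.dirname(genai_perf.__file__))")
ls "${GENAI_PERF_DIR}/inputs/source_images"
cp my_photo.png "${GENAI_PERF_DIR}/inputs/source_images/"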
Approach 2: Bring Your Own Data (BYOD)#
[!Note] This approach only supports text and image inputs at the moment.
Instead of letting GenAI-Perf create the synthetic data,
you can provide GenAI-Perf with your own data using the
--input-file
CLI option.
The file needs to be in JSONL format and should contain both the prompt and
the filepath to the image to send.
For instance, an input file might look like the following:
// input.jsonl
{"text": "What is in this image?", "image": "path/to/image1.png"}
{"text": "What is the color of the dog?", "image": "path/to/image2.jpeg"}
{"text": "Describe the scene in the picture.", "image": "path/to/image3.png"}
...
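If your images are already on disk, a small shell loop can generate such a file; the images/ directory and the fixed prompt below are placeholders:
# Build input.jsonl from every PNG under images/, pairing each image with the same prompt.
for img in images/*.png; do
  printf '{"text": "What is in this image?", "image": "%s"}\n' "$img"
done > input.jsonl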
After you create the file, you can run GenAI-Perf using the following command:
genai-perf profile \
-m <multimodal_model> \
--endpoint-type multimodal \
--input-file input.jsonl \
--streaming